Opened 12 years ago
Closed 12 years ago
#38 closed (worksforme)
OpenVZ container of sliver pgeni-gpolab-bbn-com/ecgtest bitrotted into a bad state
Reported by: | chaos@bbn.com | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Experiment | Version: | SPIRAL4 |
Keywords: | Cc: | ||
Dependencies: |
Description
I created the experiment peni-gpolab-bbn-com/ecgtest
on Saturday morning, and used it on Saturday and (possibly) yesterday. Today, https://boss.utah.geniracks.net/showexp.php3?experiment=952 shows the node status of pcvm3-1
as "up (TBFAILED)", and i can't SSH into the VM. I get the error:
capybara,[~],14:23(0)$ sshtmp -p 30010 pc3.utah.geniracks.net The authenticity of host 'pc3.utah.geniracks.net (155.98.34.13)' can't be established. RSA key fingerprint is 6d:1d:76:53:a5:25:99:39:e2:89:ea:b0:99:e3:d3:b9. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'pc3.utah.geniracks.net,155.98.34.13' (RSA) to the list of known hosts. Warning: no access to tty (Bad file descriptor). Thus no job control in this shell.
and then the shell hangs (sshtmp is an alias for "ssh -o UserKnownHostsFile?=/dev/null".)
Determine what happened to the experiment and why it got into a bad state.
Change History (3)
comment:1 Changed 12 years ago by
comment:2 Changed 12 years ago by
We exchanged some e-mail about this last week:
- Leigh:
> For the record, i was copying data to the container from my physical node > via the dataplane interface in a loop, and as far as i know that copy > continued to happen until the container fell over. So, I looked at this and did not come up with anything definitive, so lets try a wild ass guess instead. Is it possible that you filled up the container filesystem? We build each container with a 4GB file system (remember, you want 100 on a node!), which leaves about 3GB of free space. Filling up the filesystem should not be fatal of course, but perhaps openvz is doing something wrong somewhere. The actual contents of the filesystem are gone, but I kinda doubt you did an rm * in the root directory.
- My response:
> Is it possible that you filled up the container filesystem? We build each > container with a 4GB file system (remember, you want 100 on a node!), which > leaves about 3GB of free space. Filling up the filesystem should not be > fatal of course, but perhaps openvz is doing something wrong somewhere. So, it's possible... but i was copying the same file to the same target filename over and over again, and i don't *believe* scp should be doing that particular thing to me. > The actual contents of the filesystem are gone, but I kinda doubt you did > an rm * in the root directory. Indeed, not to my knowledge. Anyway, if it's gone, it's gone. Anything else we can look at, or should i delete the sliver once the AM is back online?
Rob says this is troubling, but they haven't been able to reproduce it. What should we do about this ticket?
- Do a deeper search for logs/tea leaves related to the instance about which i created this ticket (may be impossible at this point)
- Try harder to reproduce the problem
- Close as wontfix/worksforme/some-other-nonfixed-state and just keep the ticket as a data point in case something like this happens again
- Something else
I don't have a strong preference.
comment:3 Changed 12 years ago by
Resolution: | → worksforme |
---|---|
Status: | new → closed |
With Leigh's agreement, i'm closing this as worksforme. The ticket will stay in the system, of course, so we'll have a record of it if this comes up again, or if someone gets an idea about what might have happened.
Hmm. Here's the log on pc3:
and some contents from that log:
So, that looks like 8.5 hours after execution of the container started, it just fell over (perhaps while trying to mount something which i'm not sure why it was trying to mount, since presumably it was already running).
For the record, i was copying data to the container from my physical node via the dataplane interface in a loop, and as far as i know that copy continued to happen until the container fell over.
Anyway, that's what i know, and i'll leave the experiment running.