Opened 12 years ago

Closed 12 years ago

#38 closed (worksforme)

OpenVZ container of sliver pgeni-gpolab-bbn-com/ecgtest bitrotted into a bad state

Reported by: chaos@bbn.com
Owned by: somebody
Priority: major
Milestone:
Component: Experiment
Version: SPIRAL4
Keywords:
Cc:
Dependencies:

Description

I created the experiment pgeni-gpolab-bbn-com/ecgtest on Saturday morning, and used it on Saturday and (possibly) yesterday. Today, https://boss.utah.geniracks.net/showexp.php3?experiment=952 shows the node status of pcvm3-1 as "up (TBFAILED)", and i can't SSH into the VM. I get the error:

capybara,[~],14:23(0)$ sshtmp -p 30010 pc3.utah.geniracks.net
The authenticity of host 'pc3.utah.geniracks.net (155.98.34.13)' can't be established.
RSA key fingerprint is 6d:1d:76:53:a5:25:99:39:e2:89:ea:b0:99:e3:d3:b9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'pc3.utah.geniracks.net,155.98.34.13' (RSA) to the list of known hosts.
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

and then the shell hangs (sshtmp is an alias for "ssh -o UserKnownHostsFile=/dev/null").
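
For reference, with the alias expanded, the failing command is:

ssh -o UserKnownHostsFile=/dev/null -p 30010 pc3.utah.geniracks.net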

Determine what happened to the experiment and why it got into a bad state.

Change History (3)

comment:1 Changed 12 years ago by chaos@bbn.com

Hmm. Here's the log on pc3:

vhost2,[~],12:31(0)$ ls -l /var/emulab/logs/tbvnode-pcvm3-1.log 
-rw-r--r-- 1 root root 13556 May 26 14:04 /var/emulab/logs/tbvnode-pcvm3-1.log

and some contents from that log:

TIMESTAMP: 05:30:19:787564 vnodesetup starting the real work
TIMESTAMP: 05:30:19:804707 vnodesetup calling into libsetup
Checking Testbed generic VM configuration ...
TIMESTAMP: 05:30:19:895419 vnodesetup back from libsetup
TIMESTAMP: 05:30:19:895775 vnodesetup (561359) waiting on vnode (561378)
TIMESTAMP: 05:30:19:895903 vnodesetup (561378) execing mkvnode.pl
Got Linux kernel version numbers 2 6 32
Using Linux netem instead of custom qdiscs.
Using Linux macvlan instead of OpenVZ veths.
libvnode: debug=1
TIMESTAMP: 05/26/2012 05:30:20:70610 starting openvz rootPreConfig()
TIMESTAMP: 05/26/2012 05:30:20:72144 finished openvz rootPreConfig()
...
TIMESTAMP: 05/26/2012 05:30:23:482272 finished openvz vnodeBoot()
TIMESTAMP: 05/26/2012 05:30:23:485576 starting openvz vnodePostConfig()
TIMESTAMP: 05/26/2012 05:30:23:485686 finished openvz vnodePostConfig()
TIMESTAMP: 05/26/2012 05:30:23:485725 starting openvz rootPostConfig()
TIMESTAMP: 05/26/2012 05:30:23:485755 finished openvz rootPostConfig()
VN State:
$VAR1 = {
          'sshd_iprule' => {
                             'ext_ctrlip' => '155.98.34.13',
                             'ctrlip' => '172.17.3.1',
                             'port' => '30010'
                           },
          'private' => {
                         'iplinks' => {
                                        'mv1.1' => 'eth1.260'
                                      }
                       },
          'uuid' => '308aadac-a726-11e1-998d-00009b6224df',
          'vmtype' => 'openvz',
          'vmid' => '1'
        };
Storing state to disk ...
mysystem: 'touch /var/emulab/vms/pcvm3-1/running'
Trying to mount LVM logical device for vnode pcvm3-1: mysystem: 'mount /dev/openvz/pcvm3-1 /mnt/pcvm3-1'
done.
Trying to mount LVM logical device for vnode pcvm3-1: mysystem: 'mount /dev/openvz/pcvm3-1 /mnt/pcvm3-1'
done.
Trying to mount LVM logical device for vnode pcvm3-1: mysystem: 'mount /dev/openvz/pcvm3-1 /mnt/pcvm3-1'
mount: /dev/mapper/openvz-pcvm3--1 already mounted or /mnt/pcvm3-1 busy
*** /usr/local/etc/emulab/mkvnode.pl:
    *** ERROR: vnodeState: *** /usr/local/etc/emulab/mkvnode.pl:
    Command failed: 8192 - mount /dev/openvz/pcvm3-1 /mnt/pcvm3-1


TIMESTAMP: 14:04:03:672630 vnodesetup (561378) exiting with status 512
*** /usr/local/etc/emulab/vnodesetup:
    Jail exited unexpectedly with status 512!

So, it looks like about 8.5 hours after the container started running, it just fell over, apparently while trying to mount something it shouldn't have needed to mount, since presumably that filesystem was already mounted while the container was running.
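
For what it's worth, the proximate error above is the third "Trying to mount LVM logical device" attempt hitting a volume that was already mounted. A minimal sketch of the kind of guard that attempt appears to be missing (illustrative shell using mountpoint(1) from util-linux; the real logic is in mkvnode.pl and may well differ):

# Only attempt the mount if the LV isn't already mounted at the target.
if ! mountpoint -q /mnt/pcvm3-1; then
    mount /dev/openvz/pcvm3-1 /mnt/pcvm3-1
fi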

For the record, i was copying data to the container from my physical node via the dataplane interface in a loop, and as far as i know that copy continued to happen until the container fell over.
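
Roughly, the loop was of this shape (the file name and dataplane address below are placeholders, not the exact values used):

# Repeatedly scp the same file to the same target path over the dataplane.
while true; do
    scp testdata.bin 10.10.1.2:/tmp/testdata.bin
done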

Anyway, that's what i know, and i'll leave the experiment running.

comment:2 Changed 12 years ago by chaos@bbn.com

We exchanged some e-mail about this last week:

  • Leigh:
    > For the record, i was copying data to the container from my physical node
    > via the dataplane interface in a loop, and as far as i know that copy
    > continued to happen until the container fell over.
    
    So, I looked at this and did not come up with anything definitive, so let's
    try a wild-ass guess instead.
    
    Is it possible that you filled up the container filesystem? We build each
    container with a 4GB file system (remember, you want 100 on a node!), which
    leaves about 3GB of free space. Filling up the filesystem should not be
    fatal of course, but perhaps openvz is doing something wrong somewhere.
    
    The actual contents of the filesystem are gone, but I kinda doubt you did
    an rm * in the root directory.
    
  • My response:
    > Is it possible that you filled up the container filesystem? We build each
    > container with a 4GB file system (remember, you want 100 on a node!), which
    > leaves about 3GB of free space. Filling up the filesystem should not be 
    > fatal of course, but perhaps openvz is doing something wrong somewhere.
    
    So, it's possible... but i was copying the same file to the same target  
    filename over and over again, and i don't *believe* scp should be doing    
    that particular thing to me.
    
    > The actual contents of the filesystem are gone, but I kinda doubt you did
    > an rm * in the root directory.
    
    Indeed, not to my knowledge.
    
    Anyway, if it's gone, it's gone.  Anything else we can look at, or should
    i delete the sliver once the AM is back online?
    

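For a live instance, a quick way to sanity-check the free-space numbers above would be something like this on the shared host (a sketch only; the CTID is taken from the vmid in the vnode state and hasn't been verified on the racks):

lvs openvz               # per-vnode logical volume sizes (~4GB each)
df -h /mnt/pcvm3-1       # free space on the mounted container filesystem
vzctl exec 1 df -h /     # the same view from inside the container
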
Rob says this is troubling, but they haven't been able to reproduce it. What should we do about this ticket?

  1. Do a deeper search for logs/tea leaves related to the instance about which i created this ticket (may be impossible at this point)
  2. Try harder to reproduce the problem (e.g. by deliberately filling a container filesystem, as sketched below)
  3. Close as wontfix/worksforme/some-other-nonfixed-state and just keep the ticket as a data point in case something like this happens again
  4. Something else

I don't have a strong preference.
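
On option 2, one rough way to try to reproduce this would be to deliberately fill a fresh container's filesystem and watch whether the vnode falls over the same way (the host, port, and paths below are placeholders for a new experiment):

# From the control host: fill the container's root filesystem, then report usage.
ssh -o UserKnownHostsFile=/dev/null -p 30010 pc3.utah.geniracks.net \
    'dd if=/dev/zero of=/tmp/filler bs=1M; df -h /'
# On the shared host: watch the vnode log for the mount error seen above.
tail -f /var/emulab/logs/tbvnode-pcvm3-1.log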

comment:3 Changed 12 years ago by chaos@bbn.com

Resolution: worksforme
Status: new → closed

With Leigh's agreement, i'm closing this as worksforme. The ticket will stay in the system, of course, so we'll have a record of it if this comes up again, or if someone gets an idea about what might have happened.
