wiki:GENIRacksHome/InstageniRacks/AcceptanceTestStatus/IG-ADM-3

Version 2 (modified by Josh Smift, 11 years ago)


Detailed test plan for IG-ADM-3: Full Rack Reboot Test

This page is GPO's working page for performing IG-ADM-3. It is public for informational purposes, but it is not an official status report. See GENIRacksHome/InstageniRacks/AcceptanceTestStatus for the current status of InstaGENI acceptance tests.

Page format

  • The status chart summarizes the state of this test
  • The high-level description from the test plan contains text copied exactly from the public test plan and acceptance criteria pages.
  • The steps contain things I will actually do or verify:
    • Steps may be composed of related substeps where I find this useful for clarity
    • Each step is either a preparatory step (identified by "(prep)") or a verification step (the default):
      • Preparatory steps are just things we have to do. They're not tests of the rack, but are prerequisites for subsequent verification steps
      • Verification steps are steps in which we will actually look at rack output and make sure it is as expected. They contain a Using: block, which lists the steps to run the verification, and an Expect: block which lists what outcome is expected for the test to pass.

Status of test

See GENIRacksHome/InstageniRacks/AcceptanceTestStatus for the meanings of test states.

Step  State  Date completed  Open Tickets  Closed Tickets/Comments
1A    Pass   2013-03-08
1B    Pass   2013-03-08
2A    Pass   2013-03-08
2B    Pass   2013-03-08


High-level description from test plan

In this test, a full rack reboot is performed as a drill of a procedure which a site administrator may need to perform for site maintenance.

Procedure

  1. Review relevant rack documentation about shutdown options and make a plan for the order in which to shutdown each component.
  2. Cleanly shutdown and/or hard-power-off all devices in the rack, and verify that everything in the rack is powered down.
  3. Power on all devices, bring all logical components back online, and use monitoring and comprehensive health tests to verify that the rack is healthy again.

Criteria to verify as part of this test

  • IV.01. All experimental hosts are configured to boot (rather than stay off pending manual intervention) when they are cleanly shut down and then remotely power-cycled. (C.3.c)
  • V.10. Site administrators can authenticate remotely and power on, power off, or power-cycle, all physical rack devices, including experimental hosts, servers, and network devices. (C.3.c)
  • V.11. Site administrators can authenticate remotely and virtually power on, power off, or power-cycle all virtual rack resources, including server and experimental VMs. (C.3.c)
  • VI.16. A procedure is documented for cleanly shutting down the entire rack in case of a scheduled site outage. (C.3.c)
  • VII.16. A public document explains how to perform comprehensive health checks for a rack (or, if those health checks are being run automatically, how to view the current/recent results). (F.8)

Step 1: Shut down the rack

Step 1A: Test functionality before shutting down the rack

Overview of Step 1A

Check the state of the rack before shutting down:

Make a note of anything that wasn't ok before shutting down: if something was already broken beforehand, its being broken after startup won't indicate a problem with the test.

Results of testing Step 1A on 2013-03-08

I was able to log in to control.instageni.gpolab.bbn.com, and sudo:

[14:12:37] jbs@gpolab:/home/jbs
+$ hostname
gpolab.control-nodes.geniracks.net

[14:12:46] jbs@gpolab:/home/jbs
+$ sudo whoami
root

I browsed to https://www.instageni.gpolab.bbn.com/, logged in, and clicked on the green dot (which turned red) to enter "red dot mode".

I browsed to https://boss.instageni.gpolab.bbn.com/nodecontrol_list.php3?showtype=dl360, and observed that all five nodes are up or free (green dot or white dot in the "Up?" column of the table).

I browsed to http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3, and found no InstaGENI-relevant errors.

I ran getversion and listresources against the BBN rack PG AM:

omni -a https://instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am getversion
omni -a https://instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am listresources

The output was quite long, so I didn't paste it here.

I ran omni getversion and listresources against the BBN rack FOAM AM:

omni -a https://foam.instageni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://foam.instageni.gpolab.bbn.com:3626/foam/gapi/1 listresources

The output was quite long, so I didn't paste it here.

I verified that http://monitor.gpolab.bbn.com/connectivity/campus.html showed that the connection from argos to instageni-vlan1750.bbn.dataplane.geni.net is OK.
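The AM checks above could be scripted so the same pre- and post-reboot health pass is easy to repeat. This is a sketch of my own, not part of the official procedure; the AM URLs are the ones used in this test, and the `AM_CMD` variable is something I've added so the loop can be exercised with a stub in place of omni.

```shell
#!/bin/bash
# Sketch: run getversion and listresources against one AM and report
# pass/fail per call. AM_CMD is an assumed knob (defaults to omni).
AM_CMD=${AM_CMD:-omni}

check_am() {
  local am=$1 call status=0
  for call in getversion listresources; do
    if $AM_CMD -a "$am" "$call" >/dev/null 2>&1; then
      echo "PASS: $call against $am"
    else
      echo "FAIL: $call against $am"
      status=1
    fi
  done
  return $status
}

# The two aggregate managers checked in this test; "|| true" keeps a
# failing check from aborting the rest of the health pass.
check_am "https://instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am" || true
check_am "https://foam.instageni.gpolab.bbn.com:3626/foam/gapi/1" || true
```

Running it before shutdown and again after startup gives a quick diff of which AM calls changed state.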

Step 1B: Shut down all rack components

Overview of Step 1B

Shut everything down:

  • Log in to boss, and from there:
    • Shut down the testbed:
      sudo testbed-control shutdown
      
    • Shut down the nodes:
      for node in pc{1..5} ; do sudo ssh $node shutdown -H now ; done
      
      Note that nodes which are not in use (free) may not be reachable (so the SSH connection may time out).
  • Log out of boss.
  • Log in to control, and from there:
    • Shut down all of the VMs that run there:
      sudo xm shutdown -a -w
      
    • Shut down control itself:
      sudo init 0 && exit
      

Visually verify that everything in the rack is powered off, and physically power off anything that isn't already.
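The node-shutdown loop above can hang on free (unreachable) nodes while the SSH connection times out. Here's a variant sketch with two additions of my own, not from the documented procedure: a connect timeout, and a `DRYRUN` knob (with a hypothetical `run` helper) that prints each command instead of executing it, for illustration.

```shell
#!/bin/bash
# Sketch: shut down pc1..pc5, tolerating unreachable nodes.
# DRYRUN=1 (the default here) prints commands rather than running them.
DRYRUN=${DRYRUN:-1}

run() {
  if [ "$DRYRUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

for node in pc{1..5}; do
  run sudo ssh -o ConnectTimeout=10 "$node" shutdown -H now ||
    echo "warning: $node unreachable (may already be powered off or free)"
done
```

Set `DRYRUN=0` to actually run the commands from boss.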

Results of testing Step 1B on 2013-03-08

On boss, I shut down the testbed:

sudo testbed-control shutdown

I then shut down the nodes:

for node in pc{1..5} ; do sudo ssh $node shutdown -H now ; done

This didn't actually seem to shut down pc1 or pc2, which I was still able to ping and SSH in to, even after five minutes. I logged in interactively to pc1, and it said

System is going down.

Last login: Fri Mar  8 14:25:04 2013 from boss.instageni.gpolab.bbn.com
[root@vhost1 ~]#

I tried running the shutdown command there:

[root@vhost1 ~]# shutdown -H now

Broadcast message from root@vhost1.shared-nodes.emulab-ops.instageni.gpolab.bbn.com on pts/0 (Fri, 08 Mar 2013 14:26:52 -0500):

The system is going down for system halt NOW!

But, the system still didn't shut down. I tried init 0 instead:

[root@vhost1 ~]# sudo init 0

Broadcast message from root@vhost1.shared-nodes.emulab-ops.instageni.gpolab.bbn.com on pts/0 (Fri, 08 Mar 2013 14:27:18 -0500):

The system is going down for power-off NOW!

But the system still didn't shut down.

...ah, but then finally, about six or seven minutes after I'd first run the shutdown command, they did shut down. Ok.

I logged out of boss, and logged in to control, and shut down the VMs that run there:

sudo xm shutdown -a -w

I then shut down control itself:

sudo init 0 && exit

I was out of the office, but remote hands (aka "Peter") visually verified that only the control node and pc1 actually powered off; he powered everything else off.

Step 2: Start up the rack

Step 2A: Start up all rack components

Overview of Step 2A

Turn everything on:

  • Power on the switches
  • Power on the control node
  • Wait for the control node and its VMs to come up
  • Power on the experimenter nodes

Verify that everything in the rack is powered on.
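The "wait for the control node and its VMs to come up" step can be automated with a simple polling loop. This sketch is my own, not from the official docs; `wait_for_host` is a hypothetical helper, and the `PROBE` variable is a knob I've added so the loop can be exercised without network access (it defaults to a single ping).

```shell
#!/bin/bash
# Sketch: poll until a host answers before moving on to the next
# power-on step. PROBE is an assumed knob (defaults to one ping).
PROBE=${PROBE:-ping -c 1 -W 2}

wait_for_host() {
  local host=$1 tries=${2:-60} delay=${3:-5}
  local i
  for ((i = 1; i <= tries; i++)); do
    if $PROBE "$host" >/dev/null 2>&1; then
      echo "$host is up"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out waiting for $host" >&2
  return 1
}

# Example (hypothetical): wait_for_host boss.instageni.gpolab.bbn.com 60 5
```

One could call this for the control node and each VM (boss, ops, foam) in turn before powering on the experiment nodes.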

Results of testing Step 2A on 2013-03-08

Things generally came up ok, except that the ops VM didn't fully come up. Leigh fixed it, and explained that he "removed a bad script from your ops that we fixed right around the time the BBN rack was installed. It might have got caught in the mix. The script is unnecessary on the racks at this time." So that should be ok now.

Step 2B: Test functionality after starting up the rack

Overview of Step 2B

Check the state of the rack after starting up:

Make a note of anything that isn't ok now and *was* ok before shutting down.

Results of testing Step 2B on 2013-03-08

I was able to log in to control.instageni.gpolab.bbn.com, and sudo:

[15:07:23] jbs@gpolab:/home/jbs
+$ hostname
gpolab.control-nodes.geniracks.net

[15:07:24] jbs@gpolab:/home/jbs
+$ sudo whoami
root

I browsed to https://www.instageni.gpolab.bbn.com/, and it said

                  Web Interface Temporarily Unavailable
                          Please Try Again Later
              Testbed going offline; back in a little while

Leigh reports that this was due to an instruction missing from the docs, so I logged in to boss and ran

sudo testbed-control boot

I was then able to browse to https://www.instageni.gpolab.bbn.com/ and log in.

I browsed to https://boss.instageni.gpolab.bbn.com/nodecontrol_list.php3?showtype=dl360, and observed that all five nodes are up or free (green dot or white dot in the "Up?" column of the table).

I browsed to http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3, and found errors related to amcanary trying to getversion and listresources from FOAM and PG on the rack. I scheduled a manual check of those, and it passed.

I ran getversion and listresources against the BBN rack PG AM:

omni -a https://instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am getversion
omni -a https://instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am listresources

The output was quite long, so I didn't paste it here.

I ran omni getversion and listresources against the BBN rack FOAM AM:

omni -a https://foam.instageni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://foam.instageni.gpolab.bbn.com:3626/foam/gapi/1 listresources

The output was quite long, so I didn't paste it here.

I verified that http://monitor.gpolab.bbn.com/connectivity/campus.html showed that the connection from argos to instageni-vlan1750.bbn.dataplane.geni.net is OK.