[[PageOutline]]

= Detailed test plan for EG-ADM-3: Full Rack Reboot Test =

''This page is GPO's working page for performing EG-ADM-3. It is public for informational purposes, but it is not an official status report. See [wiki:GENIRacksHome/ExogeniRacks/AcceptanceTestStatus] for the current status of ExoGENI acceptance tests.''

''Last substantive edit of this page: 2012-08-30''

== Page format ==

 * The status chart summarizes the state of this test
 * The high-level description from the test plan contains text copied exactly from the public test plan and acceptance criteria pages
 * The steps contain things I will actually do and verify:
   * Steps may be composed of related substeps where I find this useful for clarity
   * Each step is either a preparatory step (identified by "(prep)") or a verification step (the default):
     * Preparatory steps are just things we have to do. They are not tests of the rack, but are prerequisites for subsequent verification steps
     * Verification steps are steps in which we will actually look at rack output and make sure it is as expected. They contain a '''Using:''' block, which lists the steps to run the verification, and an '''Expect:''' block, which lists what outcome is expected for the test to pass

== Status of test ==

|| '''Step''' || '''State''' || '''Date completed''' || '''Open Tickets''' || '''Closed Tickets/Comments''' ||
|| 1A || [[Color(green,Pass)]] || 2012-08-30 || || ||
|| 1B || [[Color(green,Pass)]] || 2012-08-30 || || ||
|| 2A || [[Color(green,Pass)]] || 2012-08-30 || || ||
|| 2B || [[Color(green,Pass)]] || 2012-08-30 || || ||
|| 2C || [[Color(green,Pass)]] || 2012-08-30 || || ||

== High-level description from test plan ==

In this test, a full rack reboot is performed as a drill of a procedure which a site administrator may need to perform for site maintenance.

=== Procedure ===

 1. Review relevant rack documentation about shutdown options and make a plan for the order in which to shut down each component.
 2. Cleanly shut down and/or hard-power-off all devices in the rack, and verify that everything in the rack is powered down.
 3. Power on all devices, bring all logical components back online, and use monitoring and comprehensive health tests to verify that the rack is healthy again.

=== Criteria to verify as part of this test ===

 * IV.01. All experimental hosts are configured to boot (rather than stay off pending manual intervention) when they are cleanly shut down and then remotely power-cycled. (C.3.c)
 * V.10. Site administrators can authenticate remotely and power on, power off, or power-cycle all physical rack devices, including experimental hosts, servers, and network devices. (C.3.c)
 * V.11. Site administrators can authenticate remotely and virtually power on, power off, or power-cycle all virtual rack resources, including server and experimental VMs. (C.3.c)
 * VI.16. A procedure is documented for cleanly shutting down the entire rack in case of a scheduled site outage. (C.3.c)
 * VII.16. A public document explains how to perform comprehensive health checks for a rack (or, if those health checks are being run automatically, how to view the current/recent results). (F.8)
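As an aside (not part of the copied test plan text): criterion V.10 is typically exercised through each device's remote management interface. A minimal sketch of what remotely power-cycling a host might look like over IPMI; the BMC hostname and credentials below are hypothetical placeholders, not this rack's actual values:

{{{
# Hypothetical example only; bmc.example.net, admin, and PASS are
# placeholders for a real management address and credentials.
ipmitool -I lanplus -H bmc.example.net -U admin -P PASS chassis power status
ipmitool -I lanplus -H bmc.example.net -U admin -P PASS chassis power cycle
}}}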
== Step 1: shut down the rack ==

=== Step 1A: baseline before shutting down the rack ===

==== Overview of Step 1A ====

Run all checks before performing the reboot test, so that anything which is already down is not mistaken for a failure caused by the reboot. (A scripted version of the command-line checks appears after this list.)

 * Attempt to SSH to bbn-hn.exogeni.gpolab.bbn.com:
{{{
ssh bbn-hn.exogeni.gpolab.bbn.com
}}}
 * If SSH is successful, attempt to sudo on bbn-hn:
{{{
sudo whoami
}}}
 * Browse to [https://bbn-hn.exogeni.net/rack_bbn/] and attempt to log in to nagios
   * If successful, enumerate any errors currently outstanding in rack nagios
 * Browse to [http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3] and ensure that GPO nagios is available
   * If successful, enumerate any exogeni-relevant errors currently outstanding in GPO nagios
 * Run omni getversion and listresources against the BBN rack ORCA AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc getversion -o
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc listresources -o
}}}
 * Run omni getversion and listresources against the BBN rack FOAM AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 listresources
}}}
 * Verify that [http://monitor.gpolab.bbn.com/connectivity/campus.html] currently shows a successful connection from argos to exogeni-vlan1750.bbn.dataplane.geni.net
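The command-line portions of these checks can be bundled into one script so that the identical baseline runs before and after the reboot. A minimal sketch, assuming omni is on the PATH with a working configuration; the nagios and connectivity-graph checks remain manual browser steps:

{{{
#!/bin/bash
# Sketch of the scriptable baseline checks. Assumes omni is on the PATH
# and configured with valid GENI credentials; nagios and the campus
# connectivity graph are still checked by hand in a browser.

# SSH and sudo checks on the head node
ssh bbn-hn.exogeni.gpolab.bbn.com sudo whoami

# ORCA AM checks
ORCA=https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc
omni -a $ORCA getversion -o
omni -a $ORCA listresources -o

# FOAM AM checks
FOAM=https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1
omni -a $FOAM getversion
omni -a $FOAM listresources
}}}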
==== Results of testing step 1A on 2012-08-30 ====

 * SSH and sudo were successful
 * I could log in to nagios. Outstanding errors:
   * critical:
     * Updates needed on bbn-w1, bbn-w2, bbn-w3
     * 8052.bbn.xo reports that interfaces 31-36,45-46 are down
     * `proc_ImageProxy` is critical on bbn-hn. This is unexpected, and I sent mail about it in case anyone wants to investigate
   * warning:
     * Updates needed on bbn-hn
     * 8052.bbn.xo reports that interfaces 10-15,41-44 are the "wrong speed" (running at 1Gbps, expected 100Mbps)
 * GPO nagios shows that ORCA getversion and listresources have been failing for about 18 hours, and that the ping to the BBN rack is failing
 * omni getversion does not succeed against ORCA:
{{{
jericho,[~],09:28(2)$ omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc getversion -o
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
ERROR:omni.protogeni:Call for GetVersion at https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc failed.: Unknown socket error: [Errno 111] Connection refused
ERROR:omni.protogeni: ..... Run with --debug for more information
WARNING:omni:URN: unspecified_AM_URN (url:https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc) call failed: [Errno 111] Connection refused
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed getversion:

Options as run:
        aggregate: https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc
        configfile: /home/chaos/omni/omni_pgeni
        framework: pg
        native: True
        output: True

Args: getversion

Result Summary: Cannot GetVersion at https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc: [Errno 111] Connection refused
Got version for 0 out of 1 aggregates

INFO:omni: ============================================================
}}}
 * omni getversion does succeed against FOAM:
{{{
jericho,[~],09:29(0)$ omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 getversion
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
INFO:omni:AM URN: unspecified_AM_URN (url: https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1) has version:
INFO:omni:{ 'ad_rspec_versions': [ { 'extensions': [ 'http://www.geni.net/resources/rspec/ext/openflow/3'],
      'namespace': 'http://www.geni.net/resources/rspec/3',
      'schema': 'http://www.geni.net/resources/rspec/3/ad.xsd',
      'type': 'GENI',
      'version': '3'}],
  'foam_version': '0.8.2',
  'geni_api': 1,
  'request_rspec_versions': [ { 'extensions': [ 'http://www.geni.net/resources/rspec/ext/openflow/3',
      'http://www.geni.net/resources/rspec/ext/openflow/4',
      'http://www.geni.net/resources/rspec/ext/flowvisor/1'],
      'namespace': 'http://www.geni.net/resources/rspec/3',
      'schema': 'http://www.geni.net/resources/rspec/3/request.xsd',
      'type': 'GENI',
      'version': '3'}],
  'site_info': { }}
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed getversion:

Options as run:
        aggregate: https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1
        configfile: /home/chaos/omni/omni_pgeni
        framework: pg
        native: True

Args: getversion

Result Summary: Got version for 1 out of 1 aggregates

INFO:omni: ============================================================
}}}
 * The campus connectivity graph shows that the BBN ExoGENI ping has been down for about 18 hours

=== Step 1B: shutdown all rack components ===

==== Overview of Step 1B ====

Shut down all rack components in order. (A scripted sketch of the per-node shutdowns follows this list; note that the results below show the workers must actually be shut down before bbn-hn, which provides LDAP.)

 * Log in to bbn-hn from the console, and shut it down:
{{{
sudo init 0
}}}
 Wait for successful shutdown and poweroff.
 * Log in to each worker node (bbn-w1 - bbn-w10) from their consoles, and shut each down:
{{{
sudo init 0
}}}
 Wait for successful shutdown and poweroff.
 * Power off the iSCSI storage array
 * Power off bbn-8264
 * Power off bbn-8052
 * Power off bbn-ssg

Verify that everything in the rack is powered off.
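For convenience, the per-node clean shutdowns could be driven in a loop over SSH rather than from each console. A minimal sketch, assuming SSH logins still work; as the results below show, the workers must go down before bbn-hn, since bbn-hn hosts the LDAP server used for logins:

{{{
#!/bin/bash
# Sketch: cleanly shut down the workers first, then the head node.
# Assumes SSH access, which requires bbn-hn (the LDAP server) to
# stay up until all workers are down.
for n in {1..10}; do
    ssh bbn-w${n} sudo init 0
done
# After all workers report powered off, shut down the head node:
ssh bbn-hn.exogeni.gpolab.bbn.com sudo init 0
}}}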
==== Results of testing step 1B on 2012-08-30 ====

 * Logging into bbn-hn and shutting it down was successful, and the KVM thinks it is off now.
 * However, I then could not log in to bbn-w1 as myself, because the LDAP server (on bbn-hn) was down. I confirmed with others that it is fine to shut down the workers first, as one normally would, so I did that:
   * Boot up bbn-hn again
   * Log in to each worker console and shut down the workers
     * Incidentally, I noticed that the console keyboard is a bit sticky; I sometimes get duplicated characters
   * bbn-w9 and bbn-w10 were already down, so I did not do anything to them
   * Log in to the bbn-hn console and shut it down
 * Powered off the iSCSI storage array using the two switches on the back
 * Powered off the dataplane switch, then the control switch, by pulling their power cables
 * Powered off the SSG by unplugging its power

== Step 2: power the rack back on ==

=== Step 2A: power on all rack components ===

==== Overview of Step 2A ====

 * Power on bbn-ssg
 * Power on bbn-8052
 * Power on bbn-8264
 * Power on the iSCSI storage array
 * Power on bbn-hn and wait until it has booted to a login prompt
 * Power on each worker (bbn-w1 - bbn-w10), and wait until all workers have booted to login prompts

Verify that everything in the rack is powered on.

==== Results of testing step 2A on 2012-08-30 ====

 * Power on bbn-ssg
   * Wait a couple of seconds for the status light to turn green
 * Power on bbn-8052
   * The startup fans are very loud; wait 20 seconds or so for them to quiet down
 * Power on bbn-8264
   * Again, the startup fans are very loud; wait 20 seconds or so for them to quiet down
 * Power on the iSCSI storage array
   * Again, wait for the fans to get loud, then quiet again
 * Power on the head node and wait for it to boot to a login prompt
 * Power on the eight workers, bbn-w1 - bbn-w8 (bbn-w9 and bbn-w10 were already down before the test), and wait until this succeeds:
{{{
bbn-hn,[~],14:48(0)$ for n in {1..8}; do ssh bbn-w${n} hostname; done
}}}
 * Victor was on IRC, and reported that there is an OpenStack way to check whether OpenStack is happy by running VMs:
{{{
source /opt/orca-12080/ec2/novarc
euca-run-instances $foo $bar $baz
}}}
 (He did not suggest options. I would like to learn to do this at some point, but for now, Victor offered to try it out and report back to me. He then reported back that everything was fine. A hypothetical sketch of such an invocation follows this list.)
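For reference, a euca2ools invocation for that OpenStack check would look roughly like the sketch below; the image ID, keypair name, instance type, and instance ID are hypothetical placeholders, since no actual options were suggested:

{{{
# Load the OpenStack EC2 credentials used by ORCA on this rack:
source /opt/orca-12080/ec2/novarc
# Launch a test VM; emi-XXXXXXXX, mykeypair, and m1.small are
# hypothetical placeholders for a real image, keypair, and type:
euca-run-instances emi-XXXXXXXX -k mykeypair -t m1.small
# Watch for the instance to reach "running", then clean up
# (i-XXXXXXXX is a placeholder for the launched instance's ID):
euca-describe-instances
euca-terminate-instances i-XXXXXXXX
}}}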
=== Step 2B: RENCI performs manual ORCA startup steps ===

==== Overview of Step 2B ====

 * Notify exogeni-design that the rack is online and ready for ORCA to be brought online
 * Wait for a response that the rack is now healthy

==== Results of testing step 2B on 2012-08-30 ====

 * Jonathan sent an all-clear at 11:27, which I did not see until after noon because of an e-mail mixup on my part.

=== Step 2C: test functionality after bringing up the rack ===

==== Overview of Step 2C ====

Run all checks again and report on any discrepancies:

 * Attempt to SSH to bbn-hn.exogeni.gpolab.bbn.com:
{{{
ssh bbn-hn.exogeni.gpolab.bbn.com
}}}
 * If SSH is successful, attempt to sudo on bbn-hn:
{{{
sudo whoami
}}}
 * Browse to [https://bbn-hn.exogeni.net/rack_bbn/] and attempt to log in to nagios
   * If successful, enumerate any errors currently outstanding in rack nagios
 * Browse to [http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3] and ensure that GPO nagios is available
   * If successful, enumerate any exogeni-relevant errors currently outstanding in GPO nagios
 * Run omni getversion and listresources against the BBN rack ORCA AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc listresources
}}}
 * Run omni getversion and listresources against the BBN rack FOAM AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 listresources
}}}
 * Verify that [http://monitor.gpolab.bbn.com/connectivity/campus.html] currently shows a successful connection from argos to exogeni-vlan1750.bbn.dataplane.geni.net

==== Results of testing step 2C on 2012-08-30 ====

 * I can log in to bbn-hn and sudo
 * I can log in to rack nagios. Outstanding errors:
   * critical:
     * 8052.bbn.xo reports that interfaces 31-36,45-46 are down
   * warning:
     * Updates needed on bbn-hn
     * 8052.bbn.xo reports that interfaces 10-15,41-44 are the "wrong speed" (running at 1Gbps, expected 100Mbps)
 * I can look at GPO nagios:
   * The dataplane ping in the tuptymon slice is still down, which is not surprising (a manual spot-check sketch follows below)
 * ORCA omni getversion and listresources are successful
 * FOAM omni getversion and listresources are successful
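The campus connectivity check can also be spot-checked by hand rather than waiting for the graph to update; a minimal sketch, assuming a login on argos (the monitoring host named above):

{{{
# From argos, verify the dataplane path that the campus connectivity
# graph tracks:
ping -c 5 exogeni-vlan1750.bbn.dataplane.geni.net
}}}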