[[PageOutline]]

= Detailed test plan for EG-ADM-3: Full Rack Reboot Test =

''This page is GPO's working page for performing EG-ADM-3. It is public for informational purposes, but it is not an official status report. See [wiki:GENIRacksHome/ExogeniRacks/AcceptanceTestStatus] for the current status of ExoGENI acceptance tests.''

''Last substantive edit of this page: 2012-08-29''

== Page format ==

 * The status chart summarizes the state of this test.
 * The high-level description from the test plan contains text copied exactly from the public test plan and acceptance criteria pages.
 * The steps contain things I will actually do or verify:
   * Steps may be composed of related substeps where I find this useful for clarity.
   * Each step is either a preparatory step (identified by "(prep)") or a verification step (the default):
     * Preparatory steps are just things we have to do. They are not tests of the rack, but they are prerequisites for subsequent verification steps.
     * Verification steps are steps in which we actually look at rack output and make sure it is as expected. They contain a '''Using:''' block, which lists the steps to run for the verification, and an '''Expect:''' block, which lists the outcome expected for the test to pass.

== Status of test ==

|| '''Step''' || '''State''' || '''Date completed''' || '''Open Tickets''' || '''Closed Tickets/Comments''' ||
|| 1A || || || || ||

== High-level description from test plan ==

In this test, a full rack reboot is performed as a drill of a procedure which a site administrator may need to perform for site maintenance.

=== Procedure ===

 1. Review relevant rack documentation about shutdown options and make a plan for the order in which to shut down each component.
 2. Cleanly shut down and/or hard-power-off all devices in the rack, and verify that everything in the rack is powered down.
 3. Power on all devices, bring all logical components back online, and use monitoring and comprehensive health tests to verify that the rack is healthy again.

=== Criteria to verify as part of this test ===

 * IV.01. All experimental hosts are configured to boot (rather than stay off pending manual intervention) when they are cleanly shut down and then remotely power-cycled. (C.3.c)
 * V.10. Site administrators can authenticate remotely and power on, power off, or power-cycle all physical rack devices, including experimental hosts, servers, and network devices. (C.3.c)
 * V.11. Site administrators can authenticate remotely and virtually power on, power off, or power-cycle all virtual rack resources, including server and experimental VMs. (C.3.c)
 * VI.16. A procedure is documented for cleanly shutting down the entire rack in case of a scheduled site outage. (C.3.c)
 * VII.16. A public document explains how to perform comprehensive health checks for a rack (or, if those health checks are being run automatically, how to view the current/recent results). (F.8)

== Step 1: shut down the rack ==

=== Step 1A: baseline before shutting down the rack ===

==== Overview of Step 1A ====

Run all checks before performing the reboot test, so that if something is already down, we won't be confused:
 * Attempt to SSH to bbn-hn.exogeni.gpolab.bbn.com:
{{{
ssh bbn-hn.exogeni.gpolab.bbn.com
}}}
 * If SSH is successful, attempt to sudo on bbn-hn:
{{{
sudo -v
}}}
 * Browse to [https://bbn-hn.exogeni.net/rack_bbn/] and attempt to log in to nagios
   * If successful, enumerate any errors currently outstanding in rack nagios
 * Browse to [http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3] and ensure that GPO nagios is available
   * If successful, enumerate any exogeni-relevant errors currently outstanding in GPO nagios
 * Run omni getversion and listresources against the BBN rack ORCA AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc listresources
}}}
 * Run omni getversion and listresources against the BBN rack FOAM AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 listresources
}}}
 * Verify that [http://monitor.gpolab.bbn.com/connectivity/campus.html] currently shows a successful connection from argos to exogeni-vlan1750.bbn.dataplane.geni.net
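The scriptable parts of this baseline can be collected into one shell sketch. The hostnames and AM URLs are the ones used above; the nagios and connectivity checks remain manual browser steps, and the script only prints the command plan rather than executing it:

```shell
#!/bin/sh
# Sketch of the scriptable Step 1A baseline checks. Prints the command
# plan; pipe each line to sh (or drop the echoes) to actually run them.
HN=bbn-hn.exogeni.gpolab.bbn.com
ORCA_AM="https://$HN:11443/orca/xmlrpc"
FOAM_AM="https://$HN:3626/foam/gapi/1"

baseline_commands() {
    # SSH reachability and sudo check on the head node
    echo "ssh $HN sudo -v"
    # getversion and listresources against both aggregate managers
    for am in "$ORCA_AM" "$FOAM_AM"; do
        echo "omni -a $am getversion"
        echo "omni -a $am listresources"
    done
}

baseline_commands
```

Capturing this output to a file before the reboot also gives a record to compare against when the same checks are rerun in Step 2C.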

=== Step 1B: shut down all rack components ===

==== Overview of Step 1B ====

 * Log in to bbn-hn from the console, and shut it down:
{{{
sudo init 0
}}}
 Wait for successful shutdown and poweroff.
 * Log in to each worker node (bbn-w1 through bbn-w10) from its console, and shut each down:
{{{
sudo init 0
}}}
 Wait for successful shutdown and poweroff.
 * Power off the iSCSI
 * Power off bbn-8264
 * Power off bbn-ssg
 * Power off bbn-8052

Verify that everything in the rack is powered off.
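The shutdown order above can be printed as a checklist so nothing is skipped during the drill. A minimal sketch; host and device names are the ones used on this page, and the console logins and manual power-offs themselves are not automated:

```shell
#!/bin/sh
# Prints the Step 1B shutdown checklist in order: the head node, then
# the ten workers (all halted from their consoles), then the devices
# that are powered off manually.
shutdown_plan() {
    echo "bbn-hn console: sudo init 0, wait for poweroff"
    for i in $(seq 1 10); do
        echo "bbn-w$i console: sudo init 0, wait for poweroff"
    done
    for dev in iSCSI bbn-8264 bbn-ssg bbn-8052; do
        echo "$dev: power off manually"
    done
}

shutdown_plan
```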

== Step 2: power the rack back on ==

=== Step 2A: power on all rack components ===

==== Overview of Step 2A ====

 * Power on bbn-8052
 * Power on bbn-ssg
 * Power on bbn-8264
 * Power on the iSCSI
 * Power on each worker (bbn-w1 through bbn-w10), and wait until all workers have booted to login prompts
 * Power on bbn-hn and wait until it has booted to a login prompt

Verify that everything in the rack is powered on.
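If the hosts' management controllers are reachable over IPMI, the host power-ons can be issued remotely (exercising criterion V.10) rather than from the front panels. A minimal sketch that only prints the commands; the BMC hostname pattern (node name plus "-bmc"), the admin username, and the use of ipmitool are assumptions for illustration, not details from this page:

```shell
#!/bin/sh
# Prints remote power-on commands for the Step 2A hosts: workers first,
# then the head node, matching the ordering above. BMC hostnames and
# the IPMI username are assumptions; adjust for the real rack.
IPMI_USER=${IPMI_USER:-admin}

poweron_cmd() {
    # $1: node name; emits the ipmitool invocation for that node's BMC
    echo "ipmitool -H $1-bmc -U $IPMI_USER -I lanplus chassis power on"
}

for i in $(seq 1 10); do
    poweron_cmd "bbn-w$i"
done
poweron_cmd bbn-hn
```

The switches and the iSCSI array are powered on manually first, per the ordering above, before any of these host commands are issued.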

=== Step 2B: RENCI performs manual ORCA startup steps ===

==== Overview of Step 2B ====

 * Notify exogeni-design that the rack is online and ready for ORCA to be brought online.
 * Wait for a response that the rack is now healthy.

=== Step 2C: test functionality after bringing up the rack ===

==== Overview of Step 2C ====

Run all checks again and report on any discrepancies:
 * Attempt to SSH to bbn-hn.exogeni.gpolab.bbn.com:
{{{
ssh bbn-hn.exogeni.gpolab.bbn.com
}}}
 * If SSH is successful, attempt to sudo on bbn-hn:
{{{
sudo -v
}}}
 * Browse to [https://bbn-hn.exogeni.net/rack_bbn/] and attempt to log in to nagios
   * If successful, enumerate any errors currently outstanding in rack nagios
 * Browse to [http://monitor.gpolab.bbn.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=detail&sorttype=2&sortoption=3] and ensure that GPO nagios is available
   * If successful, enumerate any exogeni-relevant errors currently outstanding in GPO nagios
 * Run omni getversion and listresources against the BBN rack ORCA AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc listresources
}}}
 * Run omni getversion and listresources against the BBN rack FOAM AM:
{{{
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 getversion
omni -a https://bbn-hn.exogeni.gpolab.bbn.com:3626/foam/gapi/1 listresources
}}}
 * Verify that [http://monitor.gpolab.bbn.com/connectivity/campus.html] currently shows a successful connection from argos to exogeni-vlan1750.bbn.dataplane.geni.net
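Reporting discrepancies is easier if each check's output was captured to a file during Step 1A and captured again here; the baseline/after directory layout below is an assumption for illustration, not from this page. A minimal comparison sketch with a self-contained demo:

```shell
#!/bin/sh
# Compares saved post-reboot check output against the Step 1A baseline,
# printing one status line per check. The directory layout (one .txt
# file per check, same names in both directories) is an assumption.
report_discrepancies() {
    # $1: baseline directory, $2: post-reboot directory
    for f in "$1"/*.txt; do
        name=$(basename "$f")
        if cmp -s "$f" "$2/$name"; then
            echo "OK: $name"
        else
            echo "DIFFERS: $name"
        fi
    done
}

# Self-contained demo with fabricated check output
work=$(mktemp -d)
mkdir "$work/baseline" "$work/after"
echo "GENI AM API ok" | tee "$work/baseline/getversion.txt" > "$work/after/getversion.txt"
echo "10 VMs free" > "$work/baseline/listresources.txt"
echo "9 VMs free"  > "$work/after/listresources.txt"
report_discrepancies "$work/baseline" "$work/after"
rm -rf "$work"
```

The demo prints "OK: getversion.txt" followed by "DIFFERS: listresources.txt", showing the matched and mismatched checks respectively.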