[[PageOutline]]

= Detailed test plan for IG-ADM-4: Emergency Stop Test =

''This page is GPO's working page for performing IG-ADM-4. It is public for informational purposes, but it is not an official status report. See [wiki:GENIRacksHome/InstageniRacks/AcceptanceTestStatus] for the current status of InstaGENI acceptance tests.''

''Last substantive edit of this page: 2013-02-26''

= Page format =

 * The status chart summarizes the state of this test
 * The high-level description from test plan contains text copied exactly from the public test plan and acceptance criteria pages.
 * The steps contain things I will actually do/verify:
   * Steps may be composed of related substeps where I find this useful for clarity
   * Each step is identified as either "(prep)" or "(verify)":
     * Prep steps are just things we have to do. They're not tests of the rack, but are prerequisites for subsequent verification steps
     * Verify steps are steps in which we will actually look at rack output and make sure it is as expected. They contain a '''Using:''' block, which lists the steps to run the verification, and a '''Verify:''' block, which lists what outcome is expected for the test to pass.

= Status of test =

See [wiki:GENIRacksHome/InstageniRacks/AcceptanceTestStatus#Legend] for the meanings of test states.

|| '''Step''' || '''State''' || '''Date completed''' || '''Open Tickets''' || '''Closed Tickets/Comments''' ||
|| 1 || [[Color(green,Pass)]] || 2013-03-01 || || tupty has reviewed GMOC doc and EG doc, GPO doc has been posted ||
|| 2 || [[Color(green,Pass)]] || 2013-03-06 || || ||
|| 3 || [[Color(green,Pass)]] || 2013-03-07 || || ||
|| 4 || [[Color(green,Pass)]] || 2013-03-07 || || ||
|| 5A || [[Color(green,Pass)]] || 2013-03-07 || || ||
|| 5B || [[Color(green,Pass)]] || 2013-03-07 || || ||
|| 5C || [[Color(green,Pass)]] || 2013-03-07 || || ||
|| 6 || [[Color(green,Pass)]] || 2013-03-07 || || ||

= High-level description from test plan =

In this test, an ES (Emergency Stop) drill is performed on a sliver in the rack.

== Procedure ==

 * A site administrator reviews the local site ES procedure, GMOC ES procedure, and sliver shut down procedure, and verifies that these documents combined fully document the campus side of the ES procedure.
 * A second administrator (or the GPO) submits an ES request to GMOC, referencing activity from a public IP address assigned to a compute sliver in the rack that is part of the test experiment.
 * GMOC and the first site administrator perform an ES drill in which the site administrator successfully shuts down the sliver in coordination with GMOC.
 * GMOC completes the ES workflow, including updating/closing GMOC tickets.

== Criteria to verify as part of this test ==

 * VI.07. A public document explains the requirements that site administrators have to the GENI community, including how to join required mailing lists, how to keep their support contact information up-to-date, how and under what circumstances to work with Legal, Law Enforcement and Regulatory(LLR) Plan, how to best contact the rack vendor with operational problems, what information needs to be provided to GMOC to support emergency stop, and how to interact with GMOC when an Emergency Stop request is received. (F.3, C.3.d)
 * VI.17.
   A procedure is documented for performing a shut down operation on any type of sliver on the rack, in support of an Emergency Stop request. (C.3.d)
 * VII.18. Given a public IP address and port, an exclusive VLAN, a sliver name, or a piece of user-identifying information such as e-mail address or username, a site administrator or GMOC operator can identify the email address, username, and affiliation of the experimenter who controlled that resource at a particular time. (D.7)
 * VII.19. GMOC and a site administrator can perform a successful Emergency Stop drill in which slivers containing compute and OpenFlow-controlled network resources are shut down. (C.3.d)

= Step 1 (prep): Site administrator reviews local site ES procedure, GMOC ES procedure, and InstaGENI sliver shut down procedure =

The site administrator should review the local site ES procedure, the ES procedure provided by the GMOC, and the InstaGENI sliver shut down procedure. All of these procedures should make sense together, and the site administrator should follow the local site ES procedure for the test. The site administrator should identify parts of the local procedure where they need to take action on the aggregate, and they should reference the InstaGENI sliver shut down procedure for that part of the test. He or she should also identify where the local site procedure requires interfacing with the GMOC. The parts identified by the site administrator should be verified with the GMOC and with the InstaGENI team.

== Results of testing step 1: 2013-03-01 ==

The documents have been collected, and they are at the following locations:
 * Local site ES procedure: http://groups.geni.net/geni/wiki/EmergencyStop#ExampleProcedureforSiteOperatorResponsetoEmergencyStopRequest
 * GMOC ES procedure: http://gmoc.grnoc.iu.edu/gmoc/index/documents/geni-proposals/gmoc-noc-support-spiral-4-emergency-stop-workflow.html
 * InstaGENI sliver shut down procedure: http://www.protogeni.net/wiki/instageni/emergencyshutdown

= Step 2 (prep): GPO, GMOC, and InstaGENI team coordinate a time to run an ES test =

The GPO will coordinate with parties at the GMOC and on the InstaGENI team to identify when an ES test can be run. This test will focus primarily on the interactions with the site administrator(s) and performing the procedures documented by the rack team. The following roles will need to be defined for this test:
 * '''GMOC Coordinator''': person from the GMOC who coordinates the ES activity on the GMOC's side
 * '''InstaGENI Contact''': person from the InstaGENI team who can be around if there are questions about the document or sliver shut down procedure
 * '''ES Initiator''': GPO person who initiates an Emergency Stop request
 * '''Experimenter''': GPO person who has created a sliver
 * '''Site Administrator''': GPO person who is acting as the site administrator of the GPO InstaGENI rack

== Results of testing step 2: 2013-02-28 ==

Date of test: 2013-03-07

|| '''Role''' || '''Person''' ||
|| GMOC Coordinator || Eldar ||
|| InstaGENI Contact || geni-dev-utah@flux.utah.edu, Jon Deurig will be around ||
|| ES Initiator || Chaos ||
|| Experimenter || Josh ||
|| Site Administrator || Tim ||

= Step 3 (prep): Experimenter sets up a slice =

The experimenter will set up a slice that includes a sliver on the GPO InstaGENI rack. The sliver should be a VM that is attached to the shared mesoscale VLAN, and it should be sending traffic that is visible through monitoring.

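The traffic needs to be visible in monitoring so that its rate can be compared before and after the sliver is shut down. A minimal sketch of that comparison in Python follows. It assumes the bytes/sec samples have been read off the monitoring graph by hand; the function name `traffic_dropped`, the sample values, and the 10% threshold are all hypothetical and are not measurements or criteria from this test.

```python
from statistics import mean


def traffic_dropped(before, after, max_ratio=0.1):
    """Return True if the mean rate after shutdown is at most
    max_ratio times the mean rate before shutdown.

    before/after are lists of bytes/sec samples (hypothetical input format).
    """
    if not before or not after:
        raise ValueError("need at least one sample on each side")
    baseline = mean(before)
    if baseline <= 0:
        raise ValueError("no traffic in the 'before' window")
    return mean(after) <= max_ratio * baseline


# Hypothetical samples, NOT values taken from the poblano gi0/16 graph.
before_samples = [1_250_000, 1_300_000, 1_280_000]
after_samples = [4_000, 3_500, 5_200]
print(traffic_dropped(before_samples, after_samples))
```

A ratio against the pre-shutdown baseline is used rather than an absolute cutoff because the monitored port may also carry traffic from other slices.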
== Results of testing step 3: 2013-03-07 ==

Josh is running UDP iperfs in his jbs15 and jbs16 slices from the BBN InstaGENI rack to the BBN ExoGENI rack. This traffic should be visible on [http://monitor.gpolab.bbn.com/ganglia/graph.php?&c=BBN%20External&h=poblano.gpolab.bbn.com&v=91.8157894737&m=RX_BYTES_gi0_16&r=hour&z=medium&jr=&js=&vl=bytes%2Fsec&z=large the graph of poblano gi0/16's last hour of RX bytes/sec]. We will want to look at the amount of traffic before and after we shut down the sliver. Once the shutdown is complete, I will capture today's graph with a timestamp.

= Step 4 (prep): ES initiated =

 * The ES Initiator contacts the GMOC Coordinator to initiate an ES request describing the slice URN.
 * The GMOC quickly walks through their procedure, skipping more formal steps as needed, in order to contact the aggregate operator's primary contact.
 * The GMOC does not need to verify the identity of the ES Initiator for the purposes of this test, and therefore should not contact the ES Initiator.
 * The GMOC does not need to contact the Experimenter for the purposes of this test, and therefore the GMOC should not contact the Experimenter.

== Results of testing step 4: 2013-03-07 ==

Chaos sent the following at 3:00 PM EST:
{{{
GMOC:

I am still a GENI experimenter, and am still trying to use the BBN site. I noticed that the slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15 has a sliver on the ProtoGENI aggregate at the BBN InstaGENI rack (instageni.gpolab.bbn.com:12369) which is causing trouble. Please perform an emergency stop on that sliver.

Also, i swear i sent almost this exact complaint yesterday. Please tell that experimenter to knock it off already. Some of us are trying to use GENI here.
:>) Chaos
}}}

The GMOC followed up at 3:11 PM EST with:
{{{
Greetings GENI Rack Site Operator at BBN,

We received an emergency stop request from a GENI experimenter, Chaos, for the following GENI resource: urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15 on the ProtoGENI Aggregate at the BBN InstaGENI rack (instageni.gpolab.bbn.com:12369) of which BBN and InstaGENI Ops team are listed as Aggregate Operator(s).

Please acknowledge within an hour that you are looking into this Emergency Stop request and are working to shut-down this resource. If no response is received in one hour, we will proceed to contact your Escalation contact and/or perform an isolation of your slice/resource from the GENI Core network.

NOTE: Please feel free to reference the following document for details of the GMOC Emergency Stop process: http://gmoc.grnoc.iu.edu/gmoc/documents/geni-standards/gmoc-support-emergency-stop-procedure-and-workflow.html

Thank You,
Eldar
GENI Meta Operations Center
Indiana University
gmoc@grnoc.iu.edu, 317-274-7783
Visit the GMOC Home Page at http://gmoc.grnoc.iu.edu/
}}}

= Step 5: Site Administrator receives ES request =

== Step 5A (verify): Data passed from GMOC to Site Administrator is in expected format ==

'''Using:'''
 * Local site ES procedure
 * Documented InstaGENI sliver shut down procedure
 * GMOC monitoring tools

'''Verify:'''
 * The GMOC sends a request with slice-specific or sliver-specific data in a format that can be fed into the shut down procedure
 * There is a step in the local site ES procedure for the Site Administrator to acknowledge that the GMOC's request is being processed
 * The Site Administrator can identify the experimenter's email address, username, and affiliation with the information provided by the GMOC and GMOC monitoring tools

=== Results of testing step 5A: 2013-03-07 ===

 * The GMOC's email included the slice URN, which was as expected for today's test. The InstaGENI procedure can be followed with a slice URN.
 * The example procedure includes a step to ack the GMOC's ES request, and I acked the request at 3:13 PM EST.
 * I can see the experimenter's email address, user URN, and operator organization when I look at [https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=slice_details;slice=urn%3Apublicid%3AIDN%2Bpgeni.gpolab.bbn.com%2Bslice%2Bjbs15 the GMOC web UI].

== Step 5B (verify): Shut down procedure can be followed to successfully shut down a sliver ==

'''Using:'''
 * Documented InstaGENI sliver shut down procedure
 * Administrative tools to shut down a sliver
 * GMOC monitoring tools

'''Verify:'''
 * The shut down procedure includes the complete set of steps to shut down a sliver in the rack
 * Following the shut down procedure results in a sliver being deactivated on a rack
 * Experimental traffic from the sliver is no longer being sent

=== Results of testing step 5B: 2013-03-07 ===

 * I used steps provided in the procedure to shut down the sliver in our rack:
{{{
[tupty@boss ~]$ wap cleanupslice -m urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15
Syncing target vlan 11 in [Experiment: emulab-ops/openflow-vlans]
getTrunksForVlan: 11: procurve2
mapVlansToSwitches: procurve2
getExperimentTrunksForVlan: 347: procurve2
getExperimentTrunksForVlan: 348: procurve2
getExperimentTrunksForVlan: 349: procurve2
getExperimentTrunksForVlan: 350: procurve2
getExperimentTrunksForVlan: 351: procurve2
getExperimentTrunksForVlan: 481: procurve2
getExperimentTrunksForVlan: 11: procurve2
getExperimentTrunksForVlan: 222: procurve2
mapStaleVlansToSwitches: procurve2
procurve2 -> startChildCall(FlipDebug)
Experiment vlans: 11
Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
Existing vlans: 11
Stale vlans:
Existing Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
snmpit: pgeni-gpolab-bbn-com/jbs15 has no VLANs, skipping
Running 'tbswap out pgeni-gpolab-bbn-com jbs15'
Beginning swap-out for pgeni-gpolab-bbn-com/jbs15
(782). 03/07/2013 15:17:24
TIMESTAMP: 15:17:24:621289 tbswap out started
Checking for feature SyncVlans.
Stopping the event system
Checking for feature NewEventScheduler.
Closing TCP proxy ports...
TIMESTAMP: 15:17:28:969824 snmpit started
Removing VLANs.
snmpit: pgeni-gpolab-bbn-com/jbs15 has no VLANs, skipping
TIMESTAMP: 15:17:29:468312 snmpit finished
Removing dynamic blobs.
Clearing shared port vlans.
Syncing target vlan 11 in [Experiment: emulab-ops/openflow-vlans]
getTrunksForVlan: 11: procurve2
mapVlansToSwitches: procurve2
getExperimentTrunksForVlan: 347: procurve2
getExperimentTrunksForVlan: 348: procurve2
getExperimentTrunksForVlan: 349: procurve2
getExperimentTrunksForVlan: 350: procurve2
getExperimentTrunksForVlan: 351: procurve2
getExperimentTrunksForVlan: 481: procurve2
getExperimentTrunksForVlan: 11: procurve2
getExperimentTrunksForVlan: 222: procurve2
mapStaleVlansToSwitches: procurve2
procurve2 -> startChildCall(FlipDebug)
Experiment vlans: 11
Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
Existing vlans: 11
Stale vlans:
Existing Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
Tearing down virtual nodes.
TIMESTAMP: 15:17:29:998358 vnode_setup -k started
TIMESTAMP: 15:17:30:207887 vnode_setup finished
Removing logical wires.
Freeing nodes.
TIMESTAMP: 15:17:30:431157 nfree started
Releasing all nodes from experiment [Experiment: pgeni-gpolab-bbn-com/jbs15].
TIMESTAMP: 15:17:30:688709 nfree finished
Resetting mountpoints.
TIMESTAMP: 15:17:30:691010 exports started
TIMESTAMP: 15:17:33:38637 exports finished
Resetting named maps.
TIMESTAMP: 15:17:33:41287 named started
TIMESTAMP: 15:17:33:552959 named finished
Resetting email lists.
TIMESTAMP: 15:17:33:554725 genelists started
TIMESTAMP: 15:17:33:719180 genelists finished
Resetting DB.
Successfully finished swap-out for pgeni-gpolab-bbn-com/jbs15.
15:17:33:727306
TIMESTAMP: 15:17:33:727868 tbswap out finished (succeeded)
Getting user added files.
Doing a savepoint on the experiment archive ...
Doing a commit on the experiment archive ...
Running 'tbend -e 782'
Beginning cleanup for pgeni-gpolab-bbn-com/jbs15. 15:17:34:173128
Clearing out virtual state.
Removing visualization data...
Cleanup finished! 15:17:34:601973
Archiving and clearing the experiment archive ...
Experiment pgeni-gpolab-bbn-com/jbs15 has been successfully terminated!
Removing experiment directories ...
}}}
 * I don't know of a way of verifying that the slice was deleted from the rack. That should probably be added to the doc. '''Update:''' This has been addressed.
 * Traffic dropped to the expected levels after the shutdown. See [http://monitor.gpolab.bbn.com/ganglia/graph.php?&c=BBN%20External&h=poblano.gpolab.bbn.com&v=1306335.89474&m=RX_BYTES_gi0_16&r=day&z=medium&jr=&js=&st=1362689368&vl=bytes%2Fsec&z=large traffic graph of poblano gi0/16 for the day as of 3:50].

== Step 5C (verify): Documented procedure includes a step to follow up with GMOC ==

'''Using:'''
 * Local site ES procedure

'''Verify:'''
 * There is a step for the site administrator to follow up with the GMOC that a sliver has been shut down

=== Results of testing step 5C: 2013-03-07 ===

There is a step to follow up with the GMOC, and I did so at 3:24 PM EST.

= Step 6 (verify): Sliver shut down procedure includes a clean-up step (if necessary) =

'''Using:'''
 * Documented InstaGENI sliver shut down procedure

'''Verify:'''
 * Ensure the InstaGENI sliver shut down procedure contains a recovery step describing what to do if the shut down affects other experimenters.

== Results of testing step 6: 2013-03-07 ==

 * Other slices (including tuptymon and jbs16) were not affected by us shutting down jbs15.
 * There were no clean up steps necessary for the pieces involving the InstaGENI doc in today's test.
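
The times recorded in the step 4 and step 5 results give a rough timeline of the drill. The Python sketch below recomputes the elapsed minutes from them. The event labels are illustrative, the times are the ones reported on this page (EST, 2013-03-07), the swap-out start is the 15:17:24 log timestamp truncated to the minute, and the one-hour limit is the acknowledgment window stated in the GMOC's email.

```python
from datetime import datetime, timedelta

DAY = "2013-03-07"  # date of the drill; all times are EST as reported above

# (label, HH:MM) pairs; the labels are illustrative, not official step names.
EVENTS = [
    ("ES request sent to GMOC", "15:00"),
    ("GMOC notifies site operator", "15:11"),
    ("Site administrator acknowledges", "15:13"),
    ("Swap-out of jbs15 started", "15:17"),
    ("Site administrator follows up with GMOC", "15:24"),
]


def parse(hhmm):
    """Turn an HH:MM string into a datetime on the day of the drill."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start, end):
    """Whole minutes elapsed between two HH:MM strings."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


times = dict(EVENTS)
ack_delay = minutes_between(times["GMOC notifies site operator"],
                            times["Site administrator acknowledges"])
total = minutes_between(EVENTS[0][1], EVENTS[-1][1])
# The GMOC email asks for an acknowledgment within one hour.
ack_within_deadline = timedelta(minutes=ack_delay) <= timedelta(hours=1)

for label, hhmm in EVENTS:
    print(f"{hhmm}  +{minutes_between(EVENTS[0][1], hhmm):2d} min  {label}")
print(f"ack delay: {ack_delay} min, within one hour: {ack_within_deadline}")
print(f"request to follow-up: {total} min")
```

With these times, the acknowledgment came 2 minutes after the GMOC's notification, and the whole drill took 24 minutes from the initial request to the follow-up with the GMOC.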