Version 48 (modified by tupty@bbn.com, 11 years ago)

Detailed test plan for IG-ADM-4: Emergency Stop Test

This page is GPO's working page for performing IG-ADM-4. It is public for informational purposes, but it is not an official status report. See GENIRacksHome/InstageniRacks/AcceptanceTestStatus for the current status of InstaGENI acceptance tests.

Last substantive edit of this page: 2013-02-26

Page format

  • The status chart summarizes the state of this test
  • The high-level description from test plan contains text copied exactly from the public test plan and acceptance criteria pages.
  • The steps contain things I will actually do/verify:
    • Steps may be composed of related substeps where I find this useful for clarity
    • Each step is identified as either "(prep)" or "(verify)":
      • Prep steps are just things we have to do. They're not tests of the rack, but are prerequisites for subsequent verification steps
      • Verify steps are steps in which we will actually look at rack output and make sure it is as expected. They contain a Using: block, which lists the procedures and tools used for the verification, and a Verify: block, which lists the outcomes expected for the test to pass.

Status of test

See GENIRacksHome/InstageniRacks/AcceptanceTestStatus for the meanings of test states.

Step  State  Date completed  Open Tickets  Closed Tickets/Comments
1     Pass   2013-03-01                    tupty has reviewed GMOC doc and EG doc, GPO doc has been posted
2     Pass   2013-03-06
3     Pass   2013-03-07
4     Pass   2013-03-07
5A    Pass   2013-03-07
5B    Pass   2013-03-07
5C    Pass   2013-03-07
6     Pass   2013-03-07

High-level description from test plan

In this test, an ES (Emergency Stop) drill is performed on a sliver in the rack.

Procedure

  • A site administrator reviews the local site ES procedure, GMOC ES procedure, and sliver shut down procedure, and verifies that these documents combined fully document the campus side of the ES procedure.
  • A second administrator (or the GPO) submits an ES request to GMOC, referencing activity from a public IP address assigned to a compute sliver in the rack that is part of the test experiment.
  • GMOC and the first site administrator perform an ES drill in which the site administrator successfully shuts down the sliver in coordination with GMOC.
  • GMOC completes the ES workflow, including updating/closing GMOC tickets.

Criteria to verify as part of this test

  • VI.07. A public document explains the requirements that site administrators have to the GENI community, including how to join required mailing lists, how to keep their support contact information up-to-date, how and under what circumstances to work with Legal, Law Enforcement and Regulatory(LLR) Plan, how to best contact the rack vendor with operational problems, what information needs to be provided to GMOC to support emergency stop, and how to interact with GMOC when an Emergency Stop request is received. (F.3, C.3.d)
  • VI.17. A procedure is documented for performing a shut down operation on any type of sliver on the rack, in support of an Emergency Stop request. (C.3.d)
  • VII.18. Given a public IP address and port, an exclusive VLAN, a sliver name, or a piece of user-identifying information such as e-mail address or username, a site administrator or GMOC operator can identify the email address, username, and affiliation of the experimenter who controlled that resource at a particular time. (D.7)
  • VII.19. GMOC and a site administrator can perform a successful Emergency Stop drill in which slivers containing compute and OpenFlow-controlled network resources are shut down. (C.3.d)

Step 1 (prep): Site administrator reviews local site ES procedure, GMOC ES procedure, and InstaGENI sliver shut down procedure

The site administrator should review the local site ES procedure, the ES procedure provided by the GMOC, and the InstaGENI sliver shut down procedure. All of these procedures should make sense together, and the site administrator should follow the local site ES procedure for the test. The site administrator should identify the parts of the local procedure where they need to take action on the aggregate, and they should reference the InstaGENI sliver shut down procedure for that part of the test. They should also identify where the local site procedure requires interfacing with the GMOC. The parts identified by the site administrator should be verified with the GMOC and with the InstaGENI team.

Results of testing step 1: 2013-03-01

The documents have been collected, and they are at the following locations:

Step 2 (prep): GPO, GMOC, and InstaGENI team coordinate a time to run an ES test

The GPO will coordinate with parties at the GMOC and on the InstaGENI team to identify when an ES test can be run. This test will focus primarily on the interactions with the site administrator(s) and on performing the procedures documented by the rack team. The following roles will need to be defined for this test:

  • GMOC Coordinator: person from the GMOC who coordinates the ES activity on the GMOC's side
  • InstaGENI Contact: person from the InstaGENI team who can be around if there are questions about the document or sliver shut down procedure
  • ES Initiator: GPO person who initiates an Emergency Stop request
  • Experimenter: GPO person who has created a sliver
  • Site Administrator: GPO person who is acting as the site administrator of the GPO InstaGENI rack

Results of testing step 2: 2013-02-28

Date of test: 2013-03-07

Role                Person
GMOC Coordinator    Eldar
InstaGENI Contact   geni-dev-utah@flux.utah.edu, Jon Deurig will be around
ES Initiator        Chaos
Experimenter        Josh
Site Administrator  Tim

Step 3 (prep): Experimenter sets up a slice

The experimenter will set up a slice that includes a sliver on the GPO InstaGENI rack. The sliver should be a VM that is attached to the shared mesoscale VLAN, and it should be sending traffic that is visible through monitoring.

Results of testing step 3: 2013-03-07

Josh is running UDP iperfs in his jbs15 and jbs16 slices from the BBN InstaGENI rack to the BBN ExoGENI rack. This traffic should be visible on the graph of poblano gi0/16's last hour of RX bytes/sec. We will want to look at the amount of traffic before and after we shut down the sliver. Once the shutdown is complete, I will capture today's graph with a timestamp.

Step 4 (prep): ES initiated

  • The ES Initiator contacts the GMOC Coordinator to initiate an ES request describing the slice URN.
  • The GMOC quickly walks through their procedure, skipping more formal steps as needed, in order to contact the aggregate operator primary contact.
    • The GMOC does not need to verify the identity of the ES Initiator for the purposes of this test, and therefore should not contact the ES Initiator.
    • The GMOC does not need to contact the Experimenter for the purposes of this test, and therefore the GMOC should not contact the Experimenter.

Results of testing step 4: 2013-03-07

Chaos sent the following at 3:00 PM EST:

GMOC:

I am still a GENI experimenter, and am still trying to use the BBN site.

I noticed that the slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15
has a sliver on the ProtoGENI aggregate at the BBN InstaGENI rack
(instageni.gpolab.bbn.com:12369) which is causing trouble.

Please perform an emergency stop on that sliver.

Also, i swear i sent almost this exact complaint yesterday.  Please tell
that experimenter to knock it off already.  Some of us are trying to
use GENI here.  :>)

Chaos

The GMOC followed up at 3:11 PM EST with:

Greetings GENI Rack Site Operator at BBN,

We received an emergency stop request from a GENI experimenter, Chaos, for the following GENI resource:

urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15 on the ProtoGENI Aggregate at the BBN InstaGENI rack (instageni.gpolab.bbn.com:12369) of which BBN and InstaGENI Ops team are listed as Aggregate Operator(s).

Please acknowledge within an hour that you are looking into this Emergency Stop request and are working to shut-down this resource. If no response is received in one hour, we will proceed to contact your Escalation contact and/or perform an isolation of your slice/resource from the GENI Core network.

NOTE: Please feel free to reference the following document for details of the GMOC Emergency Stop process:
http://gmoc.grnoc.iu.edu/gmoc/documents/geni-standards/gmoc-support-emergency-stop-procedure-and-workflow.html

Thank You,

Eldar
GENI Meta Operations Center
Indiana University
gmoc@grnoc.iu.edu, 317-274-7783
Visit the GMOC Home Page at
http://gmoc.grnoc.iu.edu/

Step 5: Site Administrator receives ES request

Step 5A (verify): Data passed from GMOC to Site Administrator is in expected format

Using:

  • Local site ES procedure
  • Documented InstaGENI sliver shut down procedure
  • GMOC monitoring tools

Verify:

  • The GMOC sends a request with slice-specific or sliver-specific data in a format that can be fed into the shut down procedure
  • There is a step in the local site ES procedure for the Site Administrator to acknowledge that the GMOC's request is being processed
  • The Site Administrator can identify the experimenter's email address, username, and affiliation with the information provided by the GMOC and GMOC monitoring tools

Results of testing step 5A: 2013-03-07

  • The GMOC's email included the slice URN, which was as expected for today's test. The InstaGENI procedure can be followed with a slice URN.
  • The example procedure includes a step to ack the GMOC's ES request, and I acked the request at 3:13 PM EST.
  • I can see the experimenter's email address, user URN, and operator organization when I look at the GMOC web UI.
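
The slice URN in the GMOC's email is the one piece of data that feeds the shut down procedure, so it can help to sanity-check it before acting. The following is a minimal sketch, assuming the URN follows the `urn:publicid:IDN+authority+type+name` layout seen in the email above. The names `GeniUrn`, `parse_geni_urn`, and `experiment_label` are hypothetical helpers, not part of any GENI or Emulab tool. The dots-to-dashes mapping only mirrors how the slice appears in the rack's own output on this page (`pgeni-gpolab-bbn-com/jbs15`) and is not a documented rule.

```python
from typing import NamedTuple


class GeniUrn(NamedTuple):
    """Fields of a GENI URN (hypothetical helper type for this sketch)."""
    authority: str
    resource_type: str
    name: str


def parse_geni_urn(urn: str) -> GeniUrn:
    """Split a URN of the form urn:publicid:IDN+authority+type+name."""
    prefix = "urn:publicid:IDN+"
    if not urn.startswith(prefix):
        raise ValueError(f"not a GENI URN: {urn!r}")
    parts = urn[len(prefix):].split("+")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected authority+type+name in: {urn!r}")
    return GeniUrn(*parts)


def experiment_label(urn: GeniUrn) -> str:
    """Assumption: mirrors the 'authority-with-dashes/name' form seen in the
    rack output on this page; not taken from any documented mapping."""
    return f"{urn.authority.replace('.', '-')}/{urn.name}"


requested = parse_geni_urn("urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15")
print(requested.resource_type, requested.name)
print(experiment_label(requested))
```

A check like this only confirms that the request names a slice in a well-formed way; it does not replace looking up the experimenter in the GMOC web UI.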

Step 5B (verify): Shut down procedure can be followed to successfully shut down a sliver

Using:

  • Documented InstaGENI sliver shut down procedure
  • Administrative tools to shut down a sliver
  • GMOC monitoring tools

Verify:

  • The shut down procedure includes the complete set of steps to shut down a sliver in the rack
  • Following the shut down procedure results in a sliver being deactivated on a rack
  • Experimental traffic from the sliver is no longer being sent

Results of testing step 5B: 2013-03-07

  • I used steps provided in the procedure to shut down the sliver in our rack:
    [tupty@boss ~]$ wap cleanupslice -m urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15
    Syncing target vlan 11 in [Experiment: emulab-ops/openflow-vlans]
    getTrunksForVlan: 11: procurve2
    mapVlansToSwitches: procurve2
    getExperimentTrunksForVlan: 347: procurve2
    getExperimentTrunksForVlan: 348: procurve2
    getExperimentTrunksForVlan: 349: procurve2
    getExperimentTrunksForVlan: 350: procurve2
    getExperimentTrunksForVlan: 351: procurve2
    getExperimentTrunksForVlan: 481: procurve2
    getExperimentTrunksForVlan: 11: procurve2
    getExperimentTrunksForVlan: 222: procurve2
    mapStaleVlansToSwitches: procurve2
    procurve2 -> startChildCall(FlipDebug)
    Experiment vlans: 11
    Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
    Existing vlans: 11
    Stale vlans: 
    Existing Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
    snmpit: pgeni-gpolab-bbn-com/jbs15 has no VLANs, skipping
    Running 'tbswap out  pgeni-gpolab-bbn-com jbs15'
    Beginning swap-out for pgeni-gpolab-bbn-com/jbs15 (782). 03/07/2013 15:17:24
    TIMESTAMP: 15:17:24:621289 tbswap out started
    Checking for feature SyncVlans.
    Stopping the event system
    Checking for feature NewEventScheduler.
    Closing TCP proxy ports...
    TIMESTAMP: 15:17:28:969824 snmpit started
    Removing VLANs.
    snmpit: pgeni-gpolab-bbn-com/jbs15 has no VLANs, skipping
    TIMESTAMP: 15:17:29:468312 snmpit finished
    Removing dynamic blobs.
    Clearing shared port vlans.
    Syncing target vlan 11 in [Experiment: emulab-ops/openflow-vlans]
    getTrunksForVlan: 11: procurve2
    mapVlansToSwitches: procurve2
    getExperimentTrunksForVlan: 347: procurve2
    getExperimentTrunksForVlan: 348: procurve2
    getExperimentTrunksForVlan: 349: procurve2
    getExperimentTrunksForVlan: 350: procurve2
    getExperimentTrunksForVlan: 351: procurve2
    getExperimentTrunksForVlan: 481: procurve2
    getExperimentTrunksForVlan: 11: procurve2
    getExperimentTrunksForVlan: 222: procurve2
    mapStaleVlansToSwitches: procurve2
    procurve2 -> startChildCall(FlipDebug)
    Experiment vlans: 11
    Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
    Existing vlans: 11
    Stale vlans: 
    Existing Trunk Ports: interconnect-poblano:0.1 pc2:2.1 interconnect-campus:0.1 pc2:1.1 pc1:1.1
    Tearing down virtual nodes.
    TIMESTAMP: 15:17:29:998358 vnode_setup -k started
    TIMESTAMP: 15:17:30:207887 vnode_setup finished
    Removing logical wires.
    Freeing nodes.
    TIMESTAMP: 15:17:30:431157 nfree started
    Releasing all nodes from experiment [Experiment: pgeni-gpolab-bbn-com/jbs15].
    TIMESTAMP: 15:17:30:688709 nfree finished
    Resetting mountpoints.
    TIMESTAMP: 15:17:30:691010 exports started
    TIMESTAMP: 15:17:33:38637 exports finished
    Resetting named maps.
    TIMESTAMP: 15:17:33:41287 named started
    TIMESTAMP: 15:17:33:552959 named finished
    Resetting email lists.
    TIMESTAMP: 15:17:33:554725 genelists started
    TIMESTAMP: 15:17:33:719180 genelists finished
    Resetting DB.
    Successfully finished swap-out for pgeni-gpolab-bbn-com/jbs15. 15:17:33:727306
    TIMESTAMP: 15:17:33:727868 tbswap out finished (succeeded)
    Getting user added files.
    Doing a savepoint on the experiment archive ...
    Doing a commit on the experiment archive ...
    Running 'tbend  -e 782'
    Beginning cleanup for pgeni-gpolab-bbn-com/jbs15. 15:17:34:173128
    Clearing out virtual state.
    Removing visualization data...
    Cleanup finished! 15:17:34:601973
    Archiving and clearing the experiment archive ...
    Experiment pgeni-gpolab-bbn-com/jbs15 has been successfully terminated!
    Removing experiment directories ... 
    
  • I don't know of a way of verifying that the slice was deleted from the rack. That should probably be added to the doc.

Update: This has been addressed.
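
One rough way to review the output above after the fact is to scan it for the success lines and the `TIMESTAMP` markers. The sketch below is illustrative only: `parse_timestamps` and `swapout_summary` are hypothetical helpers, the sample text is a short excerpt of the output shown above, and it assumes the fourth timestamp field is microseconds. It checks what the tool printed, not the state of the rack, so it is not a substitute for a documented verification step.

```python
import re
from datetime import timedelta

# Matches lines such as "TIMESTAMP: 15:17:24:621289 tbswap out started".
TIMESTAMP_RE = re.compile(r"TIMESTAMP: (\d+):(\d+):(\d+):(\d+) (.+)")

# Excerpt copied from the cleanupslice output on this page.
SAMPLE_LOG = """\
Beginning swap-out for pgeni-gpolab-bbn-com/jbs15 (782). 03/07/2013 15:17:24
TIMESTAMP: 15:17:24:621289 tbswap out started
TIMESTAMP: 15:17:28:969824 snmpit started
TIMESTAMP: 15:17:29:468312 snmpit finished
TIMESTAMP: 15:17:33:727868 tbswap out finished (succeeded)
Experiment pgeni-gpolab-bbn-com/jbs15 has been successfully terminated!
"""


def parse_timestamps(log_text):
    """Map each TIMESTAMP label to its time of day as a timedelta."""
    events = {}
    for line in log_text.splitlines():
        match = TIMESTAMP_RE.search(line)
        if match:
            hours, minutes, seconds, micros = (int(g) for g in match.groups()[:4])
            # Assumption: the last numeric field is microseconds.
            events[match.group(5).strip()] = timedelta(
                hours=hours, minutes=minutes, seconds=seconds, microseconds=micros
            )
    return events


def swapout_summary(log_text):
    """Return (succeeded, elapsed_seconds) based only on the printed output."""
    events = parse_timestamps(log_text)
    start = events.get("tbswap out started")
    end = events.get("tbswap out finished (succeeded)")
    succeeded = end is not None and "has been successfully terminated!" in log_text
    elapsed = None
    if start is not None and end is not None:
        elapsed = (end - start).total_seconds()
    return succeeded, elapsed


succeeded, elapsed = swapout_summary(SAMPLE_LOG)
print(succeeded, elapsed)
```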

Step 5C (verify): Documented procedure includes a step to follow up with GMOC

Using:

  • Local site ES procedure

Verify:

  • There is a step for the site administrator to follow up with the GMOC that a sliver has been shut down

Results of testing step 5C: 2013-03-07

There is a step to follow up with the GMOC, and I did so at 3:24 PM EST.
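
The times recorded on this page can be lined up to confirm that the acknowledgement fell inside the one-hour window stated in the GMOC's email. This is a small sketch using only the times reported above (all EST on 2013-03-07); the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Times as recorded on this page, all EST on the day of the test.
request_sent = datetime(2013, 3, 7, 15, 0)      # Chaos emails the GMOC
gmoc_notice = datetime(2013, 3, 7, 15, 11)      # GMOC emails the site operator
site_ack = datetime(2013, 3, 7, 15, 13)         # Site Administrator acknowledges
shutdown_reported = datetime(2013, 3, 7, 15, 24)  # follow-up to the GMOC

# The GMOC's email asks for an acknowledgement within an hour.
ACK_WINDOW = timedelta(hours=1)

ack_delay = site_ack - gmoc_notice
total_elapsed = shutdown_reported - request_sent

print(f"ack delay: {ack_delay}, within window: {ack_delay <= ACK_WINDOW}")
print(f"request to shutdown report: {total_elapsed}")
```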

Step 6 (verify): Sliver shut down procedure includes a clean-up step (if necessary)

Using:

  • Documented InstaGENI sliver shut down procedure

Verify:

  • Ensure the InstaGENI sliver shut down procedure contains a recovery step describing what to do if the shut down affects other experimenters.

Results of testing step 6: 2013-03-07

  • Other slices (including tuptymon and jbs16) were not affected by the shutdown of jbs15.
  • There were no clean up steps necessary for the pieces involving the InstaGENI doc in today's test.