wiki:GENIRacksHome/ExogeniRacks/AcceptanceTestStatus/EG-ADM-1

Version 62 (modified by Josh Smift, 11 years ago)


  1. Detailed test plan for EG-ADM-1: Rack Receipt and Inventory Test
    1. Page format
    2. Status of test
    3. High-level description from test plan
      1. Procedure
      2. Criteria to verify as part of this test
    4. Step 1 (prep): ExoGENI and GPO power and wire the BBN rack
    5. Step 2: Configure and verify DNS
      1. Step 2A (verify): Find out what IP-to-hostname mapping to use
      2. Step 2B (prep): Insert IP-to-hostname mapping in DNS
      3. Step 2C (verify): Test all PTR records
    6. Step 3: GPO requests and receives administrator accounts
      1. Step 3A: GPO requests access to head node
        1. Results of testing step 3A: 2012-05-10
      2. Step 3B: GPO requests access to network devices
        1. Results of testing step 3B: 2012-05-10
        2. Results of testing step 3B: 2012-05-26
      3. Step 3C: GPO requests access to worker nodes running under OpenStack
        1. Results of testing step 3C: 2012-05-10
      4. Step 3D: GPO requests access to IPMI management interfaces for workers
        1. Results of testing step 3D: 2012-05-10 - 2012-05-11
          1. Steps taken on 2012-05-10
          2. Steps taken on 2012-05-11
      5. Step 3E: GPO gets access to allocated bare metal worker nodes by default
        1. Results of testing step 3E: 2012-07-05
    7. Step 4: GPO inventories the rack based on our own processes
      1. Step 4A: Inventory and label physical rack contents
        1. Results of testing step 4A: 2012-05-11
        2. Results of testing step 4A: 2012-05-26
      2. Step 4B: Inventory rack power requirements
        1. Results of testing step 4B: 2012-05-11
        2. Results of testing step 4B: 2012-07-05
      3. Step 4C: Inventory rack network connections
        1. Results of testing step 4C: 2012-05-11
        2. Results of testing step 4C: 2012-07-23
      4. Step 4D: Verify government property accounting for the rack
    8. Step 5: Configure operational alerting for the rack
      1. Step 5A: GPO installs active control network monitoring
        1. Results of testing step 5A: 2012-05-23
      2. Step 5B: GPO installs active shared dataplane monitoring
        1. Results of testing step 5B: 2012-07-05
      3. Step 5C: GPO gets access to nagios information about the BBN rack
        1. Results of testing step 5C: 2012-05-23
        2. Results of testing step 5C: 2012-05-26
        3. Results of testing step 5C: 2012-07-05
        4. Results of testing step 5C: 2012-07-23
      4. Step 5D: GPO receives e-mail about BBN rack nagios alerts
        1. Results of testing step 5D: 2012-05-23
        2. Results of testing step 5D: 2012-07-05
        3. Results of testing step 5D: 2012-07-23
    9. Step 6: Setup contact info and change control procedures
      1. Step 6A: Exogeni operations staff should subscribe to response-team
        1. Results of testing step 6A: 2012-05-23
      2. Step 6B: Exogeni operations staff should provide contact info to GMOC
        1. Results of testing step 6B: 2012-05-23
      3. Step 6C: Negotiate an interim change control notification procedure
        1. Results of testing step 6C: 2012-05-23
        2. Results of testing step 6C: 2013-02-14

Detailed test plan for EG-ADM-1: Rack Receipt and Inventory Test

This page is GPO's working page for performing EG-ADM-1. It is public for informational purposes, but it is not an official status report. See GENIRacksHome/ExogeniRacks/AcceptanceTestStatus for the current status of ExoGENI acceptance tests.

Last substantive edit of this page: 2013-02-19

Page format

  • The status chart summarizes the state of this test
  • The high-level description from test plan contains text copied exactly from the public test plan and acceptance criteria pages.
  • The steps contain things I will actually do or verify:
    • Steps may be composed of related substeps where I find this useful for clarity
    • Each step is identified as either "(prep)" or "(verify)":
      • Prep steps are just things we have to do. They're not tests of the rack, but they are prerequisites for subsequent verification steps.
      • Verify steps are steps in which we actually look at rack output and make sure it is as expected. They contain a "Using:" block, which lists the steps to run the verification, and an "Expect:" block, which lists the outcome expected for the test to pass.

Status of test

Step  State     Date completed  Open Tickets  Closed Tickets/Comments
1     Pass      2012-02-24
2A    Pass      2012-10-10      (11)
2B    Pass      2012-10-10
2C    Pass      2012-10-10
3A    Pass      2012-05-10
3B    Pass      2012-05-10      (10, 20, 32)
3C    Pass      2012-05-10
3D    Pass      2012-05-11
3E    Pass      2012-07-05
4A    Pass      2012-10-10      (22, 33) clarify some outstanding DNS questions
4B    Pass      2012-07-05      (23)
4C    Pass      2012-10-10      (71)
4D    Pass      2012-06-21      (12)
5A    Pass      2012-05-23
5B    Complete                  (28) This test was completed once, but will be re-run once hybrid mode is available on the dataplane switch
5C    Pass      2012-07-23      (29)
5D    Pass      2012-07-23      (30)
6A    Pass      2012-05-23
6B    Pass      2012-05-23
6C    Pass      2013-02-14

High-level description from test plan

This "test" uses BBN as an example site by verifying that we can do all the things we need to do to integrate the rack into our standard local procedures for systems we host.

Procedure

  • ExoGENI and GPO power and wire the BBN rack
  • GPO configures the exogeni.gpolab.bbn.com DNS namespace and 192.1.242.0/25 IP space, and enters all public IP addresses for the BBN rack into DNS.
  • GPO requests and receives administrator accounts on the rack and read access to ExoGENI Nagios for GPO sysadmins.
  • GPO inventories the physical rack contents, network connections and VLAN configuration, and power connectivity, using our standard operational inventories.
  • GPO, ExoGENI, and GMOC share information about contact information and change control procedures, and ExoGENI operators subscribe to GENI operations mailing lists and submit their contact information to GMOC.

Criteria to verify as part of this test

  • VI.02. A public document contains a parts list for each rack. (F.1)
  • VI.03. A public document states the detailed power requirements of the rack, including how many PDUs are shipped with the rack, how many of the PDUs are required to power the minimal set of shipped equipment, the part numbers of the PDUs, and the NEMA input connector type needed by each PDU. (F.1)
  • VI.04. A public document states the physical network connectivity requirements between the rack and the site network, including number, allowable bandwidth range, and allowed type of physical connectors, for each of the control and dataplane networks. (F.1)
  • VI.05. A public document states the minimal public IP requirements for the rack, including: number of distinct IP ranges and size of each range, hostname to IP mappings which should be placed in site DNS, whether the last-hop routers for public IP ranges subnets sit within the rack or elsewhere on the site, and what firewall configuration is desired for the control network. (F.1)
  • VI.06. A public document states the dataplane network requirements and procedures for a rack, including necessary core backbone connectivity and documentation, any switch configuration options needed for compatibility with the L2 core, and the procedure for connecting non-rack-controlled VLANs and resources to the rack dataplane. (F.1)
  • VI.07. A public document explains the requirements that site administrators have to the GENI community, including how to join required mailing lists, how to keep their support contact information up-to-date, how and under what circumstances to work with the Legal, Law Enforcement and Regulatory (LLR) Plan, how to best contact the rack vendor with operational problems, what information needs to be provided to GMOC to support emergency stop, and how to interact with GMOC when an Emergency Stop request is received. (F.3, C.3.d)
  • VI.14. A procedure is documented for creating new site administrator and operator accounts. (C.3.a)
  • VII.01. Using the provided documentation, GPO is able to successfully power and wire their rack, and to configure all needed IP space within a per-rack subdomain of gpolab.bbn.com. (F.1)
  • VII.02. Site administrators can understand the physical power, console, and network wiring of components inside their rack and document this in their preferred per-site way. (F.1)

Step 1 (prep): ExoGENI and GPO power and wire the BBN rack

This was done on 2012-02-23 and 2012-02-24, and Chaos took rough notes at ChaosSandbox/ExogeniRackNotes.

Step 2: Configure and verify DNS

(This is GST 3354 item 5.)

Step 2A (verify): Find out what IP-to-hostname mapping to use

Using:

  • If the rack IP requirements documentation for the rack exists:
    • Review that documentation and determine what IP to hostname mappings should exist for 192.1.242.0/25
  • Otherwise:
    • Iterate with exogeni-ops to determine the IP to hostname mappings to use for 192.1.242.0/25

Expect:

  • Reasonable IP-to-hostname mappings for 126 valid IPs allocated for ExoGENI use in 192.1.242.0/25

Step 2B (prep): Insert IP-to-hostname mapping in DNS

  • Fully populate 192.1.242.0/25 PTR entries in GPO lab DNS
  • Fully populate exogeni.gpolab.bbn.com forward entries in GPO lab DNS
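As a sketch of this prep step, the reverse-zone records can be generated mechanically rather than typed by hand. The hostN names below are placeholders, not real mappings (the real names come from step 2A):

```shell
# Sketch: generate PTR records for 192.1.242.0/25 (valid host octets 1-126).
# The "hostN" names are placeholders for the real step 2A mappings.
for lastoct in $(seq 1 126); do
  printf '%d IN PTR host%d.exogeni.gpolab.bbn.com.\n' "$lastoct" "$lastoct"
done > ptr-records.txt
wc -l < ptr-records.txt
```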

Step 2C (verify): Test all PTR records

Using:

  • From a BBN desktop host:
    for lastoct in {1..127}; do
    host 192.1.242.$lastoct
    done
    

Expect:

  • All results look like:
    $lastoct.242.1.192.in-addr.arpa domain name pointer <something reasonable>
    
    and none look like:
    Host $lastoct.242.1.192.in-addr.arpa. not found: 3(NXDOMAIN)
    
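The pass/fail check can be automated by saving the loop output and grepping for NXDOMAIN. A sketch, run here against a two-line stand-in transcript rather than live DNS:

```shell
# Sketch: scan a saved transcript of the `host` loop for missing PTRs.
# The two sample lines stand in for real output; on the live network,
# redirect the loop above into /tmp/ptr-check.txt first.
cat > /tmp/ptr-check.txt <<'EOF'
1.242.1.192.in-addr.arpa domain name pointer bbn-hn.exogeni.gpolab.bbn.com.
Host 2.242.1.192.in-addr.arpa. not found: 3(NXDOMAIN)
EOF
grep -c 'NXDOMAIN' /tmp/ptr-check.txt
```

A count of 0 means every PTR resolved; anything else lists how many are missing.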

Step 3: GPO requests and receives administrator accounts

Step 3A: GPO requests access to head node

(This is GST 3354 item 2a.)

Using:

Verify:

  • Logins succeed for Chaos, Josh, and Tim
  • The command works:
    $ sudo whoami
    root
    

Results of testing step 3A: 2012-05-10

  • Chaos successfully used public-key login and sudo from a BBN subnet (128.89.68.0/23):
    capybara,[~],10:06(0)$ ssh bbn-hn.exogeni.gpolab.bbn.com
    Last login: Thu May 10 14:06:47 2012 from capybara.bbn.com
    ...
    bbn-hn,[~],14:06(0)$ sudo whoami
    [sudo] password for chaos:
    root
    bbn-hn,[~],14:07(0)$
    
  • Josh reported successful public-key login and sudo from a BBN subnet (128.89.91.0/24)
  • Tim reported successful password login and sudo from a BBN subnet (128.89.252.0/22)

Step 3B: GPO requests access to network devices

(This is GST 3354 item 2f.)

Using:

  • Request accounts for GPO ops staffers on network devices 8052.bbn.xo (management) and 8264.bbn.xo (dataplane) from exogeni-ops

Verify:

  • I know what hostname or IP address to log in to for each of the 8052 and 8264 switches
  • I can successfully perform those logins at least once
  • I can successfully run a few test commands to verify enable mode:
    show running-config
    show mac-address-table
    
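Once a show running-config capture is saved off, the VLAN-relevant stanzas can be pulled out for the inventory. A sketch against illustrative sample lines (the sample config is an assumption, and the real layout may differ):

```shell
# Sketch: list VLAN and port stanza headers from a saved running-config.
# The sample config below is illustrative, not the real 8052 configuration.
cat > /tmp/8052-running.txt <<'EOF'
version "6.8.1"
switch-type "IBM Networking Operating System RackSwitch G8052"
vlan 1007
        enable
interface port 10
        pvid 1007
EOF
grep -E '^(vlan|interface port) ' /tmp/8052-running.txt
```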

Results of testing step 3B: 2012-05-10

  • Per e-mail from Chris, the 8052 is 192.168.103.2, and the 8264 is 192.168.103.4. The 8052 also has the public IP address 192.1.242.4.
  • Login from bbn-hn to 192.168.103.2:
    bbn-hn,[~],14:23(1)$ ssh 192.168.103.2
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8052.
    
    
    8052.bbn.xo>enable
    
    Enable privilege granted.
    8052.bbn.xo#show running-config
    Current configuration:
    !
    version "6.8.1"
    switch-type "IBM Networking Operating System RackSwitch G8052"
    ...
    8052.bbn.xo#show mac-address-table
    Mac address Aging Time: 300
    ...
    8052.bbn.xo#exitReceived disconnect from 192.168.103.2: 11: Logged out.
    
  • Login from bbn-hn to 192.168.103.4:
    bbn-hn,[~],14:28(1)$ ssh 192.168.103.4
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8264.
    
    
    8264.bbn.xo>en
    
    Enable privilege granted.
    8264.bbn.xo#show running-config
    Current configuration:
    !
    version "6.8.1"
    switch-type "IBM Networking Operating System RackSwitch G8264"
    ...
    8264.bbn.xo#show mac-address-table
    Mac address Aging Time: 300
    
    FDB is empty.
    8264.bbn.xo#exit
    
    Received disconnect from 192.168.103.4: 11: Logged out.
    
  • Login from capybara (BBN network) to 192.1.242.4:
    capybara,[~/src/cvs/geni-inf/GENI-CVS.BBN.COM/puppet],10:14(0)$ ssh 192.1.242.4
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8052.
    
    
    8052.bbn.xo>en
    
    Enable privilege granted.
    8052.bbn.xo#exitReceived disconnect from 192.1.242.4: 11: Logged out.
    
  • Tim (tupty) attempts login from bbn-hn to 192.168.103.2:
    [tupty@bbn-hn ~]$ ssh 192.168.103.2
    Enter radius password:
    Received disconnect from 192.168.103.2: 11: Logged out.
    

In summary, all of the access works for me because I am in xoadmins, but Tim is not able to log in because bbnadmins does not have access.

Results of testing step 3B: 2012-05-26

Testing the assertion that exoticket:20 has been resolved, so my site admin account, cgolubit, should be able to run this test.

  • Per e-mail from Chris, the 8052 is 192.168.103.2, and the 8264 is 192.168.103.4. The 8052 also has the public IP address 192.1.242.4.

Testing the 8052:

  • Login from bbn-hn to 192.168.103.2 works:
    (cgolubit) bbn-hn,[~],13:18(0)$ ssh 192.168.103.2
    The authenticity of host '192.168.103.2 (192.168.103.2)' can't be established.
    DSA key fingerprint is 89:b6:13:30:a5:74:e3:3e:a6:aa:71:7a:91:6e:80:fd.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '192.168.103.2' (DSA) to the list of known hosts.
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8052.
    
    
    8052.bbn.xo>
    
  • Enable access is not granted, as Chris said:
    8052.bbn.xo>enable
    
    Enable access using (oper) credentials restricted to admin accounts only.
    8052.bbn.xo>
    
  • The full running-config can't be viewed in oper mode:
    8052.bbn.xo>show running-config
                      ^
    % Invalid input detected at '^' marker.
    8052.bbn.xo>
    
  • The MAC address table can be viewed in oper mode:
    8052.bbn.xo>show mac-address-table
    Mac address Aging Time: 300
    
    Total number of FDB entries : 26
    ...
    
  • Some information about VLANs can be viewed in oper mode. Both of these work:
    show interface information
    show vlan information
    
    and I believe that, between these, I can get as much information about the VLAN configuration of interfaces as I could from show running-config.

Testing the 8264:

  • Login from bbn-hn to 192.168.103.4 works:
    (cgolubit) bbn-hn,[~],13:31(255)$ ssh 192.168.103.4
    The authenticity of host '192.168.103.4 (192.168.103.4)' can't be established.
    DSA key fingerprint is f0:55:24:77:00:f2:5c:cd:69:86:4c:28:ac:f8:52:26.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added '192.168.103.4' (DSA) to the list of known hosts.
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8264.
    
    
    8264.bbn.xo>
    
  • Entering enable mode does not work:
    8264.bbn.xo>enable
    
    Enable access using (oper) credentials restricted to admin accounts only.
    8264.bbn.xo>
    
  • Full running config cannot be viewed:
    8264.bbn.xo>show running-config
                      ^
    % Invalid input detected at '^' marker.
    
  • Mac address table (which is empty here) can be viewed:
    8264.bbn.xo>show mac-address-table
    Mac address Aging Time: 300
    
    FDB is empty.
    
  • OpenFlow information can be viewed, including the DPID and controllers for an active instance:
    8264.bbn.xo>show openflow 1
    Open Flow Instance ID: 1
            DataPath ID: 0x640817f4b52a00
    ...
    Configured Controllers:
            IP Address: 192.168.103.10
                    State: Active
                    Port: 6633
                    Retry Count: 0
            Configured Controller Count 1
    
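The DPID and controller details in a capture like the one above can be scraped mechanically. A sketch, using a stand-in file that mirrors the field layout shown in the transcript:

```shell
# Sketch: pull the DataPath ID out of a saved `show openflow 1` capture.
# The stand-in file copies the field layout from the transcript above.
cat > /tmp/8264-openflow.txt <<'EOF'
Open Flow Instance ID: 1
        DataPath ID: 0x640817f4b52a00
Configured Controllers:
        IP Address: 192.168.103.10
                State: Active
                Port: 6633
EOF
awk -F': ' '/DataPath ID/ {print $2}' /tmp/8264-openflow.txt
```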

Back to control switch, looking at public IP address:

  • Login from capybara (BBN network) to 192.1.242.4 appears to work identically:
    capybara,[~],09:35(255)$ ssh cgolubit@192.1.242.4
    Enter radius password:
    
    IBM Networking Operating System RackSwitch G8052.
    
    
    8052.bbn.xo>enable
    
    Enable access using (oper) credentials restricted to admin accounts only.
    8052.bbn.xo>
    

One more thing: per Wednesday's call, the switch control IPs should have the hostnames 8052.bbn.xo and 8064.bbn.xo, but these do not work either forward or reverse:

bbn-hn,[~],13:39(0)$ nslookup
> server
Default server: 152.54.1.66
Address: 152.54.1.66#53
Default server: 192.1.249.10
Address: 192.1.249.10#53
> 8052.bbn.xo
Server:         152.54.1.66
Address:        152.54.1.66#53

** server can't find 8052.bbn.xo: NXDOMAIN
> 8064.bbn.xo
Server:         152.54.1.66
Address:        152.54.1.66#53

** server can't find 8064.bbn.xo: NXDOMAIN
> 192.168.103.2
Server:         152.54.1.66
Address:        152.54.1.66#53

** server can't find 2.103.168.192.in-addr.arpa.: NXDOMAIN
> 192.168.103.4
Server:         152.54.1.66
Address:        152.54.1.66#53

** server can't find 4.103.168.192.in-addr.arpa.: NXDOMAIN

Having clarified that this should work, I'll make a ticket for it now.

Step 3C: GPO requests access to worker nodes running under OpenStack

(This is GST 3354 item 2c.)

Using:

  • From bbn-hn, try to SSH to bbn-w1
  • From bbn-hn, try to SSH to bbn-w2
  • From bbn-hn, try to SSH to bbn-w3
  • From bbn-hn, try to SSH to bbn-w4

Verify:

  • For each connection, either the connection succeeds or we can verify that the node is not an OpenStack worker.

Results of testing step 3C: 2012-05-10

  • According to /etc/hosts:
    10.100.0.11             bbn-w1.local            bbn-w1
    10.100.0.12             bbn-w2.local            bbn-w2
    10.100.0.13             bbn-w3.local            bbn-w3
    10.100.0.14             bbn-w4.local            bbn-w4
    
    So I think the names bbn-w[1-4] will point to the VLAN 1007 (OpenStack) locations.
  • Chaos's login from bbn-hn to bbn-w1 using public-key SSH:
    bbn-hn,[~],16:18(0)$ ssh bbn-w1
    Last login: Fri Apr 27 12:27:02 2012 from bbn-hn.local
    ...
    bbn-w1,[~],16:19(0)$ sudo whoami
    [sudo] password for chaos:
    root
    
  • Tim reported successful login from bbn-hn to bbn-w1 and was able to sudo, so this works for members of the bbnadmins group as well.
  • Chaos's login from bbn-hn to bbn-w2:
    bbn-hn,[~],16:25(0)$ ssh bbn-w2
    Last login: Fri Mar 23 20:09:42 2012 from bbn-hn.bbn.exogeni.net
    ...
    bbn-w2,[~],16:25(0)$ sudo whoami
    [sudo] password for chaos:
    root
    
  • Chaos's login from bbn-hn to bbn-w3:
    bbn-hn,[~],16:26(0)$ ssh bbn-w3
    Last login: Fri Mar 23 20:14:17 2012 from bbn-hn.bbn.exogeni.net
    ...
    bbn-w3,[~],16:26(0)$ sudo whoami
    [sudo] password for chaos:
    root
    
  • Chaos's login from bbn-hn to bbn-w4:
    bbn-hn,[~],16:27(0)$ ssh bbn-w4
    ssh: connect to host bbn-w4 port 22: No route to host
    
    I need to verify that bbn-w4 is inaccessible because it is not running OpenStack, rather than because something else is amiss. Two possible ways I could verify this:
    • I can look at the IPMI console when I get to that point, and verify that the node's console appears to be doing something other than OpenStack.
    • I can ask Victor to tell me the node state.

During testing of item 3D, I was able to verify that bbn-w4 is at a PXE prompt, and thus is not running OpenStack right now.
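The worker name-to-IP mapping can be listed mechanically from /etc/hosts. A sketch, using the entries quoted above as stand-in input (on bbn-hn, read /etc/hosts itself):

```shell
# Sketch: list the bbn-w* short names and their IPs from /etc/hosts-style input.
# The heredoc repeats the entries quoted above; use /etc/hosts on bbn-hn.
awk '$3 ~ /^bbn-w/ {print $3, $1}' > /tmp/workers.txt <<'EOF'
10.100.0.11             bbn-w1.local            bbn-w1
10.100.0.12             bbn-w2.local            bbn-w2
10.100.0.13             bbn-w3.local            bbn-w3
10.100.0.14             bbn-w4.local            bbn-w4
EOF
cat /tmp/workers.txt
```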

Step 3D: GPO requests access to IPMI management interfaces for workers

(This is GST 3354 item 2b.)

Using:

Verify:

  • VPN connection succeeds
  • Login to each IMM succeeds
  • Launching the remote console at each IMM succeeds

Results of testing step 3D: 2012-05-10 - 2012-05-11

Steps taken on 2012-05-10

Chaos's testing under Mac OS X 10.6.8:

  • One-time setup:
    sudo port install openvpn2
    sudo port install tuntaposx
    
    mkdir -p ~/tmp/exogeni-vpn
    cd ~/tmp/exogeni-vpn
    unzip ~/bbnadmins.zip
    
    cd bbnadmins
    chmod 600 bbnadmins.key
    
  • One-time: use RENCI's DNS to look up the IPs we need to put into /etc/hosts (either do this from an SSH session on bbn-hn, or after connecting via VPN once):
    $ host bbn-w1.bbn.xo 192.168.100.2
    Using domain server:
    Name: 192.168.100.2
    Address: 192.168.100.2#53
    Aliases:
    
    bbn-w1.bbn.xo has address 192.168.103.101
    
    and use this to create the file:
    $ cat bbn.xo-hosts.txt
    
    # Static host entries for use with bbn.xo
    192.168.103.100 bbn-hn.bbn.xo
    192.168.103.101 bbn-w1.bbn.xo
    192.168.103.102 bbn-w2.bbn.xo
    192.168.103.103 bbn-w3.bbn.xo
    192.168.103.104 bbn-w4.bbn.xo
    
  • Finally, per-invocation, do startup:
    sudo kextload /opt/local/Library/Extensions/tun.kext
    cd ~/tmp/exogeni-vpn/bbnadmins
    sudo openvpn2 ./bbnadmins.conf
    sudo sh -c 'cat ./bbn.xo-hosts.txt >> /etc/hosts'
    
  • Now browse to http://bbn-w1.bbn.xo:
    • Login at the dialogue
    • Click: Continue
    • Note: the IMM is one place we can get the interface MACs if we ever need them
    • Tasks -> Remote Control -> Start Remote Control in Multi-User Mode
      • this launched the bbn-w1.bbn.xo Video Viewer (a Java Web Start app)
      • I was able to log in as chaos on that console
      • I was able to sudo on that console
    • IMM Control -> Port Assignments says which ports are open on this IMM (this is also in the config file)
    • I can go to http://bbn-w1.bbn.xo/page/ibmccontrol_configsummary.html to get a configuration summary, which I saved off by hand for future reference
    • I imagine backing up the config is a pretty safe alternative to viewing it, but I don't want to muck around too much
  • Logout
  • Now browse to http://bbn-w2.bbn.xo:
    • Login as before
    • Tasks -> Remote Control -> Start Remote Control in Multi-User Mode
    • IMM Control -> Configuration File -> view the current configuration summary, and make a copy
  • Now browse to http://bbn-w3.bbn.xo:
    • Login as before
    • Tasks -> Remote Control -> Start Remote Control in Multi-User Mode
    • IMM Control -> Configuration File -> view the current configuration summary, and make a copy
  • Now browse to http://bbn-w4.bbn.xo:
    • Login as before
    • Tasks -> Remote Control -> Start Remote Control in Multi-User Mode
      • the console here shows that bbn-w4 is at a PXE boot prompt
    • IMM Control -> Configuration File -> view the current configuration summary, and make a copy
      • Trying to get this config, I got a bunch of errors about trouble communicating with the IMM
  • Per-invocation, do shutdown:
    • ctrl-C to kill the openvpn connection
    • remove the kernel module and the lines from /etc/hosts:
      sudo kextunload /opt/local/Library/Extensions/tun.kext
      sudo vi /etc/hosts
      
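The shutdown step above removes the appended lines by hand-editing /etc/hosts. One alternative is to wrap the appended entries in begin/end markers so a single sed can strip them cleanly. A sketch against a scratch file (for real use, substitute /etc/hosts with sudo; the `-i` flag is GNU sed syntax, and Mac OS X needs `sed -i ''`):

```shell
# Sketch: append the bbn.xo entries between markers, then strip them cleanly.
# /tmp/hosts.demo stands in for /etc/hosts; only two sample entries are shown.
hosts=/tmp/hosts.demo
printf '127.0.0.1 localhost\n' > "$hosts"
{
  echo '# BEGIN bbn.xo'
  echo '192.168.103.100 bbn-hn.bbn.xo'
  echo '192.168.103.101 bbn-w1.bbn.xo'
  echo '# END bbn.xo'
} >> "$hosts"
sed -i '/^# BEGIN bbn\.xo$/,/^# END bbn\.xo$/d' "$hosts"
cat "$hosts"
```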

Tim's testing under Ubuntu 10.10:

  • Install and connect:
    sudo apt-get install openvpn
    sudo openvpn --config bbnadmins.ovpn
    
  • He reported that he could log in to the remote KVM for bbn-w1 (by directly using the IP address in his web browser, since he also did not have openvpn modify resolv.conf)

Still to test here:

  • Make sure I can actually get the config for bbn-w4, or complain about the timeout issue if it recurs
  • Do the check on bbn-hn

Steps taken on 2012-05-11
  • Per-invocation VPN startup:
    sudo kextload /opt/local/Library/Extensions/tun.kext
    cd ~/tmp/exogeni-vpn/bbnadmins
    sudo sh -c 'cat ./bbn.xo-hosts.txt >> /etc/hosts'
    sudo openvpn2 ./bbnadmins.conf
    
  • Now browse to http://bbn-w4.bbn.xo:
    • Login as before
    • IMM Control -> Configuration File -> view the current configuration summary, and make a copy
      • This time, the configuration eventually loaded with no trouble
  • Now browse to http://bbn-hn.bbn.xo:
    • Login as before
    • Tasks -> Remote Control -> Start Remote Control in Multi-User Mode
    • IMM Control -> Configuration File -> view the current configuration summary, and make a copy
  • Per-invocation VPN shutdown:
    • ctrl-C to kill the openvpn connection
    • remove the kernel module and the lines from /etc/hosts:
      sudo kextunload /opt/local/Library/Extensions/tun.kext
      sudo vi /etc/hosts
      

Step 3E: GPO gets access to allocated bare metal worker nodes by default

(This is GST 3354 item 2d.)

Prerequisites:

  • A bare metal node is available for allocation by xCAT
  • Someone has successfully allocated the node for a bare metal experiment

Using:

  • From bbn-hn, try to SSH into root on the allocated worker node

Verify:

  • We find out the IP address/hostname at which to reach the allocated worker node
  • We find out the location of the SSH private key on bbn-hn
  • Login using this SSH key succeeds.

Results of testing step 3E: 2012-07-05

  • Luisa has reserved bare-metal node bbn-w4 for her experiment
  • The hostname bbn-w4 resolves locally from bbn-hn:
    bbn-hn,[~],16:53(0)$ ssh bbn-w4
    The authenticity of host 'bbn-w4 (10.100.0.14)' can't be established.
    ...
    
  • Login does not succeed with my agent-cached SSH key, my "chaos" LDAP password (in xoadmins), or my "cgolubit" LDAP password (in bbnadmins):
    bbn-hn,[~],16:55(0)$ ssh bbn-w4
    chaos@bbn-w4's password:
    Permission denied, please try again.
    
    bbn-hn,[~],16:55(130)$ ssh cgolubit@bbn-w4
    cgolubit@bbn-w4's password:
    Permission denied, please try again.
    
  • I believe there is a shared SSH key, but I don't know where it is. I asked exogeni-design for it, and Ilia reported that it is in /opt/orca-12080 (orca-12080 is the ORCA AM that controls the rack).
  • That works:
    bbn-hn,[~],17:31(0)$ sudo ssh -i /opt/orca-12080/xcat/id_rsa root@bbn-w4
    [sudo] password for chaos:
    Last login: Thu Jul  5 17:30:40 2012 from 10.100.0.1
    [root@bbn-w4 ~]#
    
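Before relying on the shared key, it's worth a quick sanity check that the file exists and has strict permissions, since ssh will refuse a world-readable key. A sketch using a stand-in file (/tmp/id_rsa.demo) for /opt/orca-12080/xcat/id_rsa, which needs sudo to read on bbn-hn:

```shell
# Sketch: confirm a private key file is mode 0600 before handing it to ssh.
# /tmp/id_rsa.demo stands in for /opt/orca-12080/xcat/id_rsa (root-owned).
key=/tmp/id_rsa.demo
touch "$key"
chmod 600 "$key"
# GNU stat first, BSD stat as a fallback (my notes above were taken on OS X)
stat -c '%a' "$key" 2>/dev/null || stat -f '%Lp' "$key"
```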

Step 4: GPO inventories the rack based on our own processes

Step 4A: Inventory and label physical rack contents

(This covers GST 3354 items 3 and 7.)

Using:

Verify:

Results of testing step 4A: 2012-05-11

  • Physical objects found in the rack:
    41 management switch: 8052.bbn.xo
    40
    39 dataplane switch: 8264.bbn.xo
    38
    37
    36
    35
    34
    33
    32
    31
    30
    29
    28
    27 SSG5 (not mounted)
    26 ^   ^
    25 iSCSI
    24
    23
    22
    21
    20
    19 console
    18
    17
    16
    15
    14
    13
    12
    11
    10 ^      ^
    09 worker 4: bbn-w4.bbn.xo
    08 ^      ^
    07 worker 3: bbn-w3.bbn.xo
    06 ^      ^
    05 worker 2: bbn-w2.bbn.xo
    04 ^      ^
    03 worker 1: bbn-w1.bbn.xo
    02 ^  ^
    01 head: bbn-hn.bbn.xo
    
  • I was able to get the names of most objects (inlined above) from https://wiki.exogeni.net/doku.php?id=public:hardware:rack_layout. I found a few small issues:
    • Our iSCSI is mounted at U25-U26, where the diagram shows U24-U25. Is this an error in the diagram or an error in our rack? If the latter, should it be fixed, or just noted as a footnote on the rack diagram or elsewhere?
    • The SSG5 is not shown in that diagram. It's not rackmounted, but it is a networked device in the rack, so I think it would be good to label it so people can tell what it is.
    • No hostnames are given in the layout page for the iSCSI array, the SSG5, or the console. Of those, I think the iSCSI array and the SSG5 actually have hostnames (since they are networked), so let's agree on what they are.
  • Physical labelling of devices:
    • I assumed the names bbn-iscsi, bbn-ssg5, and bbn-console for the devices that don't have hostnames in ExoGENI's rack diagram, and will iterate later if needed.
    • I chose to label the SSG5 on the back only, because it doesn't take up an entire U, so the front is blocked by a panel, and turning the device around to label the other side seemed needlessly risky.
    • The switches don't really have any room on the front or back plates for labels, so I labelled them on the top edge of the front (I don't think anyone is ever going to look there), and on the bottom edge of the back (where I believe they are mounted at or above eye level for everyone who will be looking at the rack).
  • I updated our inventory without issue, and transferred a copy of my ASCII rack diagram (above) over to that page.

Results of testing step 4A: 2012-05-26

Answers to previous questions:

  • Our iSCSI is mounted at U25-U26, where the diagram shows U24-U25: U25-U26 is correct, and the diagram was fixed as part of exoticket:22.
  • The SSG5 is not shown in that diagram: the diagram was fixed as part of exoticket:22
  • No hostnames are given in the layout page for the iSCSI array, the SSG5, or the console: ExoGENI doesn't believe the iSCSI array has a hostname, but there is presumably a private IP in the 10.102.0.0/24 range which is reserved for it. I opened exoticket:33 for this.

Step 4B: Inventory rack power requirements

Using:

Verify:

  • We succeed in locating and documenting information about rack power circuits in use

Results of testing step 4B: 2012-05-11

  • For each of the six PDUs plugged into a circuit in our floor, the PDU cable is labelled on the floor end (e.g. "PDU3 ExoGENI"), but doesn't appear to be labelled on the PDU end. So I can't figure out which one is which, and this information doesn't seem to appear on the wiki. This is a blocker for now.

Results of testing step 4B: 2012-07-05

  • BBN facilities was able to inventory the circuits during circuit rewiring.
  • We now have accurate mappings of circuits to PDUs for the 6 ExoGENI PDUs which connect directly to BBN circuits, for the BBN UPS which connects to the other circuit, and for ExoGENI PDU 02 which connects to the BBN UPS.

Step 4C: Inventory rack network connections

Using:

Verify:

  • We are able to identify and determine all rack network connections and VLAN configurations
  • We are able to determine the OpenFlow configuration of the rack dataplane switch

Results of testing step 4C: 2012-05-11

  • The rack has three devices we would normally put in our connection inventory: bbn-ssg, 8052, and 8264.
  • Pieces of information I used in putting this inventory together:
  • What I did:
    • Looked through my notes while looking at the physical switch and at the switch running configuration, and tried to write down a list of what was connected to each interface and its VLAN configuration
    • Used MAC address tables to try to disambiguate things I wasn't sure about
    • Also used the stored configurations from the IPMI to disambiguate
  • Note: because LACP trunk interfaces report MAC address tables for the entire trunk rather than for each individual port, I was not able to figure out with certainty which interface connected to which on the two bbn-hn trunks. (I may be able to revisit this Monday by looking at cable labelling.)
  • Note: the blue cables (interfaces 1-9 on the 8052) were not actually labelled, so I needed to resort to MAC address tables to tell which was which (and of course couldn't differentiate between the iSCSI interfaces at all, since I don't have MAC address information for the iSCSI's two halves).
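To use the MAC address tables systematically for this kind of disambiguation, the port column can be extracted from a saved capture. A sketch: the column layout and the MACs below are assumptions, and would need adjusting to the real show mac-address-table output:

```shell
# Sketch: map learned MACs to ports from a saved table capture.
# The sample layout (and the MACs) are assumptions, not real 8052 output.
cat > /tmp/8052-macs.txt <<'EOF'
MAC-address        VLAN  Port  State
00:1a:64:aa:bb:01  1003  1     FWD
00:1a:64:aa:bb:02  1003  2     FWD
EOF
awk 'NR > 1 {print $3, $1}' /tmp/8052-macs.txt
```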

Results of testing step 4C: 2012-07-23

  • Last Friday, we inventoried all cabling by briefly unplugging cables and watching link lights.
  • Before signing off on this, I want to review the switch configs for VLAN memberships.
  • Hmm, but there's a problem there: those configs say what access VLAN each switch port belongs to, but don't say which VLANs are allowed on each trunk interface. I assume that must be defined somewhere, probably in show vlan information on each switch. This was blocked by exoticket:71, which is now resolved.

Step 4D: Verify government property accounting for the rack

(This is GST 3354 item 11.)

Using:

  • Receive a completed DD1149 form from RENCI
  • Receive and inventory a property tag number for the BBN ExoGENI rack

Verify:

  • The DD1149 paperwork is complete to BBN government property standards
  • We receive a single property tag for the rack, as expected

Step 5: Configure operational alerting for the rack

Step 5A: GPO installs active control network monitoring

(This is GST 3354 item 8.)

Using:

  • Add a monitored control network ping from ilian.gpolab.bbn.com to 192.1.242.2
  • Add a monitored control network ping from ilian.gpolab.bbn.com to 192.1.242.3
  • Add a monitored control network ping from ilian.gpolab.bbn.com to 192.1.242.4

Verify:

  • Active monitoring of the control network is successful
  • Each monitored IP is successfully available at least once

Results of testing step 5A: 2012-05-23

This monitoring was actually installed on 2012-03-04, and has been active since then. Results are at http://monitor.gpolab.bbn.com/connectivity/exogeni.html, and the IPs have been reachable for most of that period.
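The monitored pings are implemented in the GPO's own monitoring framework; as a minimal self-contained sketch of what one such check does (probing only 127.0.0.1 rather than the rack's control-network IPs, so that it can run anywhere):

```shell
#!/bin/sh
# Minimal sketch of a monitored control-network ping: one probe per target,
# with an UP/DOWN verdict. The real monitoring runs from ilian.gpolab.bbn.com
# against 192.1.242.2-4; this sketch probes 127.0.0.1 only.
check() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
        echo "UP $1"
    else
        echo "DOWN $1"
    fi
}

status=$(check 127.0.0.1)
echo "$status"
```

The real system additionally records each probe's result over time, which is what makes the "reachable for most of that period" statement above checkable.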

Step 5B: GPO installs active shared dataplane monitoring

(This is GST 3354 item 9.)

Using:

  • Add a monitored dataplane network ping from a lab dataplane test host on vlan 1750 to the rack dataplane
  • If necessary, add an OpenFlow controller to handle traffic for the monitoring subnet

Verify:

  • Active monitoring of the dataplane network is successful
  • The monitored IP is successfully available at least once

Results of testing step 5B: 2012-07-05

Step 5C: GPO gets access to nagios information about the BBN rack

(This is part of GST 3354 item 10.)

Using:

Verify:

  • Login succeeds
  • I can see a number of types of devices
  • I can click on a problem report and verify its details

Results of testing step 5C: 2012-05-23

  • I successfully logged into https://bbn-hn.exogeni.net/rack_bbn/ as cgolubit (my account with site admin privileges)
  • I see the following devices:
    bbn-hn.exogeni.net
    
    ssg5.bbn.xo
    
    bbn-hn.ipmi
    bbn-w1.ipmi
    bbn-w2.ipmi
    bbn-w3.ipmi
    bbn-w4.ipmi
    
    8052.bbn.xo
    8264.bbn.xo
    ssg5.bbn.xo
    
    bbn-w1.local
    bbn-w2.local
    bbn-w3.local
    
  • The tactical overview in the top left lists 5 problems.
  • I also notice that my login reports me as cgolubit (guest), while my chaos account lists me as chaos (admin). Josh reports that he shows up as jbs (admin).
    • In particular, on that service page I listed, the list of service contacts includes chaos, jbs, tupty (and a number of RENCI and Duke people), but does not include cgolubit. That means (I believe) that once e-mail notifications were enabled, cgolubit would not receive them.
    • Out of curiosity, I tried logging into the RENCI rack at https://rci-hn.exogeni.net/rack_rci/check_mk/.
      • As cgolubit, I can log in, and I see cgolubit (user)
      • As chaos, I can log in, and I see chaos (admin)
      • When Josh tried to log in, he got the error:
        Your username (jbs) is listed more than once in multisite.mk.
        This is not allowed. Please check your config.
        
    • I created exoticket:29 for these inconsistencies.

Results of testing step 5C: 2012-05-26

I'm tracking testing of the RCI login problem, which is still giving a weird error, on exoticket:29.

Returning to the question of how the multipath check works:

  • I am not able to figure out, either from that page or from poking around on bbn-hn, what command is being run to get that result. I will follow up via e-mail and ask.

Jonathan said:

  • All of the server-side infrastructure is in /opt/omd (symlink /omd).
  • It runs as the user 'rack_bbn', whose homedir is /omd/sites/rack_bbn
  • Most of the config is in /omd/sites/rack_bbn/etc/check_mk/conf.d
  • The multipath check specifically is in /omd/sites/rack_bbn/share/check_mk/checks/multipath

I got blocked here by not being able to successfully run cmk commands to find out what check_mk's current state is. Details in e-mail:

From: Chaos Golubitsky <chaos@bbn.com>
Date: Sat, 26 May 2012 11:09:48 -0400
To: exogeni-design@geni.net
Subject: Re: [exogeni-design] question about nagios/omd plugins

Results of testing step 5C: 2012-07-05

  • It is now possible for me to use sudo to become rack_bbn:
    bbn-hn,[~],18:57(0)$ sudo su - rack_bbn
    OMD[rack_bbn]:~$
    
  • As rack_bbn, I can list the check_mk information about bbn-hn.exogeni.net:
    OMD[rack_bbn]:~$ cmk -D bbn-hn.exogeni.net | head
    
    bbn-hn.exogeni.net (192.1.242.3)
    Tags:                   tcp, linux, nagios, hn
    Host groups:            linux, hn
    Contact groups:         admins, users
    Type of agent:          TCP (port: 6556)
    Is aggregated:          no
    ...
      multipath       360080e50002d03ac000002cc4f69a431 2    Multipath 360080e50002d03ac000002cc4f69a431
    ...
    

Results of testing step 5C: 2012-07-23

I did not repeat the entire test, but instead just verified that exoticket:29 has been resolved and members of bbnadmins now get consistent (guest login) behavior when browsing to the RCI rack nagios.

Step 5D: GPO receives e-mail about BBN rack nagios alerts

(This is part of GST 3354 item 10.)

Using:

  • Request e-mail notifications for BBN rack nagios to be sent to GPO ops
  • Collect a number of notifications
  • Inspect three representative messages

Verify:

  • E-mail messages about rack nagios are received
  • For each inspected message, I can determine:
    • The affected device
    • The affected service
    • The type of problem being reported
    • The duration of the outage

Results of testing step 5D: 2012-05-23

E-mail notification for the chaos user on BBN rack nagios was configured on 2012-04-26. As of 2012-05-23T12:00, I have received 5472 notification messages.

Investigation of representative messages:

Item 1: messages related to the bbn-hn.exogeni.net service "Multipath 360080e50002d03ac000002cc4f69a431":

  • On 2012-04-26, I received a problem report:
    From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
    Date: Thu, 26 Apr 2012 03:22:08 +0000
    To: chaos@bbn.com
    Subject: *** PROBLEM *** bbn-hn.exogeni.net / Multipath
     360080e50002d03ac000002cc4f69a431 is CRITICAL
    
    --SERVICE-ALERT-------------------
    -
    - Hostaddress: 192.1.242.3
    - Hostname:    bbn-hn.exogeni.net
    - Service:     Multipath 360080e50002d03ac000002cc4f69a431
    - - - - - - - - - - - - - - - - -
    - State:       CRITICAL
    - Date:        2012-04-26 03:22:08
    - Output:      CRIT - (mpathb) paths expected: 4, paths active: 2
    -
    ----------------------------------
    
  • On 2012-05-23, I received the recovery report:
    From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
    Date: Wed, 23 May 2012 14:38:46 +0000
    To: chaos@bbn.com
    Subject: *** RECOVERY *** bbn-hn.exogeni.net / Multipath
     360080e50002d03ac000002cc4f69a431 is OK
    
    --SERVICE-ALERT-------------------
    -
    - Hostaddress: 192.1.242.3
    - Hostname:    bbn-hn.exogeni.net
    - Service:     Multipath 360080e50002d03ac000002cc4f69a431
    - - - - - - - - - - - - - - - - -
    - State:       OK
    - Date:        2012-05-23 14:38:46
    - Output:      OK - (mpathb) paths expected: 2, paths active: 2
    -
    ----------------------------------
    
  • The service history shows no other entries for this service: https://bbn-hn.exogeni.net/rack_bbn/check_mk/view.py?host=bbn-hn.exogeni.net&site=&service=Multipath%20360080e50002d03ac000002cc4f69a431&view_name=svcevents
  • The service recovered because Jonathan fixed an inaccurate check which was looking for 4 paths when it should have been looking for 2 paths.
  • The implication of these notices is that nagios sends notifications only when a service's state changes, and does not repeat notifications when a service remains in an unhealthy state.
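That inferred behavior, notify only on a state transition, can be sketched as comparing each new check result against the last recorded state. The state file and service name below are made up for illustration:

```shell
#!/bin/sh
# Sketch of notify-on-state-change: a message goes out only when a service's
# state differs from its last recorded state, so a service sitting in
# CRITICAL for weeks produces one PROBLEM mail and one RECOVERY mail.
statefile=$(mktemp)

notify_on_change() {  # args: <service> <new-state>
    svc=$1 new=$2
    old=$(cat "$statefile")
    if [ "$new" != "$old" ]; then
        echo "NOTIFY: $svc is $new"
    fi
    echo "$new" > "$statefile"
}

log=$(
    notify_on_change Multipath CRITICAL   # transition -> notification
    notify_on_change Multipath CRITICAL   # still CRITICAL -> silence
    notify_on_change Multipath OK         # recovery -> notification
)
echo "$log"
rm -f "$statefile"
```

This matches the observed mail pattern above: one PROBLEM message on 2012-04-26 and one RECOVERY message on 2012-05-23, with nothing in between.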

Item 2: messages related to the 8052.bbn.xo services "Interface Ethernet30" and "Interface Ethernet40":

  • Approximately 50 times a day during the period of e-mail sending, we have gotten four messages like these:
    • 8052/Ethernet40 is down:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 16:29:52 +0000
      To: chaos@bbn.com
      Subject: *** PROBLEM *** 8052.bbn.xo / Interface Ethernet40 is CRITICAL
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 192.168.103.2
      - Hostname:    8052.bbn.xo
      - Service:     Interface Ethernet40
      - - - - - - - - - - - - - - - - -
      - State:       CRITICAL
      - Date:        2012-05-23 16:29:52
      - Output:      CRIT - [168] (down)(!!) assuming 1GBit/s, in: 0.00B/s(0.0%), out: 68.91B/s(0.0%)
      -
      ----------------------------------
      
    • 8052/Ethernet30 is down:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 16:30:52 +0000
      To: chaos@bbn.com
      Subject: *** PROBLEM *** 8052.bbn.xo / Interface Ethernet30 is CRITICAL
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 192.168.103.2
      - Hostname:    8052.bbn.xo
      - Service:     Interface Ethernet30
      - - - - - - - - - - - - - - - - -
      - State:       CRITICAL
      - Date:        2012-05-23 16:30:52
      - Output:      CRIT - [158] (down)(!!) assuming 1GBit/s, in: 0.00B/s(0.0%), out: 32.07B/s(0.0%)
      -
      ----------------------------------
      
    • 8052/Ethernet40 is back up:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 16:31:52 +0000
      To: chaos@bbn.com
      Subject: *** RECOVERY *** 8052.bbn.xo / Interface Ethernet40 is OK
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 192.168.103.2
      - Hostname:    8052.bbn.xo
      - Service:     Interface Ethernet40
      - - - - - - - - - - - - - - - - -
      - State:       OK
      - Date:        2012-05-23 16:31:52
      - Output:      OK - [168] (up) 1GBit/s, in: 29.71B/s(0.0%), out: 73.21B/s(0.0%)
      -
      ----------------------------------
      
    • 8052/Ethernet30 is back up:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 16:31:52 +0000
      To: chaos@bbn.com
      Subject: *** RECOVERY *** 8052.bbn.xo / Interface Ethernet30 is OK
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 192.168.103.2
      - Hostname:    8052.bbn.xo
      - Service:     Interface Ethernet30
      - - - - - - - - - - - - - - - - -
      - State:       OK
      - Date:        2012-05-23 16:31:52
      - Output:      OK - [158] (up) 1GBit/s, in: 0.00B/s(0.0%), out: 30.94B/s(0.0%)
      -
      ----------------------------------
      
  • Based on the information I was able to put together in our connection inventory so far while doing step 4C, it appears that Ethernet30 is bbn-w4[eth1], and Ethernet40 is bbn-w4[eth0]. We know that bbn-w4 is the xCAT worker, and apparently it resets every time it can't PXE boot. So that's what all of that is about.

Item 3: messages related to the bbn-w1.local service "Check_MK inventory":

  • Approximately twice a day, I've been getting:
    • A check_mk inventory warning:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 17:11:32 +0000
      To: chaos@bbn.com
      Subject: *** PROBLEM *** bbn-w1.local / Check_MK inventory is WARNING
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 10.100.0.11
      - Hostname:    bbn-w1.local
      - Service:     Check_MK inventory
      - - - - - - - - - - - - - - - - -
      - State:       WARNING
      - Date:        2012-05-23 17:11:32
      - Output:      WARNING - 3 unchecked services (lnx_if:2, qemu:1)
      -              lnx_if: Interface vnet2
      lnx_if: Interface vnet3
      qemu: VM i-00000556
      
      ----------------------------------
      
    • A check_mk inventory recovery:
      From: rack_bbn@bbn-hn.exogeni.net (OMD site rack_bbn)
      Date: Wed, 23 May 2012 17:13:32 +0000
      To: chaos@bbn.com
      Subject: *** RECOVERY *** bbn-w1.local / Check_MK inventory is OK
      
      --SERVICE-ALERT-------------------
      -
      - Hostaddress: 10.100.0.11
      - Hostname:    bbn-w1.local
      - Service:     Check_MK inventory
      - - - - - - - - - - - - - - - - -
      - State:       OK
      - Date:        2012-05-23 17:13:32
      - Output:      OK - no unchecked services found
      -
      ----------------------------------
      
  • I don't know what this is, and the service page (e.g. https://bbn-hn.exogeni.net/rack_bbn/check_mk/view.py?view_name=service&site=&service=Check_MK%20inventory&host=bbn-w1.local) doesn't have a "check manual" defined. I expect, though, that some code identifies new VMs and runs Check_MK checks on them, and that whenever a new VM comes online or goes away, Check_MK is out of sync for a little while.
  • Jonathan verified via e-mail that my guess was correct.
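Under that explanation, the "Check_MK inventory" warning amounts to a set difference between the saved service inventory and the services currently present on the host. A sketch, with placeholder service names modeled on the notification above:

```shell
#!/bin/sh
# Sketch of what the "Check_MK inventory" service appears to check: services
# present on the host now vs. the saved inventory; anything new is
# "unchecked". Service names are placeholders, not real rack data.
LC_ALL=C; export LC_ALL
tmp=$(mktemp -d)

cat > "$tmp/inventoried" <<'EOF'
lnx_if:eth0
lnx_if:eth1
EOF

cat > "$tmp/current" <<'EOF'
lnx_if:eth0
lnx_if:eth1
lnx_if:vnet2
qemu:i-00000556
EOF

# comm(1) needs sorted input; -13 keeps lines found only in the second file
new=$(comm -13 "$tmp/inventoried" "$tmp/current")
if [ -n "$new" ]; then
    verdict="WARNING - $(echo "$new" | wc -l | tr -d ' ') unchecked services"
else
    verdict="OK - no unchecked services found"
fi
echo "$verdict"
rm -r "$tmp"
```

When the inventory is re-run and picks up the new VM's services, the difference empties out and the service flips back to OK, which is the PROBLEM/RECOVERY pair seen twice a day above.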

However, it turns out that allowing me to receive e-mail from the rack depends on a temporary configuration which suppresses synchronization of nagios users with LDAP changes (see exoticket:29). So I created exoticket:30 for a permanent solution to that problem, and that ticket now blocks this step.

Results of testing step 5D: 2012-07-05

Based on my e-mail archives, I believe that nagios stopped sending e-mail about the Check_MK inventory issue and the Ethernet30/Ethernet40 up/down issue on 7 June. So the nagios alerts are now substantially less noisy.

Results of testing step 5D: 2012-07-23

No further testing. We verified that exoticket:30 is resolved, and I think that's it for this test.

Step 6: Setup contact info and change control procedures

Step 6A: ExoGENI operations staff should subscribe to response-team

(This is part of GST 3354 item 12.)

Using:

  • Ask ExoGENI operators to subscribe exogeni-ops@renci.org (or individual operators) to response-team@geni.net

Verify:

  • This subscription has happened. On daulis:
    sudo -u mailman /usr/lib/mailman/bin/find_member -l response-team exogeni-ops
    

Results of testing step 6A: 2012-05-23

On daulis:

daulis,[~],14:22(0)$ sudo -u mailman /usr/lib/mailman/bin/find_member -l response-team exogeni-ops
[sudo] chaos's password on daulis:
exogeni-ops@renci.org found in:
     response-team

Step 6B: ExoGENI operations staff should provide contact info to GMOC

(This is part of GST 3354 item 12.)

Using:

  • Ask ExoGENI operators to submit primary and secondary e-mail and phone contact information to GMOC

Verify:

  • Browse to https://gmoc-db.grnoc.iu.edu/protected/, login, and look at the "organizations" table. Make sure either:
    • The RENCI contact information is up-to-date and includes exogeni-ops and some reasonable phone numbers
    • A new ExoGENI contact has been added

Results of testing step 6B: 2012-05-23

I don't have access to GMOC's contact database. Instead, I asked GMOC to verify:

  • The primary contact e-mail is exogeni-ops@renci.org
  • The secondary contact is a person's e-mail address
  • There is a phone number associated with one contact

Kevin Bohan of GMOC checked and was able to verify these things for me.

Step 6C: Negotiate an interim change control notification procedure

(This is GST 3354 item 6.)

Using:

Verify:

  • ExoGENI agrees to send notifications about planned outages and changes.

Results of testing step 6C: 2012-05-23

ExoGENI has agreed to notify exogeni-design or gpo-infra when there are outages.

We will want to revisit this test when GMOC has workflows in place to handle notifications for rack outages, and before there are additional rack sites and users who may need to be notified.

Results of testing step 6C: 2013-02-14

Eldar confirmed via e-mail that the long-term plan is set and working:

From: "Urumbaev, Eldar" <eurumbae@indiana.edu>
To: Josh Smift <jbs@bbn.com>, "exogeni-design@geni.net"
        <exogeni-design@geni.net>
Subject: Re: [exogeni-design] Outage reporting to GMOC
Date: Thu, 14 Feb 2013 13:33:20 +0000

Hi Josh,

We are all synched up. GMOC is subscribed to the [GENI-ORCA-USERS]
geni-orca-users@googlegroups.com mailing list. The ExoGENI team has been
putting [OUTAGE] or [MAINTENANCE] in subject line to help identify events
that are relevant to us and require our action for tracking rack
outages/maintenances. This has been working ok so far.

Thanks,

Eldar

So, this is all set.