[[PageOutline]]

= Detailed test plan for IG-MON-3: GENI Active Experiment Inspection Test =

''This page is GPO's working page for performing IG-MON-3. It is public for informational purposes, but it is not an official status report. See [wiki:GENIRacksHome/InstageniRacks/AcceptanceTestStatus] for the current status of InstaGENI acceptance tests.''

''Last substantive edit of this page: 2012-05-18''

== Page format ==

 * The status chart summarizes the state of this test
 * The high-level description from test plan contains text copied exactly from the public test plan and acceptance criteria pages.
 * The steps contain things i will actually do/verify:
   * Steps may be composed of related substeps where i find this useful for clarity
   * Each step is either a preparatory step (identified by "(prep)") or a verification step (the default):
     * Preparatory steps are just things we have to do. They're not tests of the rack, but are prerequisites for subsequent verification steps
     * Verification steps are steps in which we will actually look at rack output and make sure it is as expected. They contain a '''Using:''' block, which lists the steps to run the verification, and a '''Verify:''' block, which lists what outcome is expected for the test to pass.

== Status of test ==

|| '''Step''' || '''State''' || '''Date completed''' || '''Tickets''' || '''Comments''' ||
|| 1 || [[Color(yellow,Completed)]] || || || needs retesting when 3 is retested ||
|| 2 || || || || needs retesting when 3 is retested ||
|| 3 || [[Color(yellow,Completed)]] || || || needs retesting once OpenFlow resources are available from InstaGENI AM ||
|| 4 || [[Color(orange,Blocked)]] || || instaticket:26 || blocked on resolution of MAC reporting issue ||
|| 5 || [[Color(orange,Blocked)]] || || || blocked on 3 ||
|| 6 || [[Color(orange,Blocked)]] || || || blocked on 3, availability of FOAM ||
|| 7 || [[Color(orange,Blocked)]] || || || blocked on 3 ||
|| 8 || [[Color(orange,Blocked)]] || || || blocked on 3 ||

== High-level description from test plan ==

This test inspects the state of the rack data plane and control networks when experiments are running, and verifies that a site administrator can find information about running experiments.

=== Procedure ===

 * An experimenter from the GPO starts up experiments to ensure there is data to look at:
   * An experimenter runs an experiment containing at least one rack OpenVZ VM, and terminates it.
   * An experimenter runs an experiment containing at least one rack OpenVZ VM, and leaves it running.
 * A site administrator uses available system and experiment data sources to determine current experimental state, including:
   * How many VMs are running and which experimenters own them
   * How many physical hosts are in use by experiments, and which experimenters own them
   * How many VMs were terminated within the past day, and which experimenters owned them
   * What !OpenFlow controllers the data plane switch, the rack !FlowVisor, and the rack FOAM are communicating with
 * A site administrator examines the switches and other rack data sources, and determines:
   * What MAC addresses are currently visible on the data plane switch and what experiments do they belong to?
   * For some experiment which was terminated within the past day, what data plane and control MAC and IP addresses did the experiment use?
   * For some experimental data path which is actively sending traffic on the data plane switch, do changes in interface counters show approximately the expected amount of traffic into and out of the switch?

=== Criteria to verify as part of this test ===

 * VII.09. A site administrator can determine the MAC addresses of all physical host interfaces, all network device interfaces, all active experimental VMs, and all recently-terminated experimental VMs. (C.3.f)
 * VII.11. A site administrator can locate current configuration of flowvisor, FOAM, and any other OpenFlow services, and find logs of recent activity and changes. (D.6.a)
 * VII.18. Given a public IP address and port, an exclusive VLAN, a sliver name, or a piece of user-identifying information such as e-mail address or username, a site administrator or GMOC operator can identify the email address, username, and affiliation of the experimenter who controlled that resource at a particular time.
   (D.7)

== Step 1 (prep): start a VM experiment and terminate it ==

 * An experimenter requests an experiment from the InstaGENI AM containing two rack VMs and a dataplane VLAN
 * The experimenter logs into a VM, and sends dataplane traffic
 * The experimenter terminates the experiment

=== Results of testing: 2012-05-18 ===

 * I'll use the following rspec to get two VMs:
{{{
jericho,[~],05:29(0)$ cat IG-MON-nodes-C.rspec
}}}
 * Then create a slice:
{{{
omni createslice ecgtest2
}}}
 * Then create a sliver using that rspec:
{{{
jericho,[~],05:31(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am createsliver ecgtest2 ~/IG-MON-nodes-C.rspec
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
ERROR:omni.protogeni:Call for Get Slice Cred for slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 failed.: Exception: PG Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 does not exist.
ERROR:omni.protogeni: ..... Run with --debug for more information
ERROR:omni:Cannot create sliver urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2: Could not get slice credential: Exception: PG Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 does not exist.
}}}
 * It looks like the slice just wasn't ready yet. Trying again after a minute, the same thing worked:
{{{
jericho,[~],05:31(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am createsliver ecgtest2 ~/IG-MON-nodes-C.rspec
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 expires on 2012-05-19 10:30:51 UTC
INFO:omni:Creating sliver(s) from rspec file /home/chaos/IG-MON-nodes-C.rspec for slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2
INFO:omni:Asked http://www.utah.geniracks.net/protogeni/xmlrpc/am to reserve resources.
Result:
INFO:omni:
INFO:omni:
INFO:omni:
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed createsliver:
  Options as run:
    aggregate: http://www.utah.geniracks.net/protogeni/xmlrpc/am
    configfile: /home/chaos/omni/omni_pgeni
    framework: pg
    native: True
  Args: createsliver ecgtest2 /home/chaos/IG-MON-nodes-C.rspec
  Result Summary: Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 expires on 2012-05-19 10:30:51 UTC
Reserved resources on http://www.utah.geniracks.net/protogeni/xmlrpc/am.
INFO:omni: ============================================================
}}}
 * According to sliverstatus, my nodes are:
{{{
pc2.utah.geniracks.net port 30266
pc5.utah.geniracks.net port 30266
}}}
 * However, pc2 needs to run frisbee before this is ready. Wait awhile.
 * Login to pc2.utah.geniracks.net on port 30266 with agent forwarding
 * Find that it is virt1 and has eth1=10.10.1.1
 * Find a big file:
{{{
[chaos@virt1 ~]$ ls -l /usr/lib/locale/locale-archive-rpm
-rw-r--r-- 1 root root 99154656 May 20 2011 /usr/lib/locale/locale-archive-rpm
}}}
 * Copy the big file over the dataplane:
{{{
[chaos@virt1 ~]$ scp /usr/lib/locale/locale-archive 10.10.1.2:/tmp/
The authenticity of host '10.10.1.2 (10.10.1.2)' can't be established.
RSA key fingerprint is 6d:1d:76:53:a5:25:99:39:e2:89:ea:b0:99:e3:d3:b9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.10.1.2' (RSA) to the list of known hosts.
locale-archive                                100%   95MB  11.8MB/s   00:08
}}}
 * Look at the ARP tables on virt1 and virt2:
{{{
[chaos@virt1 ~]$ /sbin/arp -a
virt2-virt1-virt2-0 (10.10.1.2) at 82:02:0a:0a:01:02 [ether] on mv1.1
pc2.utah.geniracks.net (155.98.34.12) at 00:01:ac:11:02:01 [ether] on eth999
boss.utah.geniracks.net (155.98.34.4) at 00:01:ac:11:02:01 [ether] on eth999
[chaos@virt1 ~]$ ssh 10.10.1.2
Last login: Fri May 18 13:35:41 2012 from capybara.bbn.com
[chaos@virt2 ~]$ /sbin/arp -a
virt1-virt1-virt2-0 (10.10.1.1) at 82:01:0a:0a:01:01 [ether] on mv2.2
boss.utah.geniracks.net (155.98.34.4) at 00:01:ac:11:05:02 [ether] on eth999
pc5.utah.geniracks.net (155.98.34.15) at 00:01:ac:11:05:02 [ether] on eth999
}}}
 * Delete the sliver:
{{{
jericho,[~],05:53(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am deletesliver ecgtest2
}}}

== Step 2 (prep): start a bare metal node experiment and terminate it ==

 * An experimenter requests an experiment from the InstaGENI AM containing two rack hosts and a dataplane VLAN
 * The experimenter logs into a host, and sends dataplane traffic
 * The experimenter terminates the experiment

=== Results of testing: 2012-05-18 ===

 * Here is an rspec for two physical nodes with no OS specified:
{{{
jericho,[~],05:39(0)$ cat IG-MON-nodes-D.rspec
}}}
 * Create a slice for this experiment:
{{{
omni createslice ecgtest3
}}}
 * Create a sliver using this rspec:
{{{
jericho,[~],05:40(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am createsliver ecgtest3 ~/IG-MON-nodes-D.rspec
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest3 expires on 2012-05-19 10:40:34 UTC
INFO:omni:Creating sliver(s) from rspec file /home/chaos/IG-MON-nodes-D.rspec for slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest3
INFO:omni:Asked http://www.utah.geniracks.net/protogeni/xmlrpc/am to reserve resources.
Result:
INFO:omni:
INFO:omni:
INFO:omni:
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed createsliver:
  Options as run:
    aggregate: http://www.utah.geniracks.net/protogeni/xmlrpc/am
    configfile: /home/chaos/omni/omni_pgeni
    framework: pg
    native: True
  Args: createsliver ecgtest3 /home/chaos/IG-MON-nodes-D.rspec
  Result Summary: Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest3 expires on 2012-05-19 10:40:34 UTC
Reserved resources on http://www.utah.geniracks.net/protogeni/xmlrpc/am.
INFO:omni: ============================================================
}}}
 * According to sliverstatus, my nodes are pc1 and pc4.
 * Login to pc1.utah.geniracks.net with agent forwarding
 * Find that it is phys2 and has eth1=10.10.1.2
 * Find a big file:
{{{
[chaos@phys2 ~]$ ls -l /usr/lib/locale/locale-archive
-rw-r--r-- 1 root root 104997424 Aug 10 2011 /usr/lib/locale/locale-archive
}}}
 * Copy the big file over the dataplane in a loop:
{{{
[chaos@phys2 ~]$ while [ 1 ]; do scp /usr/lib/locale/locale-archive 10.10.1.1:/tmp/; done
locale-archive                                100%  100MB  50.1MB/s   00:02
locale-archive                                100%  100MB  50.1MB/s   00:02
locale-archive                                100%  100MB  50.1MB/s   00:02
...
}}}
 * After a bit of that, delete the sliver:
{{{
jericho,[~],05:53(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am deletesliver ecgtest3
}}}

== Step 3 (prep): start an experiment and leave it running ==

 * An experimenter requests an experiment from the InstaGENI AM containing two rack VMs connected by an OpenFlow-controlled dataplane VLAN
 * The experimenter configures a simple OpenFlow controller to pass dataplane traffic between the VMs
 * The experimenter logs into one VM, and begins sending a continuous stream of dataplane traffic

=== Results of testing: 2012-05-18 ===

''Note: per discussion on instageni-design on 2012-05-17, request of an OpenFlow-controlled dataplane is not yet possible.
So this will need to be retested once OpenFlow control is available.''

 * Not creating a new experiment here, but instead reusing my experiment, ecgtest, created yesterday for `IG-MON-1`.
 * Login to pc3, whose eth1 is 10.10.1.1
 * Make a bigger dataplane file by catting the other a few times, then start copying it around again:
{{{
[chaos@phys1 ~]$ ls -l /tmp/locale-archive
-rw-r--r-- 1 chaos pgeni-gpolab-bbn 3149922720 May 18 04:14 /tmp/locale-archive
while [ 1 ]; do scp /tmp/locale-archive 10.10.1.2:/tmp/; done
}}}
 * This lets me see that the first instance of the file copy takes about a minute, at about 55MBps:
{{{
[chaos@phys1 ~]$ while [ 1 ]; do scp /tmp/locale-archive 10.10.1.2:/tmp/; done
locale-archive                                100% 3004MB  55.6MB/s   00:54
}}}
 * Leave this running.

== Step 4: view running VMs ==

'''Using:'''
 * On boss, use AM state, logs, or administrator interfaces to determine:
   * What experiments are running right now
   * How many VMs are allocated for those experiments
   * Which OpenVZ node is each VM running on
 * On OpenVZ nodes, use system state, logs, or administrative interfaces to determine what VMs are running right now, and look at any available configuration or logs of each.

'''Verify:'''
 * A site administrator can determine what experiments are running on the InstaGENI AM
 * A site administrator can determine the mapping of VMs to active experiments
 * A site administrator can view some state of running VMs on the VM server

=== Results of testing: 2012-05-18 ===

 * Per-host view of current state:
   * From [https://boss.utah.geniracks.net/nodecontrol_list.php3?showtype=dl360] in red dot mode, i can once again see that pc3 is allocated as phys1 to `pgeni-gpolab-bbn-com/ecgtest`.
   * I can see that pc5 is configured as an OpenVZ shared host, but i can't see how many experiments it is running.
 * Per-experiment view of current state:
   * Browse to [https://boss.utah.geniracks.net/genislices.php] and find one slice running on the Component Manager:
{{{
ID   HRN                          Created              Expires
362  bbn-pgeni.ecgtest (ecgtest)  2012-05-17 08:12:37  2012-05-18 18:00:00
}}}
   * Click `(ecgtest)` to view the details of that experiment at [https://boss.utah.geniracks.net/showexp.php3?experiment=363#details].
   * This shows what nodes it's using, including that its VM has been put on pc5:
{{{
Physical Node Mapping:
ID              Type         OS              Physical
--------------- ------------ --------------- ------------
phys1           dl360        FEDORA15-STD    pc3
virt1           pcvm         OPENVZ-STD      pcvm5-1 (pc5)
}}}
   * Here are some other interesting things:
{{{
IP Port allocation:
Low             High
--------------- ------------
30000           30255

SSHD Port allocation ('ssh -p portnum'):
ID              Port       SSH command
--------------- ---------- ----------------------

Physical Lan/Link Mapping:
ID              Member          IP              MAC                  NodeID
--------------- --------------- --------------- -------------------- ---------
phys1-virt1-0   phys1:0         10.10.1.1       e8:39:35:b1:4e:8a    pc3
                                                1/1 <-> 1/34         procurve2
phys1-virt1-0   virt1:0         10.10.1.2                            pcvm5-1
}}}
   * That last one is mysterious, because the experimenter's sliverstatus command contains:
{{{
{ 'attributes': { 'client_id': 'phys1:if0',
                  'component_id': 'urn:publicid:IDN+utah.geniracks.net+interface+pc3:eth1',
                  'mac_address': 'e83935b14e8a',
...
{ 'attributes': { 'client_id': 'virt1:if0',
                  'component_id': 'urn:publicid:IDN+utah.geniracks.net+interface+pc5:eth1',
                  'mac_address': '00000a0a0102',
}}}
   * So i think it should be possible for the admin interface to know that virtual mac address too.
   * Huh, but also, that mac address reported in sliverstatus is in fact wrong.
     Let me summarize:
{{{
MAC addrs reported for phys1:0 == 10.10.1.1
  E8:39:35:B1:4E:8A: from /sbin/ifconfig eth1 run on phys1 (authoritative)
  e83935b14e8a: from sliverstatus as experimenter (correct)
  e8:39:35:b1:4e:8a: from: https://boss.utah.geniracks.net/showexp.php3?experiment=363#details (correct)

MAC addrs reported for virt1:0 == 10.10.1.2
  82:01:0A:0A:01:02: from /sbin/ifconfig mv1.1 run on virt1 (authoritative)
  00000a0a0102: from sliverstatus as experimenter (incorrect: first four digits are wrong)
  - : from https://boss.utah.geniracks.net/showexp.php3?experiment=363#details (not reported)
}}}
     I opened [instaticket:26] for this issue.
 * Now, use the OpenVZ host itself to view activity:
   * As an admin, login to pc5.utah.geniracks.net
   * Poking around, i was led to a couple of prospective data sources:
     * Logs in `/var/emulab`
     * The `vzctl` RPM, containing a number of OpenVZ control commands
   * The latter seems to give a list of running VMs easily:
{{{
vhost1,[/var/emulab],05:00(1)$ sudo vzlist -a
      CTID      NPROC STATUS    IP_ADDR         HOSTNAME
         1         15 running   -               virt1.ecgtest.pgeni-gpolab-bbn-com.utah.geniracks.net
}}}
   * I also see a command to figure out which container is running a given PID. Suppose i run top and am concerned about an sshd process chewing up all system CPU:
{{{
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
51817 20001     20   0  116m 3780  872 R 94.4  0.0   0:05.74 sshd
}}}
   * Since the user is numeric, i can assume this process is probably running in a container, so find out which one:
{{{
vhost1,[/var/emulab],05:05(0)$ sudo vzpid 51766
Pid     CTID    Name
51766   1       sshd
chaos    51804 51163  0 05:04 pts/0    00:00:00 grep --color=auto ssh
}}}
   * and then look up the container info as above.
   * The files in `/var/emulab` give details about how each experiment was created.
     In particular:
{{{
Information about experiment startup attributes:
  /var/emulab/boot/tmcc.pcvm5-1/
  /var/emulab/boot/tmcc.pcvm5-2/
Logs of experiment progress:
  /var/emulab/logs/tbvnode-pcvm5-1.log
  /var/emulab/logs/tbvnode-pcvm5-2.log
  /var/emulab/logs/tmccproxy.pcvm5-1.log
  /var/emulab/logs/tmccproxy.pcvm5-2.log
}}}
   * These may be useful for running and terminated experiments ''if'' the context IDs are unique.

==== Side test: are experiment context IDs unique over time on an OpenVZ server? ====

 * rspec to create a single OpenVZ container:
{{{
jericho,[~],07:12(0)$ cat IG-MON-nodes-E.rspec
}}}
 * use existing slice `ecgtest2` to create a sliver:
{{{
jericho,[~],07:13(0)$ omni -a http://www.utah.geniracks.net/protogeni/xmlrpc/am createsliver ecgtest2 IG-MON-nodes-E.rspec
INFO:omni:Loading config file /home/chaos/omni/omni_pgeni
INFO:omni:Using control framework pg
INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 expires within 1 day on 2012-05-19 10:30:51 UTC
INFO:omni:Creating sliver(s) from rspec file IG-MON-nodes-E.rspec for slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2
INFO:omni:Asked http://www.utah.geniracks.net/protogeni/xmlrpc/am to reserve resources. Result:
INFO:omni:
INFO:omni:
INFO:omni:
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed createsliver:
  Options as run:
    aggregate: http://www.utah.geniracks.net/protogeni/xmlrpc/am
    configfile: /home/chaos/omni/omni_pgeni
    framework: pg
    native: True
  Args: createsliver ecgtest2 IG-MON-nodes-E.rspec
  Result Summary: Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+ecgtest2 expires within 1 day(s) on 2012-05-19 10:30:51 UTC
Reserved resources on http://www.utah.geniracks.net/protogeni/xmlrpc/am.
INFO:omni: ============================================================
}}}

Summary: this means that VM IDs are reused.
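
The MAC comparison made for [instaticket:26] earlier in this step mixes three notations (upper-case with colons from `ifconfig`, bare lower-case hex from sliverstatus, lower-case with colons from the web UI). A minimal sketch of how that comparison could be automated follows. The helper names `normalize_mac` and `check_against_authoritative` are hypothetical, and the only data used is what the summary above already lists.

```python
import re


def normalize_mac(raw):
    """Return a MAC address as 12 lower-case hex digits, or raise ValueError."""
    # Strip the usual separators (colon, dash, dot) and fold case.
    digits = re.sub(r"[:.\-]", "", raw.strip()).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError("not a MAC address: %r" % raw)
    return digits


def check_against_authoritative(authoritative, reported):
    """Compare each reported MAC with the authoritative one.

    `reported` maps a source name to the raw string it gave, or to None when
    the source did not report a MAC at all.  The result maps each source name
    to True (matches), False (differs) or None (not reported).
    """
    want = normalize_mac(authoritative)
    result = {}
    for source, raw in reported.items():
        result[source] = None if raw is None else normalize_mac(raw) == want
    return result


# Values copied from the summary above; the ifconfig value is authoritative.
phys1 = check_against_authoritative(
    "E8:39:35:B1:4E:8A",
    {"sliverstatus": "e83935b14e8a", "showexp": "e8:39:35:b1:4e:8a"},
)
virt1 = check_against_authoritative(
    "82:01:0A:0A:01:02",
    {"sliverstatus": "00000a0a0102", "showexp": None},
)
print("phys1:0", phys1)
print("virt1:0", virt1)
```

Distinguishing "differs" from "not reported" keeps the two problems in the ticket apart: sliverstatus gives a wrong value for virt1, while the experiment details page gives none.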

At this point, i was going to gather more information about logs, when the Utah rack became totally unavailable: i was no longer able to use my shell sessions to any machines in the rack, and got ping timeouts to boss. After about 8 minutes, things became available again. I went looking for logs of my dataplane file copy activity to see whether the dataplane had been interrupted, at which point i found out that sshd on the dataplane does not appear to be logged anywhere, either in `/var/log` within the container or on pc5 itself. That's not a rack requirement, but it seems non-ideal for experimenters. I opened [instaticket:27] to report it.

== Step 5: get information about terminated experiments ==

'''Using:'''
 * On boss, use AM state, logs, or administrator interfaces to find evidence of the two terminated experiments.
 * Determine how many other experiments were run in the past day.
 * Determine which GENI user created each of the terminated experiments.
 * Determine the mapping of experiments to OpenVZ or exclusive hosts for each of the terminated experiments.
 * Determine the control and dataplane MAC addresses assigned to each VM in each terminated experiment.
 * Determine any IP addresses assigned by InstaGENI to each VM in each terminated experiment.

'''Verify:'''
 * A site administrator can get ownership and resource allocation information for recently-terminated experiments which used OpenVZ VMs.
 * A site administrator can get ownership and resource allocation information for recently-terminated experiments which used physical hosts.
 * A site administrator can get information about MAC addresses and IP addresses used by recently-terminated experiments.

== Step 6: get !OpenFlow state information ==

'''Using:'''
 * On the dataplane switch, get a list of controllers, and see if any additional controllers are serving experiments.
 * On the flowvisor VM, get a list of active FV slices from the !FlowVisor
 * On the FOAM VM, get a list of active slivers from FOAM
 * Use FV, FOAM, or the switch to list the flowspace of a running !OpenFlow experiment.

'''Verify:'''
 * A site administrator can get information about the !OpenFlow resources used by running experiments.
 * When an !OpenFlow experiment is started by InstaGENI, a new controller is added directly to the switch.
 * No new !FlowVisor slices are added for new !OpenFlow experiments started by InstaGENI.
 * No new FOAM slivers are added for new !OpenFlow experiments started by InstaGENI.

== Step 7: verify MAC addresses on the rack dataplane switch ==

'''Using:'''
 * Establish a privileged login to the dataplane switch
 * Obtain a list of the full MAC address table of the switch
 * On boss and the experimental hosts, use available data sources to determine which host or VM owns each MAC address.

'''Verify:'''
 * It is possible to identify and classify every MAC address visible on the switch

== Step 8: verify active dataplane traffic ==

'''Using:'''
 * Establish a privileged login to the dataplane switch
 * Based on the information from Step 7, determine which interfaces are carrying traffic between the experimental VMs
 * Collect interface counters for those interfaces over a period of 10 minutes
 * Estimate the rate at which the experiment is sending traffic

'''Verify:'''
 * The switch reports interface counters, and an administrator can obtain plausible estimates of dataplane traffic quantities by looking at them.
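
The rate estimate in Step 8 comes down to subtracting two octet-counter readings and dividing by the sampling interval. The sketch below is one way to do that arithmetic, not a record of a test that was run. The counter readings are hypothetical, and the function names are made up for illustration. The only figures taken from this page are the Step 3 copy: 3149922720 bytes, which scp displays as 3004MB, so scp's "MB" is 2^20 bytes, moved in 54 seconds.

```python
MIB = 1024 * 1024  # scp's "MB" in the Step 3 transcript is 2**20 bytes


def counter_delta(first, second, bits=64):
    """Octets counted between two readings, tolerating one counter wrap."""
    return (second - first) % (1 << bits)


def rate_mib_per_s(first, second, seconds, bits=64):
    """Average rate over the sampling interval, in MiB per second."""
    return counter_delta(first, second, bits) / seconds / MIB


def seconds_to_wrap(bytes_per_second, bits=32):
    """How long a counter of the given width lasts at a steady rate."""
    return (1 << bits) / bytes_per_second


# Hypothetical readings: the counter grows by exactly one copy of the
# 3149922720-byte file from Step 3 during a 54-second interval.
start_octets = 1_000_000
end_octets = start_octets + 3149922720
rate = rate_mib_per_s(start_octets, end_octets, 54)
wrap = seconds_to_wrap(rate * MIB, bits=32)

print("estimated rate: %.1f MiB/s" % rate)
print("a 32-bit counter would wrap after about %.0f s at that rate" % wrap)
```

With these inputs the estimate comes out at about 55.6 MiB/s, in line with what scp reported. Real switch counters include protocol overhead, so a somewhat higher figure should still count as plausible. At that rate a 32-bit octet counter wraps in a little over a minute, so a 10-minute sample needs either 64-bit counters or intermediate readings.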