* 2011-06-10 @ 17:15: We observed problems with all connections involving TCP and Stanford. We believe that non-GENI load on the Stanford !FlowVisor was the underlying cause.

* 2011-06-10 @ 17:30: A critical VM server at BBN needed to be rebooted, which included the server where the controllers for the slices were running. In the course of that, the Stanford and Internet2 !FlowVisors went down. We haven't yet tried to reproduce this; we don't know of any reason why shutting down an experimenter's controller (even in a sudden emergency fashion) should affect upstream !FlowVisors.

* 2011-06-10 @ 21:00: The !FlowVisors at Indiana and NLR also crashed when the VM server that hosts them crashed. This again seems oddly coincidental, but there's no obvious causal chain. While investigating, we also discovered that a GENI test slice at BBN unrelated to Plastic Slices baseline testing had partially cross-connected VLANs 3715 and 3716, which created a loop when the misconfigured Indiana switches failed open rather than closed. All of these problems were eventually corrected.
* 2011-06-12 @ 01:45: The Stanford !FlowVisor was down.

* 2011-06-12 @ 13:30: One of the Rutgers MyPLC nodes (orbitplc2) rebooted and lost its static ARP table, stopping all of the experiments that used that node.

* 2011-06-13 @ 12:30: Stanford upgraded their !FlowVisor to a specific Git commit that addressed load-related bugs that were affecting them earlier in the baseline, and Indiana's switch/!FlowVisor configuration problems were corrected. orbitplc2 still had no static ARP table.

* 2011-06-13 @ 15:15: orbitplc2's static ARP table was fixed; everything was running smoothly again at this point.
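As an aside, the orbitplc2 failure mode (a reboot silently wiping static ARP entries) is straightforward to guard against by re-applying the entries from a boot-time script. The sketch below is purely illustrative and makes several assumptions not stated in this report: a Linux node with iproute2, and placeholder IP/MAC/interface values (these are not the real orbitplc2 entries). It defaults to a dry run that prints the commands, since `ip neigh` requires root.

```shell
#!/bin/sh
# Re-apply static ARP entries after a reboot. Placeholder values only;
# a real node would read its entries from a file restored at boot.
DRY_RUN=${DRY_RUN:-1}

add_static_arp() {
    ip=$1; mac=$2; dev=$3
    # "nud permanent" makes the neighbor entry static (survives until reboot,
    # which is exactly why it must be re-applied by an init script).
    cmd="ip neigh replace $ip lladdr $mac nud permanent dev $dev"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"        # dry run: show what would be executed
    else
        $cmd
    fi
}

# Hypothetical peer entries for illustration.
add_static_arp 10.42.11.1 00:11:22:33:44:55 eth1
add_static_arp 10.42.12.1 00:11:22:33:44:66 eth1
```

Hooking such a script into the node's boot sequence would have let orbitplc2 come back from its reboot with the table intact, rather than waiting two days for a manual fix.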
* 2011-06-14 @ 17:20: An unexpected routing configuration change in Internet2 cut off the I2 OpenFlow switches from the I2 !FlowVisor, which was running in Indiana University testlab IP space; we worked with I2 engineers to change the I2 switches to point to a new OpenFlow software stack in I2 production IP space (which we had planned to do at some point anyway, but this proved an opportune time). All traffic involving Internet2 was down at this point.

* 2011-06-15 @ 14:10: The Internet2 address-related configuration change was complete, I2 traffic resumed flowing, and things continued to run smoothly for the brief remainder of the baseline, once we revived the experiments.

* 2011-06-15 @ 19:30: All experiments were shut down. We discovered shortly thereafter that we'd lost some log data (see below for more details).

That isn't a comprehensive record of when things actually went down and came back, but a general summary based on ticket notes. We plan to add a more detailed record based on log and monitoring analysis that is still in progress. We believe that these recorded outages explain most (if not all) of the anomalies in the results below.