Changes between Version 5 and Version 6 of PlasticSlices/BaselineEvaluation/Baseline6Details


Timestamp:
07/11/11 17:06:07 (11 years ago)
Author:
hdempsey@bbn.com
Comment:

minor style edits

  • PlasticSlices/BaselineEvaluation/Baseline6Details

    v5 v6  
    55The raw logs of each experiment are at http://www.gpolab.bbn.com/plastic-slices/baseline-logs/baseline-6/.
    66
    7 This baseline was plagued by a variety of outages. Here's a timeline of when we observed problems:
     7A variety of outages occurred during this baseline. Here's a timeline of when we observed problems:
    88
    99 * 2011-06-09 @ 18:10: All experiments started.
    1010
    11  * 2011-06-10 @ 17:15: We observed problems with all connections involving TCP and Stanford. We believe that load on the Stanford !FlowVisor was the underlying cause.
    12 
    13  * 2011-06-10 @ 17:30: A critical VM server at BBN needed to be rebooted, which included the server where the controllers for the slices were running. In the course of that, the Stanford and Internet2 !FlowVisors died completely. We haven't yet tried to reproduce this; we don't know of any reason why shutting down an experimenter's controller (even in a sudden emergency fashion) should affect upstream !FlowVisors.
    14 
    15  * 2011-06-10 @ 21:00: The !FlowVisors at Indiana and NLR also crashed, when the VM server that hosts them crashed. This again seems oddly coincidental, but there's no obvious causal chain. We discovered in here that an unrelated test slice at BBN had partially cross-connected VLAN 3715 and 3716, which created a loop when the Indiana switches failed open rather than closed. All of these problems were eventually corrected.
     11 * 2011-06-10 @ 17:15: We observed problems with all connections involving TCP and Stanford. We believe that non-GENI load on the Stanford !FlowVisor was the underlying cause.
     12
     13 * 2011-06-10 @ 17:30: A critical VM server at BBN needed to be rebooted, which included the server where the controllers for the slices were running. In the course of that, the Stanford and Internet2 !FlowVisors went down. We haven't yet tried to reproduce this; we don't know of any reason why shutting down an experimenter's controller (even in a sudden emergency fashion) should affect upstream !FlowVisors.
     14
     15 * 2011-06-10 @ 21:00: The !FlowVisors at Indiana and NLR also crashed, when the VM server that hosts them crashed. This again seems oddly coincidental, but there's no obvious causal chain. While investigating, we also discovered that a GENI test slice at BBN unrelated to Plastic Slices baseline testing had partially cross-connected VLAN 3715 and 3716, which created a loop when the misconfigured Indiana switches failed open rather than closed. All of these problems were eventually corrected.
    1616
    1717 * 2011-06-11 @ 00:00: The NLR and Internet2 !FlowVisors came back online, and we were able to revive all of the experiments, except for a few involving Indiana.
    1818
    19  * 2011-06-12 @ 01:45: The Stanford !FlowVisor was down again.
    20 
    21  * 2011-06-12 @ 13:30: One of the Rutgers MyPLC nodes (orbitplc2) rebooted, and lost its static ARP table, killing all of the experiments that used that node.
    22 
    23  * 2011-06-13 @ 12:30: Stanford upgraded their !FlowVisor to a specific Git commit that addressed bugs that were affecting them, and Indiana's switch/!FlowVisor configuration problems were corrected. orbitplc2 still didn't have its static ARP table.
    24 
    25  * 2011-06-13 @ 15:15: orbitplc2's static ARP table was fixed; everything was running smoothly again at this point (for the first time in three days).
     19 * 2011-06-12 @ 01:45: The Stanford !FlowVisor was down.
     20
     21 * 2011-06-12 @ 13:30: One of the Rutgers MyPLC nodes (orbitplc2) rebooted, and lost its static ARP table, stopping all of the experiments that used that node.
     22
     23 * 2011-06-13 @ 12:30: Stanford upgraded their !FlowVisor to a specific Git commit that addressed load-related bugs that were affecting them earlier in the baseline, and Indiana's switch/!FlowVisor configuration problems were corrected. The orbitplc2 node still had no static ARP table.
     24
     25 * 2011-06-13 @ 15:15: orbitplc2's static ARP table was fixed; everything was running smoothly again at this point.
    2626
    2727 * 2011-06-14 @ 11:30: We changed the controller configuration to not include a !FlowVisor between the switches and the controllers; shortly thereafter, Stanford's !FlowVisor crashed, and all traffic (including e.g. between hosts at BBN) stopped flowing reliably. As before, we don't see any reason why one of these events should have caused the others, but the timing is very suspicious. Once Stanford restarted their !FlowVisor, traffic within BBN (and elsewhere) returned to normal, and we were able to revive all of the experiments.
    2828
    29  * 2011-06-14 @ 17:20: A routing configuration in Internet2 cut off the I2 OpenFlow switches from the I2 !FlowVisor, which was running in Indiana University testlab IP space; we worked with I2 engineers to change the I2 switches to point to a new OpenFlow software stack in I2 production IP space (which we had planned to do at some point anyway, but this proved an opportune time). All traffic involving Internet2 was down at this point.
    30 
    31  * 2011-06-15 @ 14:10: The Internet2 move was complete, I2 traffic resumed flowing, and things continued to run smoothly for the brief remainder of the baseline, once we revived the experiments.
    32 
    33  * 2011-06-15 @ 19:30: all experiments shut down. We discover shortly thereafter that we've lost some log data (see below for more details).
    34 
    35 That isn't a comprehensive record of when things actually went down and came back; we plan to add that. We believe that these outages explain most (if not all) of the anomalies in the results below.
     29 * 2011-06-14 @ 17:20: An unexpected routing configuration change in Internet2 cut off the I2 OpenFlow switches from the I2 !FlowVisor, which was running in Indiana University testlab IP space; we worked with I2 engineers to change the I2 switches to point to a new OpenFlow software stack in I2 production IP space (which we had planned to do at some point anyway, but this proved an opportune time). All traffic involving Internet2 was down at this point.
     30
     31 * 2011-06-15 @ 14:10: The Internet2 address-related configuration change was complete, I2 traffic resumed flowing, and things continued to run smoothly for the brief remainder of the baseline, once we revived the experiments.
     32
     33 * 2011-06-15 @ 19:30: All experiments shut down. We discovered shortly thereafter that we'd lost some log data (see below for more details).
     34
     35That isn't a comprehensive record of when things actually went down and came back, but a general summary based on ticket notes.  We plan to add a more detailed record based on log and monitoring analysis that is still in progress. We believe that these recorded outages explain most (if not all) of the anomalies in the results below.
    3636
    3737= plastic-101 =