Changes between Version 6 and Version 7 of PlasticSlices/BaselineEvaluation/Baseline6Details


Timestamp: 07/11/11 17:32:26
Author: Josh Smift
Comment: Added more details about 'screen' logging problems; also, minor style edits from Heidi.

  • PlasticSlices/BaselineEvaluation/Baseline6Details

    v6 v7  
    1313 * 2011-06-10 @ 17:30: A critical VM server at BBN needed to be rebooted, which included the server where the controllers for the slices were running. In the course of that, the Stanford and Internet2 !FlowVisors went down. We haven't yet tried to reproduce this; we don't know of any reason why shutting down an experimenter's controller (even in a sudden emergency fashion) should affect upstream !FlowVisors.
    1414
    15  * 2011-06-10 @ 21:00: The !FlowVisors at Indiana and NLR also crashed when the VM server that hosts them crashed. This again seems oddly coincidental, but there's no obvious causal chain. While investigating, we also discovered that a GENI test slice at BBN unrelated to Plastic Slices baseline testing had partially cross-connected VLANs 3715 and 3716, which created a loop when the misconfigured Indiana switches failed open rather than closed. All of these problems were eventually corrected.
     15 * 2011-06-10 @ 21:00: The !FlowVisors at Indiana and NLR also crashed when the VM server that hosts them crashed. This again seems oddly coincidental, but there's no obvious causal chain. While investigating, we also discovered that a GENI test slice at BBN unrelated to Plastic Slices baseline testing had partially cross-connected VLANs 3715 and 3716, which created a loop when the misconfigured Indiana switches failed open rather than closed. All of these problems were corrected during the next few hours.
    1616
    1717 * 2011-06-11 @ 00:00: The NLR and Internet2 !FlowVisors came back online, and we were able to revive all of the experiments, except for a few involving Indiana.
     
    1919 * 2011-06-12 @ 01:45: The Stanford !FlowVisor was down.
    2020
    21  * 2011-06-12 @ 13:30: One of the Rutgers MyPLC nodes (orbitplc2) rebooted, and lost its static ARP table, stopping all of the experiments that used that node.
    22 
    23  * 2011-06-13 @ 12:30: Stanford upgraded their !FlowVisor to a specific Git commit that addressed load-related bugs that were affecting them earlier in the baseline, and Indiana's switch/!FlowVisor configuration problems were corrected. orbitplc2 still had no static ARP table.
     21 * 2011-06-12 @ 13:30: One of the Rutgers MyPLC nodes (orbitplc2) rebooted, and the node lost its static ARP table, stopping all of the experiments that used that node.
     22
     23 * 2011-06-13 @ 12:30: Stanford upgraded their !FlowVisor to a specific Git commit that addressed load-related bugs that were affecting them in earlier baseline testing. Indiana's switch/!FlowVisor configuration problems were corrected. orbitplc2 still had no static ARP table.
    2424
    2525 * 2011-06-13 @ 15:15: orbitplc2's static ARP table was fixed; everything was running smoothly again at this point.
     
    3131 * 2011-06-15 @ 14:10: The Internet2 address-related configuration change was complete and I2 traffic resumed flowing; once we revived the experiments, things continued to run smoothly for the brief remainder of the baseline.
    3232
    33  * 2011-06-15 @ 19:30: all experiments shut down. We discovered shortly thereafter that we'd lost some log data (see below for more details).
     33 * 2011-06-15 @ 19:30: All experiments shut down. We discovered shortly thereafter that we'd lost some log data from some experiments (detailed below); we believe this was due to a bug in 'screen', which we were using for logging. In particular, it appears that the buffered log data in memory wasn't written to disk when the screen process terminated, possibly because of the huge size of the logs. (A possible mitigation is sketched at the end of this page.)
    3434
    3535 That isn't a comprehensive record of when things actually went down and came back, but a general summary based on ticket notes. We plan to add a more detailed record based on log and monitoring analysis that is still in progress. We believe that these recorded outages explain most (if not all) of the anomalies in the results below.
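If the lost logs were indeed caused by 'screen' buffering log output in memory, one possible mitigation for future baselines is to shorten screen's log flush interval, so that only a second or so of output is at risk when a screen process dies. This is only a sketch based on GNU screen's documented 'logfile flush' setting; we haven't verified it against the screen version used in these experiments, and the log file name shown is just screen's default pattern, included for illustration:

{{{
# ~/.screenrc (sketch, not the configuration used in the baseline)
deflog on             # turn on logging for every new window
logfile screenlog.%n  # per-window log file name (screen's default pattern)
logfile flush 1       # flush buffered log data to disk every 1 second
                      # (screen's default is to buffer for about 10 seconds)
}}}

Flushing every second trades a small amount of extra disk I/O for a much smaller window of log data that can be lost if the screen process terminates before writing its buffer.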