Use Case: Unplanned Outage

  1. The NOC detects (or is notified by AggOps of) a component outage.
  2. The NOC investigates and determines that it is a real outage and not just "a ghost" outage indication.
  3. Using the component registry, the NOC identifies the RP for the component and contacts them. They confirm that there is an outage and give the NOC an indication of its extent (other components that are, or are likely to be, affected) and an estimated restoral time.
    • The owner's response might instead be "That component is just bad (or obsolete); we're not going to fix it." That is a different case, which we don't explore here.
  4. Using the slice registry, the NOC finds all slices that have a sliver on the affected components.
    • This should cover not just slivers allocated now, but also those allocated during the expected duration of the outage (and maybe a little beyond) :-) Can that be done?
  5. The NOC notifies the RP of each slice (if they have asked to be notified) of the outage, with the estimated duration. The notification includes the number of the ticket tracking the outage, in case the RP wishes to watch it or sign up for updates.
    • A possible response, not discussed in this example, is for a researcher to migrate their experiment off the sliver that is out (see GeniOmisUsePlanned for a similar discussion). In some experiments, this migration may actually be part of the experiment.
  6. The NOC monitors for restoral (and bugs the component owner or operator for a new time estimate when the old one is exceeded).
    • Note that, based on my operational experience, I said "when," not "if," there... but there is always that exceptional case where they do meet the original restoral estimate. :-)
  7. The NOC determines (or is notified and verifies) that the component has returned to operation.
  8. The RPs of the affected slices are notified of the restoral, so they can check on their experiment.
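
Steps 4 and 5 can be sketched in code, mainly to make the open point in step 4 concrete: whether the lookup can cover slivers scheduled during the outage and not only those active now. This is a minimal sketch under stated assumptions, not a description of any real registry interface. The Sliver and SliceContact records, the affected_slices and build_notifications functions, the margin parameter, and the ticket string are all hypothetical names introduced for illustration. It assumes the slice registry can return each sliver's allocation start and end times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sliver:
    """Hypothetical slice-registry record: one slice's allocation on one component."""
    slice_name: str
    component_id: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class SliceContact:
    """Hypothetical record of a slice's RP and whether they asked to be notified."""
    rp_email: str
    wants_notification: bool


def affected_slices(slivers, down_components, outage_start, est_restoral,
                    margin=timedelta(hours=1)):
    """Step 4: names of slices with a sliver on a down component whose
    allocation overlaps the outage window, padded by `margin`."""
    window_end = est_restoral + margin
    return sorted({
        s.slice_name
        for s in slivers
        if s.component_id in down_components
        # Standard interval-overlap test: the allocation starts before the
        # window ends and ends after the window starts.
        and s.start < window_end
        and s.end > outage_start
    })


def build_notifications(slice_names, contacts, ticket_id, est_restoral):
    """Step 5: (address, message) pairs for RPs who asked to be notified."""
    notes = []
    for name in slice_names:
        contact = contacts.get(name)
        if contact is None or not contact.wants_notification:
            continue  # no registered RP, or the RP did not opt in
        message = (
            f"Outage affecting slice {name}; "
            f"estimated restoral {est_restoral:%Y-%m-%d %H:%M}; "
            f"tracking ticket {ticket_id}."
        )
        notes.append((contact.rp_email, message))
    return notes
```

Under these assumptions, a sliver that is scheduled to start three hours into a four-hour outage is reported even though it is not active when the outage begins, while a sliver that starts well after the padded restoral estimate is not. When the restoral estimate slips (step 6), the same lookup would need to be rerun with the new estimate, since additional slices may fall inside the longer window.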

Open questions

These are some questions remaining about this example.

  • When a component fails, can the owner just decide that it's not worth fixing and pull it from service? That's essentially saying that the time to repair is infinite. This certainly seems reasonable. Perhaps we need a Use Case to explore that.
Last modified on 04/24/08 21:16:20