Use Case: Planned Outage

This is a use case for a planned outage. It has some superficial similarities to the emergency shutdown use cases (GeniOmisUseEmergShut and GeniOmisUseEmergNot). It covers an outage of a particular component and its effect on a particular experiment. Different experiments may choose to react differently.

  1. The owner (or operator) of a component decides that it needs to be shut down at some future time. For the purpose of this example, the component is a compute cluster that needs to be unplugged to move it to a new location.
    • A planned outage could affect a single device, a set of components, or an entire aggregate. These cases would involve only slight differences in the scenario.
    • There are many possible reasons for a planned outage; maintenance and moving a lab are just two examples. The details are irrelevant. What matters is that the outage is planned and that the NOC is actually notified (an outage carried out without notifying the NOC looks like an [GeniOmisUseOutageUnplanned unplanned outage] instead).
  2. The owner (or AggOps for the aggregate that contains, or is, the component) notifies the NOC, giving the time and extent of the outage.
    • There probably needs to be a web form for this...or some other automated way to report it. (Q: Could it be done with the ticket system?)
    • They also update availability info (in RSpecs or somewhere) so additional allocations won't be made.
  3. The NOC finds all allocated slivers on the component(s) and determines the slices they correspond to (specifically, the RP for each slice and a flag indicating whether they want notifications).
    • The relevant slivers are those allocated at the future time of the outage, not those allocated now.
  4. All researchers who elected to receive notifications are sent notices.
    • Researchers who did not elect to get notifications will see this like any [GeniOmisUseOutageUnplanned unplanned outage].
  5. One researcher using one of the constituent slivers decides to avoid the outage.
    • Other researchers may decide that their experiments are supposed to be robust against outages of some compute components and use this outage as a test of that. Having been advised in advance of when the outage will occur, they can observe it as it happens.
  6. The researcher goes to the researcher portal used to set up the experiment (slice) and allocates a new compute resource not subject to the outage, and communications resources to connect it to the currently running experiment.
  7. The researcher loads the same code into the new resource and brings it up in the existing experiment. At this point the two nodes share the load.
  8. Once the new node is up, and the researcher determines that it is functioning properly as part of the experiment, the old node is shut down cleanly by the researcher.
  9. All data relevant to the experiment are transferred off the node that will be shut down as part of the outage.
  10. The researcher goes back to the portal and releases the old node and the communications between it and the rest of the experiment. (They will be available to other researchers after the outage, or before if they agree to that.)
  11. The owner or operator of the equipment turns it off and moves it, with no effect on the experiment, since the experiment is no longer using the component.
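
Steps 3 and 4 amount to an interval-overlap query followed by a filter on the notification flag. The following is a minimal sketch only: the `Outage` and `Sliver` record types and all of their field names are hypothetical, chosen for illustration, and do not come from any GENI specification or existing NOC tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of a planned outage reported to the NOC."""
    component_id: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Sliver:
    """Hypothetical record of one sliver allocation on a component."""
    slice_name: str
    component_id: str
    start: datetime      # allocation start
    end: datetime        # allocation end
    contact: str         # researcher contact for the slice (assumed field)
    wants_notice: bool   # researcher elected to receive notifications


def affected_slivers(slivers, outage):
    """Slivers on the outage's component whose allocation overlaps the outage window.

    The check uses the allocation period, not the current time, so it covers
    slivers allocated at the future time of the outage.
    """
    return [
        s for s in slivers
        if s.component_id == outage.component_id
        and s.start < outage.end
        and outage.start < s.end
    ]


def notification_list(slivers, outage):
    """Sorted, de-duplicated contacts of affected researchers who asked for notices."""
    return sorted(
        {s.contact for s in affected_slivers(slivers, outage) if s.wants_notice}
    )
```

Researchers whose slivers are affected but who did not set the flag are deliberately left out of `notification_list`, which matches the note under step 4: for them the event looks like an unplanned outage.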

Open questions

These are some questions remaining about this example.

  • Can the ticket system handle all tracking of planned outages? This seems highly desirable, and therefore needs to be in its requirements.
  • We talk about not making allocations that cover the planned outage, but a researcher who wants to allocate for a week, a month, or longer may not care that the period happens to include a one-hour planned outage. It is presumably up to the portal they use to let them decide. That means "unavailable" resource specs need to carry information that lets them make that decision.
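
One way to picture the second question: if the resource spec carried the start and end of the planned outage, a portal could compute how much of a requested allocation falls inside it and compare that against a downtime limit the researcher chooses. This is a sketch under those assumptions; the function names and the `max_downtime` parameter are hypothetical and not part of any RSpec definition.

```python
from datetime import datetime, timedelta


def outage_overlap(request_start, request_end, outage_start, outage_end):
    """Length of time for which the requested allocation overlaps the outage."""
    latest_start = max(request_start, outage_start)
    earliest_end = min(request_end, outage_end)
    # A negative difference means the two windows do not overlap at all.
    return max(earliest_end - latest_start, timedelta(0))


def acceptable(request_start, request_end, outage_start, outage_end, max_downtime):
    """True if the overlap with the outage is within the researcher's tolerance."""
    overlap = outage_overlap(request_start, request_end, outage_start, outage_end)
    return overlap <= max_downtime
```

With a tolerance of zero this reduces to the stricter rule of refusing any allocation that covers the outage, so the same check could serve both kinds of researcher.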
Last modified on 04/24/08 21:13:15