
Use Case: Emergency with Slice Shutdown

This page describes the steps that might be taken when a slice (or a piece of one) misbehaves and needs to be shut down. In this specific case, the problem is noticed by the operators of some equipment the slice is using.

This is an expansion of the mini-use-case in Aaron's original Use Case slides. As such, it tries to cover exactly the cases that would result in what is shown there. A variant where the NOC decides not to shut down is worked out on another page. Many other variants could be envisioned, for example if the original problem detection happens some other way. At this point we're not trying to enumerate all possible scenarios, but wherever a diversion could happen, it should be called out.

Relation to original slides

Eventually, this section will be irrelevant. But for now many readers may have seen the original slides, so this section relates the description here to them.

In the version shown at GEC#2 this is all compressed into one slide, number 18, but there are several steps done as effects. This page expands on that sequence with more details. To show the relationship, single-digit numbers in angle brackets (e.g. <1>) mark Aaron's original steps, and the sentence following the mark is basically his description with possible minor wordsmithing. This may be followed by notes expanding on the step. Some steps are additions and have no corresponding entry in the original slides: they were either skipped or implied as part of another step, but they are made explicit here. The original slides had: <4> NOC staff review the report and elect to shutdown the rest of the slice. That is a single step 4 in the original slide, but it is actually a number of steps (detailed in the list below) that require the information from step 5 in the slide, so they should have followed step 5, as they do here.

Steps of the Use Case

  1. <1> Aggregate Operations notices (or has received reports of) misbehavior by a processor sliver in the CPU cluster.
    • Note: Presumably the Aggregate Operations has some standard procedures in place that verify that this is actually misbehavior rather than expected or allowed behavior. Their procedures might even be different for internally detected, as opposed to externally reported, problems.
    • However, operations internal to an aggregate are local considerations and are not considered in this use case (though the limits and requirements on them might be covered by an OMIS spec).
  2. <2> Aggregate Ops shuts down the sliver using their internal control plane. This action does not shut down other slivers running in other aggregates, or possibly on other components in this aggregate. The "shut down" does not destroy the sliver or any state; it just somehow prevents the sliver from doing whatever it was doing by setting it to a "safe" state, for example by setting a process to 0 cycles or halting a physical processor into its ROM. The sliver should still be there and part of whatever slice it was in, and specifically any state that the researcher might use to determine what went wrong needs to be preserved.
    • However, AggOps "owns" the equipment (in some sense) and, as someone said, they can always take their bat and ball and go home. This might, however, be a violation of their agreement to offer the equipment for GENI use (requiring non-technical enforcement).
    • It's also possible that they just report a suspicion to the NOC and don't do the shutdown themselves. They let the NOC make the hard decision about whether to interrupt an ongoing experiment.
  3. <3> The NOC is informed of the SliceID and the nature of the failure.
    • This implies that the CM needs to remember the SliceID that it got when the sliver was allocated (when the ticket was redeemed).
    • Such reporting might be through a web form, e.g. opening a ticket, which would incorporate the next step as well. Alternatively, there might be an automated way for AggOps to submit this, say a tool that collects the data and sends an XML object encapsulating the info to a NOC tool that sets up the ticket.
  4. The NOC collects the following data:
    • Designation of the actual component and the sliver within it that were suspected of misbehavior
      • It may not be possible to talk to the component; what is really needed here is whatever ID and contact point is used to get the monitoring info needed later.
    • The exact nature of the problem, how it was measured and how it was isolated to that slice.
    • What actions Aggregate Ops took up to and including the shutdown and calling the NOC
  5. <5> Using the Slice ID, the Slice Registry provides the NOC the other slivers & associated CMs in the slice, as well as contact info for the researcher.
  6. The NOC attempts to contact the RP for the Slice. For the purposes of this case we assume that the initial contact attempts do not succeed (e.g. they leave a message on an answering machine at the contact phone number and send email that they are attempting to get in touch, but don't get any reply).
    • It's not clear how frequent Emergency Shutdown will be on GENI. It happens on PlanetLab and we should learn from their experience. But GENI's slice isolation, if it works as intended, should make it less critical, since any misbehavior should be isolated to that slice. If shutdowns are infrequent enough, it would be courteous to let the RP know, if possible. For that reason attempting to reach them by phone is included. However, if shutdowns happen too frequently, phone calls might be too much of a burden on the NOC staff, and there may not be enough funding (i.e. staff) to cover them. Email will probably be generated automatically and sent off, causing no delay in the NOC dealing with the emergency, so that should be done in all cases. In fact, the email is probably sent when the ticket is opened, which may be before the NOC staff actually starts looking at it.
  7. NOC staff correlates their own measurement data with that reported by AggOps. Assume that it matches (at least roughly).
    • Different procedures would apply if they don't see the effect that AggOps reported.
  8. NOC staff checks measurements of other slices running in the same component and/or aggregate. These checks show that the misbehavior had an effect on other slices.
  9. NOC staff checks measurements of other parts of the slice, and finds the same anomalous measurements and interference there, too.
  10. After that analysis, the NOC staff decides to shut down the slice.
  11. Before the actual shutdown, the NOC makes a final attempt to contact the RP for the slice, with the same results as before.
  12. <6> The NOC sends SliceShutdown messages to every CM in the slice (includes NOC credentials and SliceID).
  13. <7> NOC notifies the researcher of the suspension. Presumably this happens automatically: the tool used to issue the SliceShutdown adds a note to the ticket recording the shutdown, and the researcher is on the list to get updates to the ticket.
  14. NOC notifies owners (via appropriate AggOps?) of the other components of the potential misbehavior. Presumably this is also automatic, probably just sending them a pointer to the ticket.
  15. NOC logs any detected interference with other slices and notifies their owners of potential measurement errors. Probably just more stuff that gets automatically added to the ticket. But there needs to be some way for a researcher, in the future, to ask "what happened at this point in time?" so measurement anomalies can be correlated back to tickets.
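
The shutdown semantics of step 2 and the fan-out in steps 5, 12 and 13 can be sketched roughly as follows. This is only an illustration under stated assumptions, not a GENI or OMIS interface: every name here (SliverState, Sliver, ComponentManager, slice_shutdown, Ticket, shut_down_slice) is hypothetical, the Slice Registry is modeled as a plain dictionary, and credential checking is reduced to a non-empty string.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class SliverState(Enum):
    RUNNING = "running"
    SAFE = "safe"  # halted, but the sliver and its state are preserved


@dataclass
class Sliver:
    sliver_id: str
    slice_id: str  # the CM remembers the SliceID it got at allocation
    state: SliverState = SliverState.RUNNING


@dataclass
class ComponentManager:
    """Hypothetical stand-in for a CM holding slivers of several slices."""
    name: str
    slivers: list = field(default_factory=list)

    def slice_shutdown(self, noc_credential: str, slice_id: str) -> list:
        """Move running slivers of the slice to SAFE; never delete them."""
        if not noc_credential:
            raise PermissionError("SliceShutdown requires NOC credentials")
        halted = []
        for sliver in self.slivers:
            if sliver.slice_id == slice_id and sliver.state is SliverState.RUNNING:
                sliver.state = SliverState.SAFE
                halted.append(sliver.sliver_id)
        return halted


@dataclass
class Ticket:
    slice_id: str
    problem: str
    notes: list = field(default_factory=list)

    def add_note(self, text: str) -> None:
        # Timestamped so anomalies can later be correlated back to the ticket.
        self.notes.append((datetime.now(timezone.utc), text))


def shut_down_slice(registry: dict, ticket: Ticket, noc_credential: str) -> dict:
    """Send SliceShutdown to every CM the registry lists for the slice.

    registry is an assumed mapping of SliceID -> list of ComponentManager.
    """
    results = {}
    for cm in registry[ticket.slice_id]:
        results[cm.name] = cm.slice_shutdown(noc_credential, ticket.slice_id)
        ticket.add_note(f"SliceShutdown sent to {cm.name}: halted {results[cm.name]}")
    return results


if __name__ == "__main__":
    # Sliver s1 was already put into the safe state by AggOps (step 2).
    cluster = ComponentManager(
        "cpu-cluster",
        [Sliver("s1", "slice-A", SliverState.SAFE), Sliver("s2", "slice-B")],
    )
    backbone = ComponentManager("backbone", [Sliver("s3", "slice-A")])
    registry = {"slice-A": [cluster, backbone], "slice-B": [cluster]}
    ticket = Ticket("slice-A", "misbehaving processor sliver s1")
    print(shut_down_slice(registry, ticket, "noc-cred"))
    # prints {'cpu-cluster': [], 'backbone': ['s3']}
```

In this sketch the shutdown is idempotent: the sliver that AggOps already halted is left untouched, slivers of other slices on the same component keep running, and each message leaves a timestamped note on the ticket for the researcher.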

Open questions

These are some questions remaining about this example.

  • Does the NOC also have to feed the status up to other portals or services that are built on top of the slice to provide an explanation to them for what happened to the slice that could be passed on to users? There are likely to be lots more users than RPs.
  • Does the NOC do anything to prevent the researcher from just restarting everything without figuring out what went wrong? Or do we trust the researchers to behave responsibly and respond to the notification of their own accord, assuming that they won't try to run the experiment again until they've done something to figure out and address whatever went wrong?
Last modified on 04/24/08 21:11:36