Version 1 (modified by Mike Patton, 16 years ago)

Use Case: Emergency with no Slice Shutdown

This is a variant of the mini-use-case in Aaron's original slides. GeniOmisUseEmergShut expands on the original, and this use case builds on that expansion. It starts off the same, but here the NOC's investigation leads it to decide against a shutdown. The first steps are therefore the same as there, and the two cases then gradually diverge.

  1. Aggregate Operations notices (or has received reports of) misbehavior by a processor sliver in the CPU cluster.
    • Note: Presumably Aggregate Operations has standard procedures in place to verify that this is actually misbehavior rather than expected or allowed behavior. Those procedures might even differ for internally detected, as opposed to externally reported, problems. However, operations internal to an aggregate are not considered in this use case (although their limits and requirements might belong in an OMIS spec).
  2. Aggregate Ops shuts down the sliver using their internal control plane. This action does not shut down other slivers running in other aggregates, or possibly on other components in this aggregate. The "shut down" does not destroy the sliver or any state; it just prevents the sliver from doing whatever it was doing by setting it to a "safe" state, for example by setting a process to 0 cycles or halting a physical processor into its ROM. The sliver should still exist and remain part of whatever slice it was in, and in particular any state that the researcher might use to determine what went wrong needs to be preserved.
    • It's also possible that they just report a suspicion to the NOC and don't do the shutdown themselves. They let the NOC make the hard decision about whether to interrupt an ongoing experiment.
  3. The NOC is informed of the slice ID and the nature of the failure.
    • This implies that the CM needs to remember the slice ID that it got when the sliver was allocated (when the ticket was redeemed).
    • Such reporting might be through a web form, e.g. opening a ticket, which would incorporate the next step as well. Alternatively, there might be an automated way for AggOps to submit this, say a tool that collects the data and sends an XML object encapsulating the info to a NOC tool that sets up the ticket.
  4. The NOC collects the following data:
    • Designation of the actual component and the sliver within it that were suspected of misbehavior
      • It may not be possible to talk to the component. What is really needed here is whatever ID and contact point is used to get the monitoring info needed later.
    • The exact nature of the problem, how it was measured and how it was isolated to that slice.
    • What actions Aggregate Ops took up to and including the shutdown and calling the NOC
  5. Using the slice ID, the Slice Registry provides the NOC with the other slivers and associated CMs in the slice, as well as contact info for the researcher.
  6. The NOC contacts the RP for the slice and notifies them that a problem has been reported and they are investigating. This happens because the RP gets listed on the ticket and they get an email from the ticket system.
  7. NOC staff correlates their own measurement data with that reported by AggOps. Assume that it matches (at least roughly).
    • Different procedures would apply if they don't see the effect that AggOps reported.
  8. NOC staff checks measurements of other slices running in the same component and/or aggregate. These checks show no effect on other slices by the alleged misbehavior.
  9. NOC staff checks measurements of other parts of the slice, and finds no interference there, either. Although the same anomalous measurements that AggOps first reported are seen in other locations, the NOC sees no interference with other slices.
  10. The NOC elects NOT to shut down the slice.
  11. NOC notifies the researcher of the result of their analysis: One sliver was shut down by AggOps. Several anomalous measurements were noted, possibly indicating a malfunction in the experiment. However, since slice isolation successfully kept other slices from being impacted, no further action was taken. It's up to the researcher to determine what caused the anomalous data and (probably) fix the experiment.
    • The researcher might add a note to the ticket which explains what the experiment was doing (whether or not the behavior was expected by the researcher, it was clearly unexpected by the instigating AggOps). This might start a discussion between AggOps and the researcher that allows the experiment to be tuned to not trigger the alarm, or the alarms to be adjusted to not trigger on what the experiment is expected to do.
  12. NOC notifies original AggOps that they have concluded that, while the sliver was doing strange and unexpected things, the GENI slice isolation was working and there was no need to shut down the slice. (Of course, it's entirely up to the Aggregate owner to redress the original sliver shutdown or not.)
  13. NOC logs the anomalous measurements, so that if other researchers notice that there was interference, they have this info.
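
Step 3 mentions that AggOps might submit the report as an XML object. As a rough sketch of what such an object could carry, the following collects the data items from step 4 into a record and serializes it. The class, field, and element names are all hypothetical, chosen only for illustration; nothing here is defined by an OMIS spec.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Data the NOC collects in step 4 (all field names are hypothetical)."""
    slice_id: str            # slice ID remembered by the CM (step 3)
    component_id: str        # component suspected of misbehavior
    sliver_id: str           # sliver within that component
    monitoring_contact: str  # where to get the monitoring info needed later
    problem: str             # nature of the problem and how it was isolated
    actions_taken: list = field(default_factory=list)  # what AggOps did


def to_xml(report: IncidentReport) -> str:
    """Serialize the report to an XML string (element names are made up)."""
    root = ET.Element("incident", {"sliceId": report.slice_id})
    ET.SubElement(root, "component", {"id": report.component_id})
    ET.SubElement(root, "sliver", {"id": report.sliver_id})
    ET.SubElement(root, "monitoringContact").text = report.monitoring_contact
    ET.SubElement(root, "problem").text = report.problem
    actions = ET.SubElement(root, "actions")
    for action in report.actions_taken:
        ET.SubElement(actions, "action").text = action
    return ET.tostring(root, encoding="unicode")
```

A NOC-side tool could parse this string with `ET.fromstring` and open the ticket from its contents, listing the RP on the ticket as in step 6.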

The rest of these may be optional. I'm not sure about this; it may depend on how often these false positives happen.

  • NOC notifies owners (via appropriate AggOps?) of the other components belonging to the slice of the reported misbehavior, and the NOC's analysis.
  • NOC notifies the owners of other slices using common components of potential measurement errors.
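
The NOC's decision in steps 7 to 10 and the follow-ups in steps 11 to 13 can be summarized as a small decision function. This is only a sketch: the names are hypothetical, and the assumption that any observed interference leads to the shutdown path of GeniOmisUseEmergShut is an inference from how the two use cases relate, not something stated in the steps.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Outcome of the NOC's checks in steps 7 to 9 (names are hypothetical)."""
    matches_aggops_report: bool   # step 7: NOC data agrees with AggOps
    other_slices_affected: bool   # step 8: same component and/or aggregate
    interference_elsewhere: bool  # step 9: other parts of the slice


def decide(findings: Findings) -> str:
    """Return the NOC's course of action."""
    if not findings.matches_aggops_report:
        # Step 7 note: different procedures apply; not covered here.
        return "other-procedure"
    if findings.other_slices_affected or findings.interference_elsewhere:
        # Assumption: isolation failed, so the shutdown use case applies.
        return "shutdown"
    return "no-shutdown"  # step 10


def follow_ups(decision: str, include_optional: bool = False) -> list:
    """List the NOC's follow-ups after a no-shutdown decision."""
    if decision != "no-shutdown":
        return []
    # Steps 11 to 13.
    items = ["notify researcher", "notify original AggOps",
             "log anomalous measurements"]
    if include_optional:
        # The possibly optional notifications listed above.
        items += ["notify owners of other components in the slice",
                  "notify owners of other slices on common components"]
    return items
```

Keeping the optional notifications behind a flag mirrors the open question of whether they are worth sending for every false positive.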