Added 'Normal' priority for Monitoring Infrastructure Setup type.

GENI Operations Trial

This page captures procedures and documentation required for the GENI Operations Trials.

GENI Operations Responsibilities

GENI Operations groups need to track activities and hand them off to each other based on the type of GENI event that occurs. The following table lists the organizations that provide support, by area of operations responsibility:

Team mail list Area of Responsibility/Tools
GMOC Ops gmoc@grnoc.iu.edu Tracking and Escalation, Emergency Stop, Legal, Law Enforcement and Regulatory Requests, New Site Tracking, GENI AL2S services (OESS, OESS-AM, GENI SCS, FSFW, GRNOC coordination)
GPO Dev GENI Tools (gcf, omni, stitcher), GENI Clearinghouse and Portal
GPO Ops Stitching, Escalation Response, OpenFlow
MAX Dev and SCS Escalation
Nick Bastin Dev* nick.bastin@gmail.com FOAM and FlowVisor
RENCI Ops/Dev ExoGENI Racks, ExoGENI Stitching
UKY Ops/Dev GENI Monitoring System, Stitching Computation System
U of Utah InstaGENI Racks
Utah Dev Jacks Tool, Emulab, Apt
* FOAM and FlowVisor (FV) events are not included in the current Operations Trial.

Other operations groups may support GENI for particular aggregates or particular services.

Team mail list Area of Responsibility/Tools
CloudLab CloudLab
ESNET Operations
iMINDS Operations iMinds Aggregates, iMinds SCS
Starlight Operations

GENI Event Type Prioritization and Dispatch

As events are reported, the operations team must determine their priority in order to respond appropriately. Critical events require immediate attention and should be resolved within 1 business day. High priority events should be investigated within 1 business day and resolved within 2 business days. Normal priority events should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window.
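The response targets above can be sketched as a small deadline calculator. This is only an illustration under stated assumptions: a business week is taken to be 5 business days, business days are Monday through Friday with no holiday calendar, and the names `RESOLVE_WITHIN_BUSINESS_DAYS`, `add_business_days`, and `resolve_by` are hypothetical. Scheduled events are not handled here because they close at the end of their own scheduled window.

```python
from datetime import date, timedelta

# Resolution targets in business days, taken from the paragraph above.
# Assumption: "1 business week" is treated as 5 business days.
RESOLVE_WITHIN_BUSINESS_DAYS = {
    "Critical": 1,
    "High": 2,
    "Normal": 5,
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday-Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def resolve_by(priority: str, reported: date) -> date:
    """Latest target date for resolving an event of the given priority."""
    return add_business_days(reported, RESOLVE_WITHIN_BUSINESS_DAYS[priority])
```

For example, under these assumptions a High priority event reported on a Friday would have a resolution target of the following Tuesday.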

The GMOC must also set appropriate priorities for related GENI tickets. A High priority event may be escalated to Critical if the person reporting the issue identifies it as Critical based on the reported circumstances. For example, if an issue will impact a demo, training session, or conference, the reporter may ask for it to be treated as Critical. The following guidelines apply to prioritizing issues reported by a person or by a tool (such as a Nagios alert). This prioritization does not cover scheduled events or routine requests, which fall under the Normal priority classification, below all events listed here. Dispatch in the table below indicates the group with primary responsibility to begin and track the operations response in an area.

Priority | # | Event type | Dispatch | Procedure
Critical | 1 | Emergency Stop and LLR | GMOC | GMOC Emergency Stop procedure and GMOC LLR procedure
Critical | 2 | Security Event [*] | GMOC | OPS-002-A: Security: System Security Events
Critical | 3 | GENI Clearinghouse/Portal Event | GMOC | <<Insert OPS procedure link>>
Critical | 4 | Stitching Computation Service | GMOC | GMOC I2 SCS Procedure exists
Critical | 5 | AL2S Aggregate Event | GMOC | GMOC OESS Procedures exist
Critical | 6 | GENI Stitching Event | GMOC | OPS-003-B: Network Stitching Experiment Debugging Procedure
High | 7 | WiMAX Multipoint VLANs Events | GMOC | OPS-006-B: GENI WiMAX Dataplane Debugging
High | 8 | Regional (AM+Switches+links) | Rack or Site Group | <<Insert OPS procedures links>>
High | 9 | Site Events reported by site contact | GMOC | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance
High | 10 | Site Events (AM+Switches+links reported by experimenters or tools) [**] | Rack or Site Group | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance, OPS-006-A: GENI WiMAX Base Station Debugging
High | 11 | Experimenter Tools Events (Portal, jacks, omni..) | Tools Contacts |
High | 12 | Monitoring Infrastructure Events | UKY Monitoring team | OPS-001-C: Monitoring: Monitoring System Outage
Normal | 13 | Monitoring Infrastructure Setup | UKY Monitoring team / GPO | OPS-001-A: Monitoring: Creating Monitoring Event Alerts, OPS-001-B: Monitoring: Adding Monitoring Sites

[*] Security Events start as Critical and may be re-prioritized upon investigation.
[**] Some Site Events may affect multiple sites (ExoSM) or non-GENI functions (CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.
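The prioritization and dispatch guidelines above can be sketched as a lookup table with a single escalation rule. This is an illustrative sketch, not an official GMOC tool: the dictionary keys and the `triage` function are hypothetical names, and returning no dispatch group for event types outside the table is an assumption, since the page says only that such events are Normal priority.

```python
from typing import Optional, Tuple

# (priority, dispatch group) per event type, transcribed from the table above.
# The keys are hypothetical short names, not official identifiers.
DISPATCH_TABLE = {
    "emergency_stop_llr": ("Critical", "GMOC"),
    "security": ("Critical", "GMOC"),
    "clearinghouse_portal": ("Critical", "GMOC"),
    "stitching_computation_service": ("Critical", "GMOC"),
    "al2s_aggregate": ("Critical", "GMOC"),
    "geni_stitching": ("Critical", "GMOC"),
    "wimax_multipoint_vlans": ("High", "GMOC"),
    "regional": ("High", "Rack or Site Group"),
    "site_reported_by_contact": ("High", "GMOC"),
    "site_reported_by_experimenter_or_tool": ("High", "Rack or Site Group"),
    "experimenter_tools": ("High", "Tools Contacts"),
    "monitoring_infrastructure": ("High", "UKY Monitoring team"),
    "monitoring_setup": ("Normal", "UKY Monitoring team / GPO"),
}


def triage(
    event_type: str, reporter_requests_critical: bool = False
) -> Tuple[str, Optional[str]]:
    """Return (priority, dispatch group) for a reported event.

    Event types not in the table (scheduled events, routine requests) are
    Normal priority; returning None as their dispatch group is an assumption.
    """
    if event_type not in DISPATCH_TABLE:
        return ("Normal", None)
    priority, dispatch = DISPATCH_TABLE[event_type]
    # Only High priority events may be escalated at the reporter's request.
    if priority == "High" and reporter_requests_critical:
        priority = "Critical"
    return (priority, dispatch)
```

The sketch does not model later re-prioritization of Security Events, which depends on the outcome of the investigation.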

GENI Monitoring Tools and Documentation

The GENI Monitoring System is available to report status for resources, services, and experimenters in GENI. See the GENI Monitoring wiki page for information on features, reporting, and alerts for GENI infrastructure. Some pages that may be helpful initially:

  • Getting an Account - An overview of account types and instructions on how to get an account.
  • GENI Monitoring Overview - An overview of the functionality that highlights potential usage and available functions, and notes restrictions or potential issues for experimenters.
  • GENI Monitoring Resources Check - An example of how to monitor a specific experiment's resources, how to access slice, sliver, VLAN, and interface information, and how to determine the mapping between these objects.

Other tools are available to determine resource status in GENI:

The GMOC, rack teams, and (maybe) Nick Bastin use other operations tools that are not covered in this wiki. Tools that are required for operations should be mentioned in the relevant procedures, along with a link to documentation and (when available) source code.

GENI Operations Procedures

Procedures to handle common operational events are available to define how the GENI Operations team should respond to, track, and resolve events. A GENI Operations Template is available for defining new procedures.

Procedure Name | Procedure Description
OPS-001-A: Monitoring: Creating Monitoring Event Alerts | Instructions on how to define alerts for operations events.
OPS-001-B: Monitoring: Adding Monitoring Sites | Defines the information required to request the addition, modification, or removal of a GENI site in monitoring.
OPS-001-C: Monitoring: Monitoring System Outage | Defines how to report and respond to monitoring system events, such as a collector or GUI that is down, non-critical bugs, or incorrect measurements.
OPS-002-A: Security: System Security Events | Describes how to report a security event to the GENI Monitoring team, and how to respond to that event. This type of event can also include Emergency Stop events.
OPS-003-A: GENI Network: Dataplane Debugging | Defines the processes for detecting and debugging data plane and control plane issues in GENI experiments. [1]
OPS-003-B: GENI Network: Stitching Experiment Debugging | Defines how to detect and debug issues with the stitching of Layer 2 VLANs in GENI Network Stitching experiments.
OPS-004-A: Site Power Outage | Defines how to handle a site power failure and how to verify that the site's aggregate is fully restored upon recovery. [1]
OPS-005-A: Scheduled Site Maintenance | Defines how to schedule one-time or periodic maintenance activities. [2]
OPS-006-A: GENI WiMAX Base Station Debugging | Defines how to identify and remedy GENI WiMAX base station configuration issues at a WiMAX site. [3]
OPS-006-B: GENI WiMAX Dataplane Debugging | Defines how to identify and remedy issues with GENI WiMAX multipoint VLAN(s) between sites.

[1] Before investigating this type of issue, first verify that it is not due to a Scheduled Maintenance.
[2] Site, service, or resource outages may be caused by a Scheduled Maintenance.
[3] A WiMAX issue arises when the frequency or transmit power of a campus WiMAX base station changes. These variables are fixed by site administrators and negotiated via a GENI-Sprint agreement. An event of this type should be very rare, but if monitoring reports a change, handling the event is high priority.
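
The check described in note [1] can be sketched as a lookup against known maintenance windows before an investigation starts. This is a minimal sketch under assumptions: the `MaintenanceWindow` record, the `in_scheduled_maintenance` function, and the site names are hypothetical, and how scheduled maintenance is actually recorded in GENI is not described on this page.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class MaintenanceWindow(NamedTuple):
    """Hypothetical record of one scheduled maintenance window at a site."""
    site: str
    start: datetime
    end: datetime


def in_scheduled_maintenance(
    site: str, when: datetime, windows: Iterable[MaintenanceWindow]
) -> bool:
    """Return True if `when` falls inside a scheduled window for `site`."""
    return any(w.site == site and w.start <= when <= w.end for w in windows)


# Example with made-up data: one window at a hypothetical site "site-a".
windows = [
    MaintenanceWindow("site-a", datetime(2024, 3, 4, 8, 0), datetime(2024, 3, 4, 12, 0)),
]
outage_during_window = in_scheduled_maintenance("site-a", datetime(2024, 3, 4, 9, 30), windows)
outage_after_window = in_scheduled_maintenance("site-a", datetime(2024, 3, 4, 13, 0), windows)
```

If the check returns True, the outage may simply be the expected effect of the maintenance, as note [2] points out.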