Version 30 (modified by adaadwil@indiana.edu, 3 years ago)

GENI Operations Trial

This page captures procedures and documentation required for the GENI Operations Trials.

GENI Operations Responsibilities

GENI Operations groups track activities and hand them off to each other, based on the type of GENI event that occurs. The GMOC provides a public support line and email address, ticket tracking, and a public ticket view for all GENI experimenters and operators. The following table lists the organizations that provide support, by area of operations responsibility:

Team | Mail list | Area of Responsibility/Tools
GMOC Ops | gmoc@grnoc.iu.edu | Tracking and Escalation, Emergency Stop, Legal, Law Enforcement and Regulatory Requests, New Site Tracking, GENI AL2S services (OESS, OESS-AM, GENI SCS, FSFW, GRNOC coordination)
GPO Dev | gpo-sw-dev@geni.net | GENI Tools (gcf, omni, stitcher), GENI Clearinghouse and Portal
GPO Ops | geni-ops@googlegroups.com | Stitching, Escalation Response, OpenFlow
MAX Dev | maxyang@umd.edu and tlehman@umd.edu | SCS Escalation
Nick Bastin Dev* | nick.bastin@gmail.com | FOAM and FlowVisor
RENCI Ops/Dev | exogeni-ops@renci.org | ExoGENI Racks, ExoGENI Stitching
UKY Ops/Dev | geni-ops@googlegroups.com | GENI Monitoring System, Stitching Computation System (IG rack endpoints) (Note: UKY is coming up to speed on IG racks)
U of Utah Ops | geni-ops@googlegroups.com | InstaGENI Racks (Note: Utah is gradually transitioning some IG activities to UKY)
OpenGENI Ops | TBD | OpenGENI Racks
Utah Dev | geni-dev-utah@flux.utah.edu | Jacks Tool, Emulab, Apt

* FOAM and FV events are not included in the current Operations Trial.

Other operations groups may support GENI for particular aggregates or particular services.

Team | Mail list | Area of Responsibility/Tools
CloudLab Operations | cloudlab-ops@cloudlab.us | CloudLab
ESNET Operations | |
iMINDS Operations | vwall-ops@atlantis.ugent.be | iMinds Aggregates, iMinds SCS
Starlight Operations | |

GENI Event Type Prioritization and Dispatch

As events are reported, the operations team must determine their priority in order to respond appropriately. Critical events require immediate attention and should be resolved within 1 business day. High priority events should be investigated within 1 business day and resolved within 2 business days. Normal priority events should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window.
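The response targets above can be expressed as a small deadline calculation. The following is a minimal sketch, not an official GENI tool. It assumes that a business day is Monday through Friday with no holiday calendar, that "immediate attention" for Critical events means investigation starts on the day of the report, and that one business week equals 5 business days. The names `RESPONSE_TARGETS`, `add_business_days`, and `response_deadlines` are hypothetical. Scheduled events are not modeled, since they close at the end of their scheduled window.

```python
from datetime import date, timedelta
from typing import Tuple

# Targets in business days: (investigate within, resolve within).
# Assumptions: Critical "immediate attention" is modeled as 0 days to
# investigate; one business week is modeled as 5 business days.
RESPONSE_TARGETS = {
    "Critical": (0, 1),
    "High": (1, 2),
    "Normal": (5, 5),
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def response_deadlines(priority: str, reported: date) -> Tuple[date, date]:
    """Return (investigate_by, resolve_by) for an event reported on `reported`."""
    investigate, resolve = RESPONSE_TARGETS[priority]
    return (
        add_business_days(reported, investigate),
        add_business_days(reported, resolve),
    )
```

For example, a High priority event reported on a Friday would be due for investigation by the following Monday and for resolution by the following Tuesday under these assumptions.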

The GMOC must also set appropriate priorities for related GENI tickets. A High priority event may be raised to Critical if the person reporting the issue identifies it as Critical based on the reported circumstances. For example, if an issue will impact a demo, training session, or conference, the reporter may ask for it to be treated as Critical. The following guidelines prioritize issues reported by a person or by a tool (such as a Nagios alert). They do not cover scheduled events or routine requests, which fall under the Normal priority classification, below all events listed here. In the table below, Dispatch indicates the group with primary responsibility for beginning and tracking the operations response in an area.

Priority | # | Event type | Dispatch | Procedure
Critical | 1 | Emergency Stop and LLR | GMOC | GMOC Emergency Stop procedure and GMOC LLR procedure
Critical | 2 | Security Event [*] | GMOC | OPS-002-A: Security: System Security Events
Critical | 3 | GENI Clearinghouse/Portal Event | GMOC | <<TBD>>
Critical | 4 | Stitching Computation Service | GMOC | GMOC I2 SCS Procedure exists
Critical | 5 | AL2S Aggregate Event | GMOC | GMOC OESS Procedures exist
Critical | 6 | GENI Stitching Event | GMOC | OPS-003-B: Network Stitching Experiment Debugging Procedure
High | 7 | WiMAX Multipoint VLANs Events | GMOC | OPS-006-B: GENI WiMAX Dataplane Debugging
High | 8 | Regional (AM+Switches+links) | Rack or Site Group | <<TBD>>
High | 9 | Site Events reported by site contact | GMOC | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance
High | 10 | Site Events (AM+Switches+links reported by experimenters or tools) [**] | Rack or Site Group | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance, OPS-006-A: GENI WiMAX Base Station Debugging
High | 11 | Experimenter Tools Events (Portal, jacks, omni..) | Tools Contacts |
High | 12 | Monitoring Infrastructure Events | UKY Monitoring team | OPS-001-C: Monitoring: Monitoring System Outage
Normal | 13 | Monitoring Infrastructure Setup | UKY Monitoring team / GPO | OPS-001-A: Monitoring: Creating Monitoring Event Alerts, OPS-001-B: Monitoring: Adding Monitoring Sites

[*] Security Events start as Critical and may be re-prioritized upon investigation.
[**] Some Site Events may affect multiple sites (ExoSM) or non-GENI functions (CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.
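The prioritization and dispatch table above amounts to a lookup keyed by event number, combined with the rule that a High priority event may be raised to Critical at the reporter's request. The sketch below illustrates that reading. It is not part of any GMOC ticketing system: `DispatchRule`, `DISPATCH_TABLE`, and `triage` are hypothetical names, and the Procedure column is omitted for brevity.

```python
from typing import NamedTuple


class DispatchRule(NamedTuple):
    priority: str
    event_type: str
    dispatch: str


# Rows copied from the prioritization table, keyed by the "#" column.
DISPATCH_TABLE = {
    1: DispatchRule("Critical", "Emergency Stop and LLR", "GMOC"),
    2: DispatchRule("Critical", "Security Event", "GMOC"),
    3: DispatchRule("Critical", "GENI Clearinghouse/Portal Event", "GMOC"),
    4: DispatchRule("Critical", "Stitching Computation Service", "GMOC"),
    5: DispatchRule("Critical", "AL2S Aggregate Event", "GMOC"),
    6: DispatchRule("Critical", "GENI Stitching Event", "GMOC"),
    7: DispatchRule("High", "WiMAX Multipoint VLANs Events", "GMOC"),
    8: DispatchRule("High", "Regional (AM+Switches+links)", "Rack or Site Group"),
    9: DispatchRule("High", "Site Events reported by site contact", "GMOC"),
    10: DispatchRule("High", "Site Events reported by experimenters or tools",
                     "Rack or Site Group"),
    11: DispatchRule("High", "Experimenter Tools Events", "Tools Contacts"),
    12: DispatchRule("High", "Monitoring Infrastructure Events",
                     "UKY Monitoring team"),
    13: DispatchRule("Normal", "Monitoring Infrastructure Setup",
                     "UKY Monitoring team / GPO"),
}


def triage(event_number: int, reporter_says_critical: bool = False) -> DispatchRule:
    """Look up an event and apply the High-to-Critical up-leveling rule.

    Only High priority events are raised; Normal events are left unchanged,
    following the text, which describes up-leveling for High events only.
    """
    rule = DISPATCH_TABLE[event_number]
    if rule.priority == "High" and reporter_says_critical:
        return rule._replace(priority="Critical")
    return rule
```

For instance, `triage(11, reporter_says_critical=True)` would return the Experimenter Tools row with its priority raised to Critical, while the dispatch group stays Tools Contacts.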

GENI Monitoring Tools and Documentation

The GENI Monitoring System reports status for resources, services, and experimenters in GENI. See the GENI Monitoring wiki page for information on features, reporting, and alerts for GENI infrastructure. Some pages that may be helpful initially:

  • Getting an Account - An overview of account types and instructions on how to get an account.
  • GENI Monitoring Overview - An overview of the functionality that highlights potential usage and available functions, and notes restrictions or potential issues for experimenters.
  • GENI Monitoring Resources Check - An example of how to use monitoring for a specific experiment's resources: how to access slice, sliver, VLAN, and interface information and determine the mapping between these objects.
  • GENI Monitoring Site Issue Investigation - An example of how to investigate a site-specific issue, using monitoring to determine resource allocation and the source of the issue.

Other tools are available to determine resource status in GENI:

The GMOC, rack teams, and (maybe) Nick Bastin use other operations tools that are not covered in this wiki. Tools that are required for operations should be mentioned in the relevant procedures, along with a link to documentation and (when available) source code.

GENI Operations Procedures

Procedures for handling common operational events define how the GENI Operations team should respond to, track, and resolve events. A GENI Operations Template is available for defining new procedures.

Procedure Name | Procedure Description
OPS-001-A: Monitoring: Creating Monitoring Event Alerts | Instructions on how to define alerts for operations events.
OPS-001-B: Monitoring: Adding Monitoring Sites | Defines the information required to request the addition, modification, or removal of a GENI site in monitoring.
OPS-001-C: Monitoring: Monitoring System Outage | Defines how to report and respond to monitoring system events such as a collector or GUI being down, non-critical bugs, or incorrect measurements.
OPS-002-A: Security: System Security Events | Describes how to report a security event to the GENI Monitoring team, and how to respond to that event. This type of event can also include Emergency Stop events.
OPS-003-A: GENI Network: Dataplane Debugging | Defines the processes for detecting and debugging GENI experiment issues with the data plane and control plane. [1]
OPS-003-B: GENI Network: Stitching Experiment Debugging | Defines how to detect and debug issues with the stitching of Layer 2 VLANs for GENI experiments.
OPS-004-A: Site Power Outage | Defines how to handle a site power failure and how to verify that the site's aggregate is fully restored upon recovery. [1]
OPS-005-A: Scheduled Site Maintenance | Defines how to schedule one-time or periodic maintenance activities. [2]
OPS-006-A: GENI WiMAX Base Station Debugging | Defines how to identify and remedy GENI WiMAX base station configuration issues at a WiMAX site. [3]
OPS-006-B: GENI WiMAX Dataplane Debugging | Defines how to identify and remedy issues with GENI WiMAX multipoint VLAN(s) between sites.

[1] Before investigating this type of issue, first verify that it is not due to a Scheduled Maintenance.
[2] Site, service, or resource outages may be caused by a Scheduled Maintenance.
[3] A WiMAX issue arises when the frequency or transmit power of a campus WiMAX base station changes. These variables are fixed by site administrators and negotiated via a GENI-Sprint agreement. An event of this type should be very rare, but if monitoring reports a change, handling the event is high priority.
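Note [1] asks operators to rule out Scheduled Maintenance before investigating an issue. One way to picture that check is as a comparison of the event's site and time against known maintenance windows. The sketch below assumes windows are available as simple records. The `MaintenanceWindow` type, the `in_scheduled_maintenance` function, and the site names in the usage note are all hypothetical; the source does not describe how maintenance schedules are stored.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class MaintenanceWindow(NamedTuple):
    site: str        # hypothetical site identifier
    start: datetime  # window start (inclusive)
    end: datetime    # window end (inclusive)


def in_scheduled_maintenance(
    site: str, when: datetime, windows: Iterable[MaintenanceWindow]
) -> bool:
    """Return True if `when` falls inside any maintenance window for `site`."""
    return any(w.site == site and w.start <= when <= w.end for w in windows)
```

Given a window for a placeholder site `site-a` from 08:00 to 12:00 on some day, an outage reported at 09:30 at `site-a` would be attributed to maintenance, while the same report at a different site or at 13:00 would proceed to the normal debugging procedure.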

GENI Operations Daily Checks

Procedures for Daily Checks define how the GENI Operations team should verify the state of GENI resources and escalate as needed. A GENI Daily Check Template is available for defining new procedures.

Daily Check | Daily Check Description
CHK-001-A: GENI Racks Security Checks | Defines daily security checks for InstaGENI rack logs.
CHK-001-B: GENI Stitching Computation Service Security Checks | Defines daily security checks for the Stitching Computation Service (SCS).
CHK-001-C: GENI AL2S AM Security Checks | Defines daily security checks for the GENI AL2S Aggregate Manager.
CHK-001-D: GENI Clearinghouse Security Checks | Defines daily security checks for the GENI Clearinghouse.
CHK-001-E: GENI Portal Security Checks | Defines daily security checks for the GENI Portal.
CHK-001-F: GENI WiMAX Checks | Defines daily security checks for WiMAX sites.
CHK-002: GENI Experiment Resources Checks | Defines a daily check for all compute and network resources in GENI.
CHK-003: GENI Network Stitching Checks | Defines a daily check of GENI Network stitching resources.
CHK-004: GENI Long Running Slice Checks | Defines a daily check for a long-running experiment named triangle, a 3-node topology continuously exchanging traffic.
CHK-005: GENI Network Connectivity OpenFlow Checks | Defines a daily check of the GENI network through connectivity tests using OpenFlow resources.

Completed Work

- GMOC Security Procedure