GENI Operations Trial
This page captures procedures and documentation required for the GENI Operations Trials.
GENI Operations Responsibilities
GENI Operations groups track activities and hand them off to each other based on the type of GENI event that occurs. The GMOC provides a public support line/email, tracking, and a public ticket view for all GENI experimenters and operators. The following table lists the organizations that provide support, by area of operations responsibility:
Team | mail list | Area of Responsibility/Tools |
GMOC Ops | gmoc@grnoc.iu.edu | Tracking and Escalation, Emergency Stop, Legal, Law Enforcement and Regulatory Requests, New Site Tracking, GENI AL2S services (OESS, OESS-AM, GENI SCS, FSFW, GRNOC coordination) |
GPO Dev | gpo-sw-dev@geni.net | GENI Tools (gcf, omni, stitcher), GENI Clearinghouse and Portal |
GPO Ops | geni-ops@googlegroups.com | Stitching, Escalation Response, OpenFlow |
MAX Dev | maxyang@umd.edu and tlehman@umd.edu | SCS Escalation |
Nick Bastin Dev* | nick.bastin@gmail.com | FOAM and FlowVisor |
RENCI Ops/Dev | exogeni-ops@renci.org | ExoGENI Racks, ExoGENI Stitching |
UKY Ops/Dev | geni-ops@googlegroups.com | GENI Monitoring System, Stitching Computation System (IG rack endpoints) (Note, UKY is coming up to speed on IG racks) |
U of Utah Ops | geni-ops@googlegroups.com | InstaGENI Racks (Note, Utah is gradually transitioning some IG activities to UKY) |
OpenGENI Ops | TBD | OpenGENI Racks |
Utah Dev | geni-dev-utah@flux.utah.edu | Jacks Tool, Emulab, Apt |
* FOAM and FlowVisor (FV) events are not included in the current Operations Trial.
Other operations groups may support GENI for particular aggregates or particular services.
Team | mail list | Area of Responsibility/Tools |
CloudLab Operations | cloudlab-ops@cloudlab.us | CloudLab |
ESNET Operations | | |
iMINDS Operations | vwall-ops@atlantis.ugent.be | iMinds Aggregates, iMinds SCS |
Starlight Operations | | |
GENI Event Type Prioritization and Dispatch
As events are reported, the operations team must determine their priority in order to respond appropriately. Critical events require immediate attention and should be resolved within 1 business day. High priority events should be investigated within 1 business day and resolved within 2 business days. Normal priority events should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window.
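The response targets above can be expressed as a small deadline calculation. The following is a minimal sketch, not an official GENI tool: the names `TARGETS`, `add_business_days`, and `deadlines` are hypothetical, a business week is assumed to be 5 business days, "immediate attention" is modeled as 0 days, and holidays are ignored.

```python
from datetime import date, timedelta

# Business-day targets per priority: (investigate within, resolve within).
# Assumption: "immediate attention" is 0 days, and 1 business week is 5 days.
TARGETS = {
    "Critical": (0, 1),
    "High": (1, 2),
    "Normal": (5, 5),
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def deadlines(priority: str, reported: date) -> tuple[date, date]:
    """Return (investigate_by, resolve_by) for an event reported on `reported`."""
    investigate, resolve = TARGETS[priority]
    return (
        add_business_days(reported, investigate),
        add_business_days(reported, resolve),
    )
```

For example, `deadlines("High", date(2015, 6, 5))` treats a Friday report as due for investigation the following Monday and for resolution on Tuesday. Scheduled events are outside this sketch, since they close at the end of their own scheduled window.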
The GMOC must also set appropriate priorities for related GENI tickets. High priority events may be raised to Critical if the person reporting the issue identifies it as Critical based on the reported circumstances. For example, if an issue will impact a demo, training session, or conference, the reporter may ask for it to be treated as Critical. The following guidelines prioritize issues that are reported by a person or by a tool (such as a Nagios alert). This prioritization does not cover scheduled events or routine requests, which fall under the Normal priority classification, below all events listed here. Dispatch in the table below indicates the group with primary responsibility to begin and track the operations response in an area.
Priority | # | Event type | Dispatch | Procedure |
Critical | 1 | Emergency Stop and LLR | GMOC | GMOC Emergency Stop procedure and GMOC LLR procedure |
Critical | 2 | Security Event [*] | GMOC | OPS-002-A: Security: System Security Events |
Critical | 3 | GENI Clearinghouse/Portal Event | GMOC | <<TBD>> |
Critical | 4 | Stitching Computation Service | GMOC | GMOC I2 SCS Procedure exists |
Critical | 5 | AL2S Aggregate Event | GMOC | GMOC OESS Procedures exist |
Critical | 6 | GENI Stitching Event | GMOC | OPS-003-B: Network Stitching Experiment Debugging Procedure |
High | 7 | WiMAX Multipoint VLANs Events | GMOC | OPS-006-B: GENI WiMAX Dataplane Debugging |
High | 8 | Regional (AM+Switches+links) | Rack or Site Group | <<TBD>> |
High | 9 | Site Events reported by site contact | GMOC | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance |
High | 10 | Site Events (AM+Switches+links reported by experimenters or tools) [**] | Rack or Site Group | OPS-004-A: Site Power Outage, OPS-005-A: Scheduled Site Maintenance, OPS-006-A: GENI WiMAX Base Station Debugging |
High | 11 | Experimenter Tools Events (Portal, jacks, omni, ...) | Tools Contacts | |
High | 12 | Monitoring Infrastructure Events | UKY Monitoring team | OPS-001-C: Monitoring: Monitoring System Outage |
Normal | 13 | Monitoring Infrastructure Setup | UKY Monitoring team / GPO | OPS-001-A: Monitoring: Creating Monitoring Event Alerts, OPS-001-B: Monitoring: Adding Monitoring Sites |
[*] Security Events start as Critical and may be re-prioritized upon investigation.
[**] Some Site Events may affect multiple sites (ExoSM) or non-GENI functions (CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.
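The prioritization table and the rule for raising High events to Critical can be sketched as a lookup. This is an illustrative sketch only: the short event keys, the `Rule` class, and the `triage` function are hypothetical names, and the source does not name a dispatch group for unlisted (scheduled or routine) events, so the sketch returns `None` for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    number: int    # the "#" column of the table
    priority: str  # default priority for the event type
    dispatch: str  # group with primary responsibility


# Rows copied from the table; the dictionary keys are hypothetical labels.
RULES = {
    "emergency_stop_llr": Rule(1, "Critical", "GMOC"),
    "security": Rule(2, "Critical", "GMOC"),
    "clearinghouse_portal": Rule(3, "Critical", "GMOC"),
    "scs": Rule(4, "Critical", "GMOC"),
    "al2s_aggregate": Rule(5, "Critical", "GMOC"),
    "stitching": Rule(6, "Critical", "GMOC"),
    "wimax_multipoint_vlan": Rule(7, "High", "GMOC"),
    "regional": Rule(8, "High", "Rack or Site Group"),
    "site_reported_by_contact": Rule(9, "High", "GMOC"),
    "site_reported_by_experimenter_or_tool": Rule(10, "High", "Rack or Site Group"),
    "experimenter_tools": Rule(11, "High", "Tools Contacts"),
    "monitoring_infrastructure": Rule(12, "High", "UKY Monitoring team"),
    "monitoring_setup": Rule(13, "Normal", "UKY Monitoring team / GPO"),
}


def triage(event_type: str, reporter_says_critical: bool = False) -> tuple[str, Optional[str]]:
    """Return (priority, dispatch group) for a reported event."""
    rule = RULES.get(event_type)
    if rule is None:
        # Scheduled events and routine requests are Normal; the source
        # does not say who they are dispatched to.
        return ("Normal", None)
    priority = rule.priority
    # Only High events may be raised to Critical at the reporter's request.
    if priority == "High" and reporter_says_critical:
        priority = "Critical"
    return (priority, rule.dispatch)
```

For instance, `triage("experimenter_tools", reporter_says_critical=True)` raises a tools issue that threatens a demo to Critical while leaving it with the Tools Contacts group.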
GENI Monitoring Tools and Documentation
The GENI Monitoring System is available to report status for resources, services, and experimenters in GENI. See the GENI Monitoring wiki page for information on features, reporting, and alerts for GENI infrastructure. Some pages that may be helpful initially:
- Getting an Account - An overview of account types and instructions on how to get an account.
- GENI Monitoring Overview - An overview of the functionality, with highlights of potential usage and available functions, and notes on restrictions or potential issues for experimenters.
- GENI Monitoring Resources Check - An example of how to use monitoring for a specific experiment's resources, how to access slice, sliver, VLAN, and interface information, and how to determine the mapping between these objects.
- GENI Monitoring Site Issue Investigation - An example of how to investigate a site-specific issue using monitoring to determine resource allocation and the source of the issue.
Other tools are available to determine resource status in GENI:
- A GENI Nagios System provides host status and reports on alerts based on GENI Monitoring events.
- The GENI TRAC Ticket System captures known issues.
- The GENI NSF GitHub sites also track issues for:
  - geni-portal - the user interface for a GENI clearinghouse
  - geni-ch - the GENI clearinghouse services
  - geni-tools - GENI tools including omni, stitcher, sample aggregate manager, etc. Formerly "gcf."
- The GMOC public ticket system
The GMOC, rack teams, and (maybe) Nick Bastin use other operations tools that are not covered in this wiki. Tools that are required for operations should be mentioned in the relevant procedures, along with a link to documentation and (when available) source code.
GENI Operations Procedures
Procedures for common operational events define how the GENI Operations team should respond to, track, and resolve events. A GENI Operations Template is available for defining new procedures.
Procedure Name | Procedure Description |
OPS-001-A: Monitoring: Creating Monitoring Event Alerts | Instructions on how to define alerts for operations events. |
OPS-001-B: Monitoring: Adding Monitoring Sites | Defines the required information to request the addition, modification, or removal of a GENI site in monitoring. |
OPS-001-C: Monitoring: Monitoring System Outage | Defines how to report and respond to monitoring system events such as a collector or GUI outage, non-critical bugs, or incorrect measurements |
OPS-002-A: Security: System Security Events | Describes how to report a security event to the GENI Monitoring team, and how to respond to that event. This type of event can also include Emergency Stop events. |
OPS-003-A: GENI Network: Dataplane Debugging | Defines the processes for detecting and debugging data plane and control plane issues in GENI experiments. [1] |
OPS-003-B: GENI Network: Stitching Experiment Debugging | Defines how to detect and debug issues with the stitching of Layer 2 VLANs in GENI Network Stitching experiments. |
OPS-004-A: Site Power Outage | Defines how to handle a site power failure and how to verify that the site's aggregate is fully restored upon recovery [1] |
OPS-005-A: Scheduled Site Maintenance | Defines how to schedule one-time or periodic maintenance activities [2] |
OPS-006-A: GENI WiMAX Base Station Debugging | Defines how to identify and remedy GENI WiMAX base station configuration issues at a WiMAX site. [3] |
OPS-006-B: GENI WiMAX Dataplane Debugging | Defines how to identify and remedy issues with GENI WiMAX multipoint VLAN(s) between sites |
[1] Before investigating this type of issue, first verify that it is not due to a Scheduled Maintenance.
[2] Site, service, or resource outages may be caused by a Scheduled Maintenance.
[3] A WiMAX issue arises when the frequency or transmit power of a campus WiMAX base station changes. These variables are fixed by site administrators and negotiated via a GENI-Sprint agreement. An event of this type should be very rare, but if monitoring reports a change, handling the event is high priority.
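Footnote [1] asks operators to rule out a Scheduled Maintenance before investigating. A minimal sketch of that check follows, assuming maintenance windows are available as simple records; the `MaintenanceWindow` type, the `covering_window` function, and the site names used with them are hypothetical and do not reflect the GMOC ticket system's actual data model.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class MaintenanceWindow(NamedTuple):
    site: str        # hypothetical site identifier
    start: datetime  # window start (inclusive)
    end: datetime    # window end (inclusive)


def covering_window(
    site: str, when: datetime, windows: Iterable[MaintenanceWindow]
) -> Optional[MaintenanceWindow]:
    """Return the first maintenance window covering `when` at `site`, or None."""
    for window in windows:
        if window.site == site and window.start <= when <= window.end:
            return window
    return None
```

If `covering_window` returns a window, the outage is likely explained by the maintenance and can be tracked under that scheduled event. If it returns `None`, the relevant debugging procedure applies.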
GENI Operations Daily Checks
Procedures for Daily Checks define how the GENI Operations team should verify the state of GENI resources and escalate as needed. A GENI Daily Check Template is available for defining new procedures.
Daily Checks | Daily Checks Description |
CHK-001-A: GENI Racks Security Checks | Defines daily security checks for InstaGENI rack logs |
CHK-001-B: GENI Stitching Computation Service Security Checks | Defines daily security checks for Stitching Computation Service (SCS) |
CHK-001-C: GENI AL2S AM Security Checks | Defines daily security checks for GENI AL2S Aggregate Manager |
CHK-001-D: GENI Clearinghouse Security Checks | Defines daily security checks for GENI Clearinghouse |
CHK-001-E: GENI Portal Security Checks | Defines daily security checks for GENI Portal |
CHK-001-F: GENI WiMAX Checks | Defines daily security checks for WiMAX sites |
CHK-002: GENI Experiment Resources Checks | Defines a daily check for all compute and network resources in GENI. |
CHK-003: GENI Network Stitching Checks | Defines a daily check of GENI Network stitching resources |
CHK-004: GENI Long Running Slice Checks | Defines a daily check for a long-running experiment named triangle, a 3-node topology continuously exchanging traffic. |
CHK-005: GENI Network Connectivity OpenFlow Checks | Defines a daily check of the GENI network through connectivity tests using OpenFlow resources |