OPS-### GENI Procedure Name

GENI Operation Procedures define the handling of events and issues for the GENI Environment. The source of GENI network events may include:

  1. A person reports a problem by email or phone to an entity, which may or may not be the GMOC. The reporter may be a GENI Experimenter, GENI site administrator, GENI tools contact, or GENI Operations member. GENI Operations includes the GPO, Rack Teams, UKY Ops, and the GMOC.
  2. A tool or log reports a problem that should be investigated.
  3. A scheduled event starts.

Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. There are exceptions, such as a security or emergency event, where a public ticket is not desired. For a scheduled event, a ticket should already exist, and the GMOC only tracks the completion of the event.

Each procedure must identify the types of event sources for which it defines handling instructions.

1. Issue Reported

GMOC gathers technical details for failures including:

  • Requester Organization
  • Requester Name
  • Requester email
  • Requester GENI site-name
  • Slice Name, any site sliver details available
  • Problem symptoms and impact
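
The details above can be thought of as one record per reported issue. The following is a minimal Python sketch, not part of any GMOC tooling: the class name `IssueReport`, its field names, and the helper `missing_fields` are hypothetical and chosen only for illustration. It shows how a report could be checked for blank fields so that GMOC knows what to request in a follow-up.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class IssueReport:
    """Hypothetical record of the details gathered for a reported failure."""
    requester_organization: str = ""
    requester_name: str = ""
    requester_email: str = ""
    requester_site_name: str = ""
    slice_name: str = ""
    sliver_details: str = ""  # gathered only when available
    symptoms_and_impact: str = ""


# Assumption: sliver details are optional, every other field is required.
OPTIONAL_FIELDS = {"sliver_details"}


def missing_fields(report: IssueReport) -> List[str]:
    """Return the names of required fields that are still blank."""
    return [
        f.name
        for f in fields(report)
        if f.name not in OPTIONAL_FIELDS and not getattr(report, f.name).strip()
    ]


# Made-up example values, for illustration only.
report = IssueReport(
    requester_name="Example Reporter",
    requester_email="reporter@example.org",
    slice_name="example-slice",
    symptoms_and_impact="Sliver creation fails at one site",
)
print(missing_fields(report))
```

In this example the organization and site name are still blank, so those are the two items GMOC would ask the reporter for.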

GMOC classifies the problem priority based on its urgency, which determines the level of attention that should be applied:

  • Critical: Needs immediate attention from GMOC staff and reporter until the situation is resolved. Should be resolved within 1 business day.
  • High: Should be investigated within 1 business day and resolved within 2 business days.
  • Normal: Usually a routine installation/provisioning or reporter-initiated maintenance. Should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window.
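
The targets above translate into due dates once a report date is known. The sketch below is illustrative only: it assumes that "1 business week" means 5 business days, that business days are Monday through Friday, and that holidays are ignored. The names `TARGETS`, `add_business_days`, and `due_dates` are hypothetical. Scheduled events are not covered, since they follow their own scheduled window.

```python
from datetime import date, timedelta
from typing import Dict

# Targets in business days, taken from the priority list above.
# Assumption: "1 business week" is treated as 5 business days.
TARGETS = {
    "Critical": {"resolve": 1},
    "High": {"investigate": 1, "resolve": 2},
    "Normal": {"investigate": 5, "resolve": 5},
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def due_dates(priority: str, reported: date) -> Dict[str, date]:
    """Map each target for `priority` to its due date."""
    return {
        step: add_business_days(reported, n)
        for step, n in TARGETS[priority].items()
    }


# A High priority issue reported on Thursday 2015-06-04.
print(due_dates("High", date(2015, 6, 4)))
```

For that example the investigation is due on Friday 2015-06-05 and the resolution on Monday 2015-06-08, because the weekend is skipped.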

1.1 GENI Event Type Prioritization

As events are reported, the GMOC must determine the priority of each ticket. A High priority event may be deemed Critical if the person reporting the issue identifies it as Critical, for example if the issue impacts a demo, a training session, or a conference. The following are guidelines for prioritizing issues reported by a person or tool source. This prioritization does not cover scheduled events, which fall under the Normal priority classification.

| Priority | # | Event type | Dispatch | Procedure |
|---|---|---|---|---|
| Critical | 1 | Emergency Stop and LLR | GMOC | GMOC Emergency Stop; GMOC LLR procedures |
| Critical | 2 | Security Event [*] | GMOC | OPS-002-A: System Security Events procedure |
| Critical | 3 | GENI Clearinghouse/Portal Event | GMOC | GENI Clearing House; GENI Portal procedures |
| Critical | 4 | Stitching Computation Service | GMOC | GMOC I2 SCS Procedure |
| Critical | 5 | AL2S Aggregate Event | GMOC | GMOC OESS Procedures |
| Critical | 6 | GENI Stitching Event | GMOC | OPS-003-B: Network Stitching Experiment Debugging Procedure |
| High | 7 | WiMAX Multicast VLANs Events | GMOC | OPS-006-B: GENI WiMAX Dataplane Debugging |
| High | 8 | Regional (AM+Switches+links) | Rack or Site Group | <<Insert OPS procedures links>> |
| High | 9 | Site Events reported by site contact | GMOC | OPS-004-A: Site Power Outage; OPS-005-A: Scheduled Site Maintenance |
| High | 10 | Site Events (AM+Switches+links reported by experimenters or tools) [**] | Rack or Site Group | OPS-004-A: Site Power Outage; OPS-005-A: Scheduled Site Maintenance; OPS-006-A: GENI WiMAX Base Station Debugging |
| High | 11 | Experimenter Tools Events (Portal, jacks, omni..) | Tools Contacts | |
| High | 12 | Monitoring Infrastructure Events | UKY Monitoring team | OPS-001-A: Creating Monitoring Event Alerts; OPS-001-B: Adding Monitoring Sites; OPS-001-C: Monitoring System Outage |

[*] Security Events start as Critical and may be re-prioritized upon investigation.
[**] Some Site Events may affect multiple sites (ExoSM) or non-GENI functions (CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.
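
The prioritization guidelines above amount to a lookup from event type to priority and dispatch target, plus the rule that a High priority event may be raised to Critical at the reporter's request. The sketch below is one possible encoding, offered as an illustration and not as an existing GMOC tool. The dictionary keys, the `Rule` type, and the `classify` function are hypothetical names; the priority, rank, and dispatch values are copied from the table.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    priority: str
    rank: int
    dispatch: str


# Keys are hypothetical short identifiers; values are copied from the table above.
RULES = {
    "emergency_stop_llr": Rule("Critical", 1, "GMOC"),
    "security": Rule("Critical", 2, "GMOC"),
    "clearinghouse_portal": Rule("Critical", 3, "GMOC"),
    "stitching_computation_service": Rule("Critical", 4, "GMOC"),
    "al2s_aggregate": Rule("Critical", 5, "GMOC"),
    "geni_stitching": Rule("Critical", 6, "GMOC"),
    "wimax_multicast_vlans": Rule("High", 7, "GMOC"),
    "regional": Rule("High", 8, "Rack or Site Group"),
    "site_reported_by_site_contact": Rule("High", 9, "GMOC"),
    "site_reported_by_experimenter_or_tool": Rule("High", 10, "Rack or Site Group"),
    "experimenter_tools": Rule("High", 11, "Tools Contacts"),
    "monitoring_infrastructure": Rule("High", 12, "UKY Monitoring team"),
}


def classify(event_type: str, reporter_says_critical: bool = False) -> Rule:
    """Look up the guideline for an event type, raising High to Critical on request."""
    rule = RULES[event_type]
    if rule.priority == "High" and reporter_says_critical:
        rule = rule._replace(priority="Critical")
    return rule


print(classify("experimenter_tools"))
print(classify("experimenter_tools", reporter_says_critical=True))
```

The second call models a tools problem that affects a demo: the dispatch target stays the same and only the priority changes.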

1.2 Create Ticket

The procedure must describe the type of information that is to be collected for the specific issue, providing names and locations of potential files to be collected, such as log files, configuration files, resource information, etc. The information will vary based on the event type of the procedure.

The GMOC ticketing system is used to capture issue information. GMOC may follow up to request additional information as the problem is investigated. The ticket creation operation results in an email notification to the reporter. Subsequent updates and interactions between GMOC and reporter will also generate notifications to the issue reporter.

2. Investigate and Identify Response

Each GENI Operations procedure must provide a description of the problem it addresses, along with the symptoms of the problem and any impact on GENI functionality and availability. The GMOC isolates the source of the reported problem by mapping the symptoms described in the initial problem report to one of the GENI Operations procedures.

Once the problem is identified, the response may be simply to hand off the issue to a different operations group. For example, if the problem is an ExoGENI rack issue, the GMOC does not investigate; it dispatches the issue to the ExoGENI Operations team and notifies the reporter about the hand-off. As part of the hand-off, all problem details are shared along with the problem prioritization.

2.1 Investigate the Problem

Each procedure must provide a list of symptoms that can be used to identify the problem. In addition, procedures must define the impact of the failure or symptom being documented whenever possible. GMOC reviews the list of symptoms captured in the procedure to identify which problem is being addressed.

2.2 Identify Potential Response

Each procedure provides a detailed Response Procedure, which captures the actions required to address the problem along with the potential impact of each action. GMOC reviews these potential response actions and decides which is the best course of action.

3. GMOC Response

The GMOC implements the actions identified in the procedure response and updates the ticket to capture the actions taken. In some scenarios the GMOC may dispatch a problem to other organizations. The following table lists the organizations that will provide support, by area of responsibility:

| Team | Area of Responsibility/Tools |
|---|---|
| GPO Dev Team | GENI Tools (gcf, omni, stitcher), GENI Portal, GENI Clearinghouse |
| RENCI Dev Team | ExoGENI Racks, ExoGENI Stitching |
| GENI Operations | InstaGENI Racks |
| UKY Operations Team | GENI Monitoring System, Stitching Computation System |
| Utah Dev Team | Jack Tool, CloudLab, Emulab, Apt |
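
When dispatching, the table above is read in reverse: the affected tool or area is known, and the owning team has to be found. A minimal sketch of that reverse lookup follows. The names `RESPONSIBILITY`, `TEAM_BY_AREA`, and `team_for` are hypothetical, case-insensitive matching is an assumption, and the function returns `None` for anything not listed instead of guessing an owner.

```python
from typing import Optional

# Copied from the table above.
RESPONSIBILITY = {
    "GPO Dev Team": ["gcf", "omni", "stitcher", "GENI Portal", "GENI Clearinghouse"],
    "RENCI Dev Team": ["ExoGENI Racks", "ExoGENI Stitching"],
    "GENI Operations": ["InstaGENI Racks"],
    "UKY Operations Team": ["GENI Monitoring System", "Stitching Computation System"],
    "Utah Dev Team": ["Jack Tool", "CloudLab", "Emulab", "Apt"],
}

# Invert the table so that each tool or area name maps to its team.
TEAM_BY_AREA = {
    area.lower(): team
    for team, areas in RESPONSIBILITY.items()
    for area in areas
}


def team_for(area: str) -> Optional[str]:
    """Return the team responsible for `area`, or None if it is not listed."""
    return TEAM_BY_AREA.get(area.strip().lower())


print(team_for("omni"))
print(team_for("ExoGENI Racks"))
```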

3.1 Implement Response

In this section, the procedure must provide an easy-to-follow, step-by-step set of instructions to address the problem. The expected outcome of each step must also be captured.

The GMOC executes the steps outlined. Implementing the response may take a few iterations, as some attempts may not yield the expected results. GMOC may have to go back and try further actions when new symptoms occur or when the procedure is found to be lacking. In both cases, an update to the procedure may be required, and action should be taken to get the procedure updated.
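
The pairing of each step with its expected outcome can be sketched as a small loop that stops at the first step whose observed outcome differs, so that the mismatch can be recorded in the ticket and fed back into the procedure. This is an illustration under stated assumptions, not a description of how GMOC runs procedures: `Step`, `run_procedure`, and the example steps are hypothetical, and outcomes are simplified to plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str
    action: Callable[[], str]  # performs the step and returns the observed outcome
    expected: str              # expected outcome documented in the procedure


def run_procedure(steps: List[Step]) -> Tuple[List[Tuple[str, str, bool]], bool]:
    """Run steps in order; stop at the first unexpected outcome.

    Returns the log of (description, observed, matched) entries and a flag
    indicating that the procedure should be reviewed for an update.
    """
    log = []
    for step in steps:
        observed = step.action()
        matched = observed == step.expected
        log.append((step.description, observed, matched))
        if not matched:
            return log, True
    return log, False


# Made-up steps, for illustration only.
steps = [
    Step("Check link state", lambda: "up", "up"),
    Step("Restart service", lambda: "failed", "running"),
    Step("Verify sliver", lambda: "ok", "ok"),
]
log, needs_review = run_procedure(steps)
for entry in log:
    print(entry)
print("procedure needs review:", needs_review)
```

Here the second step does not produce its expected outcome, so the third step is not attempted and the procedure is flagged for review.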

3.2 Procedure Updates

When a procedure is found to be missing symptoms, required actions, or potential impacts, the GMOC must provide feedback so that the procedure can be improved for future use.

4. Resolution

GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. A scheduled event may also be postponed, in which case its ticket remains open until the next scheduled time.

4.1 Document Resolution and Close Ticket

GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.

Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket opened), the outcome should result in a notification back to the problem reporter.

For a scheduled event, the ticket may be closed, or it may be rescheduled when the event cannot be completed in the scheduled time.
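
The closing rules above can be summarized as two operations on a ticket: close it (optionally pointing at a follow-up ticket for the remaining issue) or leave it open when a scheduled event is rescheduled. Both produce a message for the reporter. The sketch below is hypothetical throughout: the `Ticket` fields, the function names, and the message wording are assumptions and do not reflect the GMOC ticketing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    ticket_id: int
    summary: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)


def close_ticket(ticket: Ticket, resolution: str,
                 follow_up: Optional[Ticket] = None) -> str:
    """Record the resolution, close the ticket, and return the reporter notification."""
    ticket.notes.append(resolution)
    ticket.status = "closed"
    message = f"Ticket {ticket.ticket_id} closed: {resolution}"
    if follow_up is not None:
        # Partial resolution: the remaining issue is tracked in a new ticket.
        message += f" Remaining issue tracked in ticket {follow_up.ticket_id}."
    return message


def reschedule(ticket: Ticket, note: str) -> str:
    """Record the rescheduling; the ticket stays open until the next scheduled time."""
    ticket.notes.append(note)
    return f"Ticket {ticket.ticket_id} remains open: {note}"


# Made-up tickets, for illustration only.
original = Ticket(101, "Stitching failure between two sites")
remaining = Ticket(102, "Intermittent failure at one site")
print(close_ticket(original, "VLAN translation corrected.", follow_up=remaining))

maintenance = Ticket(103, "Scheduled site maintenance")
print(reschedule(maintenance, "Maintenance postponed to the next window."))
```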

Last modified on 06/04/15 09:15:05