Changes between Initial Version and Version 1 of LuisaSandbox/GENIOperationsTrial/Template


Timestamp:
06/04/15 09:15:05 (9 years ago)
Author:
lnevers@bbn.com
Comment:

--

  • LuisaSandbox/GENIOperationsTrial/Template

    v1 v1  
     1[[PageOutline(1-2)]]
     2
     3= OPS-### GENI Procedure Name =
     4
     5GENI Operation Procedures define the handling of events and issues for the GENI Environment. The source of GENI network events may include:
     6
     7 1. A person reports a problem via email or phone to an entity, which may or may not be the GMOC. The reporter may be a GENI Experimenter, GENI site administrator, GENI tools contact, or a GENI Operations member. GENI Operations includes: GPO, Rack Teams, UKY Ops, GMOC.
     8 2. A tool or log reports a problem that should be investigated. 
     9 3. A scheduled event starts.
     10
     11Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. There are exceptions, such as a security or emergency event, where a public ticket is not desired. For a scheduled event, a ticket should already exist and the GMOC only tracks the completion of the event.
     12
     13Each procedure must identify the types of event sources for which it defines handling instructions.
     14
     15= 1. Issue Reported =
     16
     17GMOC gathers technical details for failures including:
     18 - Requester Organization
     19 - Requester Name
     20 - Requester email
     21 - Requester GENI site-name
     22 - Slice Name, any site sliver details available
     23 - Problem symptoms and impact 
     24
     25GMOC classifies the problem priority based on its urgency, which determines the level of attention that should be applied:
     26 - `Critical`: Needs immediate attention from GMOC staff and reporter until the situation is resolved. Should be resolved within 1 business day.
     27 - `High`: Should be investigated within 1 business day and resolved within 2 business days.
     28 - `Normal`: Usually a routine installation/provisioning or reporter initiated maintenance. Should be investigated and closed within 1 business week, except for scheduled events (including scheduled maintenance events), which should be verified and closed by the end of their scheduled window.
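
The priority targets above can be expressed as a small lookup. The following is a minimal sketch and not part of any GMOC tooling: the names `PRIORITY_TARGETS`, `add_business_days`, and `resolution_due` are hypothetical, a business week is assumed to be 5 business days, and holidays are ignored.

```python
from datetime import date, timedelta

# Business-day targets per priority, taken from the list above.
# "investigate" is None where the text gives no separate investigation target.
# Scheduled events are not covered here: they close with their scheduled window.
PRIORITY_TARGETS = {
    "Critical": {"investigate": None, "resolve": 1},
    "High": {"investigate": 1, "resolve": 2},
    "Normal": {"investigate": 5, "resolve": 5},  # assumes 1 business week = 5 business days
}


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping Saturday and Sunday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def resolution_due(priority: str, reported: date) -> date:
    """Target resolution date for a ticket reported on `reported`."""
    return add_business_days(reported, PRIORITY_TARGETS[priority]["resolve"])
```

For example, a `High` ticket reported on a Thursday would be due the following Monday under these assumptions, because the weekend is skipped.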
     29
     30== 1.1 GENI Event Type Prioritization ==
     31
     32As events are reported, the GMOC must determine the ticket priority. `High` priority events may be deemed `Critical` if the person reporting the issue identifies it as `Critical`, for example if the issue impacts a demo, a training, or a conference.  Following are guidelines for prioritizing issues that are reported by a person or tool source. This prioritization does not cover scheduled events, which fall under the `Normal` priority classification.
     33
     34||'''Priority'''||'''#'''||'''Event type'''                ||'''Dispatch'''||'''Procedure'''||
     35|| Critical     || 1     ||Emergency Stop and LLR          || GMOC         || GMOC Emergency Stop; GMOC LLR procedures  ||
     36|| Critical     || 2     ||Security Event ''[*]''          || GMOC         || OPS-002-A: System Security Events procedure  ||
     37|| Critical     || 3     ||GENI !Clearinghouse/Portal Event|| GMOC         || GENI Clearing House; GENI Portal procedures||
     38|| Critical     || 4     ||Stitching Computation Service   || GMOC         || GMOC I2 SCS Procedure      ||
     39|| Critical     || 5     ||AL2S Aggregate Event            || GMOC         || GMOC OESS Procedures       ||
     40|| Critical     || 6     ||GENI Stitching Event            || GMOC         || OPS-003-B: Network Stitching Experiment Debugging Procedure ||
     41|| High         || 7     ||WiMAX Multicast VLANs Events    || GMOC         ||   OPS-006-B: GENI WiMAX Dataplane Debugging||
     42|| High         || 8     ||Regional (AM+Switches+links)    || Rack or Site Group || <<Insert OPS procedures links>> ||
     43|| High         || 9     ||Site Events reported by site contact|| GMOC         ||  OPS-004-A: Site Power Outage; OPS-005-A: Scheduled Site Maintenance ||
     44|| High         || 10     ||Site Events(AM+Switches+links reported by experimenters or tools) ''[**]''|| Rack or Site Group || OPS-004-A: Site Power Outage; OPS-005-A: Scheduled Site Maintenance; OPS-006-A: GENI WiMAX Base Station Debugging||
     45|| High         || 11     ||Experimenter Tools Events(Portal, jacks, omni..)|| Tools Contacts ||  ||
     46|| High         || 12    ||Monitoring Infrastructure Events|| UKY Monitoring team ||OPS-001-A: Creating Monitoring Event Alerts; OPS-001-B: Adding Monitoring Sites; OPS-001-C: Monitoring System Outage  ||
     47
     48 ''[*] Security Events start as Critical and may be re-prioritized upon investigation. [[BR]]''
     49 ''[**] Some `Site Events` may affect multiple sites (ExoSM) or non-GENI functions (!CloudLab, Emulab, Apt). These events require no special GMOC action and should be assigned to the team that owns the resources.''
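
The prioritization table and the `High` to `Critical` escalation rule can be sketched as a lookup. This is an illustrative sketch only: `EventRule`, `EVENT_RULES`, and `classify` are hypothetical names, and the event type strings are the table entries with wiki markup removed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EventRule:
    number: int    # the "#" column of the table
    priority: str  # default priority for the event type
    dispatch: str  # team the ticket is dispatched to


# One entry per row of the table above.
EVENT_RULES = {
    "Emergency Stop and LLR": EventRule(1, "Critical", "GMOC"),
    "Security Event": EventRule(2, "Critical", "GMOC"),
    "GENI Clearinghouse/Portal Event": EventRule(3, "Critical", "GMOC"),
    "Stitching Computation Service": EventRule(4, "Critical", "GMOC"),
    "AL2S Aggregate Event": EventRule(5, "Critical", "GMOC"),
    "GENI Stitching Event": EventRule(6, "Critical", "GMOC"),
    "WiMAX Multicast VLANs Events": EventRule(7, "High", "GMOC"),
    "Regional (AM+Switches+links)": EventRule(8, "High", "Rack or Site Group"),
    "Site Events reported by site contact": EventRule(9, "High", "GMOC"),
    "Site Events reported by experimenters or tools": EventRule(10, "High", "Rack or Site Group"),
    "Experimenter Tools Events": EventRule(11, "High", "Tools Contacts"),
    "Monitoring Infrastructure Events": EventRule(12, "High", "UKY Monitoring team"),
}


def classify(event_type: str, reporter_marks_critical: bool = False) -> Tuple[str, str]:
    """Return (priority, dispatch) for a person- or tool-reported event."""
    rule = EVENT_RULES[event_type]
    priority = rule.priority
    # A High event may be deemed Critical when the reporter identifies it as such.
    if priority == "High" and reporter_marks_critical:
        priority = "Critical"
    return priority, rule.dispatch
```

Scheduled events are deliberately left out of this lookup, since the table does not cover them and they fall under `Normal`.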
     50
     51
     52
     53== 1.2 Create Ticket ==
     54
     55The procedure must describe the type of information that is to be collected for the specific issue, providing names and locations of potential files to be collected, such as log files, configuration files, resource information, etc. The information will vary based on the event type of the procedure.
     56
     57The GMOC ticketing system is used to capture issue information. GMOC may follow up to request additional information as the problem is investigated. The ticket creation operation results in an email notification to the reporter.  Subsequent updates and interactions between GMOC and reporter will also generate notifications to the issue reporter.
     58
     59= 2. Investigate and Identify Response =
     60
     61Each GENI Operation Procedure must provide a description of the problem it addresses, along with the symptoms of the problem and any impact to GENI functionality and availability.  The GMOC isolates the source of the reported problem by mapping the symptoms described in the initial problem report to one of the GENI Operations procedures.
     62
     63Once the problem is identified, the response may be to simply hand off the issue to a different operations group.  For example, if the problem is an ExoGENI rack issue, do not investigate; dispatch it to the ExoGENI Operations team and notify the reporter about the hand-off. As part of the hand-off, all problem details are shared along with the problem prioritization.
     64
     65== 2.1 Investigate the Problem ==
     66
     67Each procedure must provide a list of symptoms that can be used to identify the problem. In addition, procedures must define the impact of the failure or symptom being documented whenever possible.
     68GMOC reviews the list of symptoms captured in the procedure to identify which problem is being addressed.
     69
     70== 2.2 Identify Potential Response ==
     71
     72Each procedure provides a detailed `Response Procedure`, which captures the actions required to address the problem along with the potential impact of each action. GMOC reviews these potential `Response Procedure` actions and decides which is the best course of action.
     73
     74= 3. GMOC Response =
     75
     76The GMOC implements the actions identified in the procedure response and updates the ticket to capture actions taken.  In some scenarios the GMOC may dispatch a problem to other organizations. The following table lists the organizations that provide support, by area of responsibility:
     77 
     78|| ''' Team '''        || ''' Area of !Responsibility/Tools''' ||
     79|| GPO Dev Team        || GENI Tools (gcf, omni, stitcher), GENI Portal, GENI Clearinghouse ||
     80|| RENCI Dev Team      || ExoGENI Racks, ExoGENI Stitching ||
     81|| GENI Operations     || InstaGENI Racks ||
     82|| UKY Operations Team || GENI Monitoring System, Stitching Computation System ||
     83|| Utah Dev Team       || Jack Tool, !CloudLab, Emulab, Apt||
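
The responsibility table can be inverted to find the team for a given tool or resource. This is a hedged sketch under the assumption that area names are matched case-insensitively; `TEAM_AREAS`, `AREA_TO_TEAM`, and `team_for` are hypothetical names, not part of any GENI tool.

```python
from typing import Optional

# Team -> areas of responsibility, copied from the table above.
TEAM_AREAS = {
    "GPO Dev Team": ["gcf", "omni", "stitcher", "GENI Portal", "GENI Clearinghouse"],
    "RENCI Dev Team": ["ExoGENI Racks", "ExoGENI Stitching"],
    "GENI Operations": ["InstaGENI Racks"],
    "UKY Operations Team": ["GENI Monitoring System", "Stitching Computation System"],
    "Utah Dev Team": ["Jack Tool", "CloudLab", "Emulab", "Apt"],
}

# Reverse index: lower-cased area -> owning team.
AREA_TO_TEAM = {
    area.lower(): team
    for team, areas in TEAM_AREAS.items()
    for area in areas
}


def team_for(area: str) -> Optional[str]:
    """Return the team responsible for `area`, or None if the table has no entry."""
    return AREA_TO_TEAM.get(area.strip().lower())
```

A `None` result means the table does not name an owner, so the dispatch decision is left to the GMOC.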
     84
     85== 3.1 Implement Response ==
     86
     87In this section, the procedure must provide a simple-to-follow, step-by-step set of instructions to address the problem. The expected outcome of each step must also be captured.
     88
     89The GMOC executes the steps outlined. The response implementation may take a few iterations, as some attempts may not yield the expected results. GMOC may have to go back and try further actions in cases where new symptoms occur, or where the procedure is found to be lacking.  In both cases, an update to the procedure may be required, and action should be taken to get it updated.
     90
     91
     92== 3.2 Procedure Updates ==
     93
     94When instructions in a procedure are found to miss symptoms, required actions, or potential impact, then action must be taken by the GMOC to provide feedback to enhance the procedure for future use.
     95
     96= 4. Resolution =
     97
     98GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully.  A scheduled event ticket may also be postponed and remain open until the next scheduled time.
     99
     100== 4.1 Document Resolution and Close Ticket ==
     101
     102GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.
     103
     104Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket open), both should result in notification back to the problem reporter.
     105
     106For a scheduled event, the ticket may be closed, or rescheduled if the event cannot be completed in the scheduled time.
     107