Changes between Initial Version and Version 1 of GENIOperationsTrial/SitePowerOutage


Timestamp:
06/18/15 08:10:42 (9 years ago)
Author:
lnevers@bbn.com
Comment:

--

  • GENIOperationsTrial/SitePowerOutage

    v1 v1  
     1[[PageOutline(1-2)]]
     2
     3= OPS-004-A Site Power Outage =
     4
     5
     6This procedure describes how to handle a GENI Site power outage for both scheduled and unscheduled events. A GENI Site power outage event may be reported by the GENI Monitoring System, a site contact or an experimenter. Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. The ticket must copy the issue reporter and the GENI Experimenters at geni-users@googlegroups.com.
     7 
     8= 1. Issue Reported =
     9
     10GMOC gathers technical details for the power outage including:
     11 - Reporting Organization
     12 - Reporter Name
     13 - Reporter email
     14 - GENI site-name
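As an illustration only, the details listed above could be held in a small record that flags what is still missing before the ticket is written. The class name, field names, and helper below are hypothetical and are not part of the GMOC ticketing system.

```python
from dataclasses import dataclass, fields


@dataclass
class OutageReport:
    """Details gathered for a site power outage report (hypothetical record)."""
    reporting_organization: str = ""
    reporter_name: str = ""
    reporter_email: str = ""
    site_name: str = ""


def missing_details(report):
    """Return the names of the fields that are still empty, in declaration order."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


# Example: the reporter's name and the site are known, the rest is not.
report = OutageReport(reporter_name="A. Reporter", site_name="NPS")
print(missing_details(report))
```

A non-empty result indicates that GMOC still needs to follow up with the reporter.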
     15
     16== 1.1 GENI Event Type Prioritization ==
     17
     18GENI Site Power outage events fall under two categories:
     19  - Unscheduled - A `Critical` issue that requires Experimenter notification.
     20  - Scheduled - Most likely a `High` priority issue, unless it has a major impact on major services or components of GENI. Also requires Experimenter notification.
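The two categories above can be sketched as a small decision function. This is a minimal sketch: the function name and the `major_impact` flag are hypothetical, and raising a scheduled outage to `Critical` when it has major impact is an assumption, since the procedure only says that `High` is the most likely priority.

```python
EXPERIMENTER_LIST = "geni-users@googlegroups.com"  # notified for both categories


def outage_priority(scheduled, major_impact=False):
    """Return the ticket priority for a GENI site power outage."""
    if not scheduled:
        # Unscheduled outages are always Critical.
        return "Critical"
    # Assumption: a scheduled outage with major impact is escalated to Critical;
    # the procedure itself only states that High is the most likely priority.
    return "Critical" if major_impact else "High"


print(outage_priority(scheduled=False))
print(outage_priority(scheduled=True))
```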
     21
     22== 1.2 Create Ticket ==
     23
     24The GMOC ticketing system is used to capture issue information. GMOC may follow up to request additional information as the problem is investigated. The ticket creation operation results in an email notification to the reporter.  Subsequent updates and interactions between GMOC and reporter will also generate notifications to the issue reporter.
     25
     26For GENI Site Power Outages it is crucial to contact site contacts and notify experimenters.
     27
     28= 2. Investigate and Identify Response =
     29
     30== 2.1 Investigate the Problem ==
     31
     32When a Site Power Outage is reported, first check existing scheduled activities to verify that this is not a scheduled event:
     33  a. If scheduled outage, see section 2.2.1
     34  b. If unscheduled outage, see section 2.2.2
     35
     36== 2.2 Identify Potential Response ==
     37
     38=== 2.2.1 Scheduled Power Outage ===
     39
     40For a Scheduled Power Outage, a ticket already exists that tracks the planned activity to completion. GENI Experimenters are notified at geni-users@googlegroups.com.
     41
     42=== 2.2.2 Unscheduled Power Outage ===
     43
     44A power outage can be reported by a GENI User or can be detected via the [http://tamassos.gpolab.bbn.com/nagios3/ GPO Nagios] system by selecting "Services":
     45
     46  - For a given site, for example NPS, the status of the "is_available" and "is_responsive" service checks will be "CRITICAL" for both Aggregate Managers (i.e. nps-ig and nps-ig-of).
     47  - All checks from GENI racks to the site in question (for example, gpo-ig_to_nps-ig_campus, assuming NPS is the site in question) will also be in the "CRITICAL" state.
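The two symptoms above can be combined into a single check. The sketch below assumes the check states have already been copied from the Nagios "Services" view into plain dictionaries; the function name and the dictionary layout are hypothetical, and no Nagios API is used.

```python
AM_SERVICES = ("is_available", "is_responsive")


def suspect_power_outage(aggregates, am_checks, path_checks):
    """Return True when every check for the site is in the CRITICAL state.

    aggregates  -- Aggregate Manager names at the site, e.g. ["nps-ig", "nps-ig-of"]
    am_checks   -- {(aggregate, service): state} for the Aggregate Manager checks
    path_checks -- {check_name: state} for the checks from other racks to the site
    """
    am_down = bool(aggregates) and all(
        am_checks.get((agg, svc)) == "CRITICAL"
        for agg in aggregates
        for svc in AM_SERVICES
    )
    paths_down = bool(path_checks) and all(
        state == "CRITICAL" for state in path_checks.values()
    )
    return am_down and paths_down


# Example using the NPS names from the text above.
aggregates = ["nps-ig", "nps-ig-of"]
am_checks = {(agg, svc): "CRITICAL" for agg in aggregates for svc in AM_SERVICES}
path_checks = {"gpo-ig_to_nps-ig_campus": "CRITICAL"}
print(suspect_power_outage(aggregates, am_checks, path_checks))
```

A `True` result only suggests a power outage; a network failure could produce the same symptoms, so the rack team still has to confirm the cause.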
     48
     49The GMOC notifies the Rack team about the outage. 
     50
     51= 3. GMOC Response =
     52
     53For a scheduled power outage event, the GMOC monitors progress towards completion and updates the ticket, which results in notifications to site contacts, GENI experimenters and Rack teams.
     54
     55If an unscheduled power outage is suspected, the GMOC contacts the appropriate rack team:
     56 * For InstaGENI racks contact GENI-OPS at [mailto:geni-ops@googlegroups.com].
     57 * For ExoGENI racks contact ExoGENI-OPS at [mailto:exogeni-ops@renci.org].
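The contact routing above amounts to a lookup by rack type. A minimal sketch follows; the dictionary and function names are hypothetical, and only the two addresses come from this procedure.

```python
RACK_TEAM_CONTACTS = {
    "InstaGENI": "geni-ops@googlegroups.com",  # GENI-OPS
    "ExoGENI": "exogeni-ops@renci.org",        # ExoGENI-OPS
}


def rack_team_contact(rack_type):
    """Return the rack team address for a rack type, or raise ValueError."""
    try:
        return RACK_TEAM_CONTACTS[rack_type]
    except KeyError:
        raise ValueError(f"No rack team contact known for rack type {rack_type!r}")


print(rack_team_contact("InstaGENI"))
```

Raising an error for an unknown rack type, instead of falling back to a default address, keeps a ticket from being silently dispatched to the wrong team.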
     58
     59The respective rack team will contact the Site Contact to investigate the failure. Once the ticket is dispatched to the rack team, the GMOC follows the same instructions as for a Scheduled Power Outage: it monitors progress towards completion and updates the ticket, which results in notifications to site contacts, GENI experimenters and Rack teams.
     60
     61== 3.1 Implement Response ==
     62
     63The GMOC executes the response steps outlined above for the identified outage type.
     64
     65== 3.2 Procedure Updates ==
     66
     67If the instructions in a procedure are found to miss symptoms, required actions, or potential impact, the GMOC must provide feedback so that the procedure can be enhanced for future use.
     68
     69= 4. Resolution =
     70
     71GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. A scheduled event may also be postponed, in which case its ticket remains open until the next scheduled time.
     72
     73== 4.1 Document Resolution and Close Ticket ==
     74
     75GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.
     76
     77Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket open), the outcome should be reported back to the problem reporter.
     78
     79For a scheduled event, the ticket may be closed, or rescheduled when the event cannot be completed in the scheduled time.
     80