OPS-004-A Site Power Outage

This procedure describes how to handle a GENI Site power outage for both scheduled and unscheduled events. A GENI Site power outage event may be reported by the GENI Monitoring System, a site contact or an experimenter. Regardless of the source of the reported event, a ticket must be written to handle the investigation and resolution of the problem. The ticket must copy the issue reporter and the GENI Experimenters at geni-users@googlegroups.com.

1. Issue Reported

GMOC gathers technical details for the power outage including:

  • Reporting Organization
  • Reporter Name
  • Reporter email
  • GENI site-name
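
The details above can be held in a small record before the ticket is opened. The following is a minimal sketch, not the GMOC ticketing system's actual schema: the class name OutageReport, its field names, the cc_list helper and the sample values are all hypothetical. Only the four fields and the geni-users address come from this procedure.

```python
from dataclasses import dataclass
from typing import List

# Mailing list named in this procedure for experimenter notification.
GENI_USERS_LIST = "geni-users@googlegroups.com"


@dataclass(frozen=True)
class OutageReport:
    """Hypothetical record of the details GMOC gathers for a power outage."""
    reporting_organization: str
    reporter_name: str
    reporter_email: str
    site_name: str

    def cc_list(self) -> List[str]:
        """Addresses the ticket must copy: the reporter and GENI experimenters."""
        return [self.reporter_email, GENI_USERS_LIST]


# Sample values are invented for illustration only.
report = OutageReport(
    reporting_organization="Example University",
    reporter_name="Jane Doe",
    reporter_email="jane.doe@example.edu",
    site_name="nps",
)
print(report.cc_list())
```

Keeping the copy list as a method on the record means the reporter and the experimenter list cannot be forgotten when the ticket is written.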

1.1 GENI Event Type Prioritization

GENI Site Power outage events fall under two categories:

  • Unscheduled - A Critical issue that requires Experimenter notification.
  • Scheduled - Most likely a High priority issue, unless it has a major impact on major services or components of GENI. Also requires Experimenter notification.
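
The prioritization rule in section 1.1 can be sketched as a small function. This is an illustration under assumptions, not part of the GMOC tooling: the names Priority and outage_priority are hypothetical, and the sketch assumes that a scheduled outage with major impact is raised to Critical, which the procedure implies but does not state outright.

```python
from enum import Enum


class Priority(Enum):
    CRITICAL = "Critical"
    HIGH = "High"


def outage_priority(scheduled: bool, major_impact: bool = False) -> Priority:
    """Map a site power outage to a ticket priority.

    Unscheduled outages are Critical. Scheduled outages are High unless
    they have a major impact on major GENI services or components, in
    which case this sketch assumes Critical.
    """
    if not scheduled or major_impact:
        return Priority.CRITICAL
    return Priority.HIGH


print(outage_priority(scheduled=False).value)
print(outage_priority(scheduled=True).value)
print(outage_priority(scheduled=True, major_impact=True).value)
```

Both categories require Experimenter notification, so the priority only affects how the ticket is ranked, not whether experimenters are told.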

1.2 Create Ticket

The GMOC ticketing system is used to capture issue information. GMOC may follow up to request additional information as the problem is investigated. The ticket creation operation results in an email notification to the reporter. Subsequent updates and interactions between GMOC and reporter will also generate notifications to the issue reporter.

For GENI Site Power Outages it is crucial to contact site contacts and notify experimenters.

2. Investigate and Identify Response

2.1 Investigate the Problem

When a Site Power Outage is reported, first check existing scheduled activities to verify that this is not a scheduled event:

  1. If scheduled outage, see section 2.2.1
  2. If unscheduled outage, see section 2.2.2
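
The check against scheduled activities can be sketched as a lookup over planned outage windows. This is a minimal sketch under assumptions: the ScheduledOutage record, the function names and the sample dates are hypothetical, and the actual source of scheduled activities is not specified by this procedure.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ScheduledOutage:
    """Hypothetical calendar entry for a planned site power outage."""
    site_name: str
    start: datetime
    end: datetime


def find_scheduled_outage(
    site_name: str, reported_at: datetime, schedule: Iterable[ScheduledOutage]
) -> Optional[ScheduledOutage]:
    """Return the planned outage covering this report, or None if unscheduled."""
    for entry in schedule:
        if entry.site_name == site_name and entry.start <= reported_at <= entry.end:
            return entry
    return None


def next_section(
    site_name: str, reported_at: datetime, schedule: Iterable[ScheduledOutage]
) -> str:
    """Pick the section of this procedure that applies to the report."""
    planned = find_scheduled_outage(site_name, reported_at, schedule)
    return "2.2.1" if planned is not None else "2.2.2"


# Sample window invented for illustration only.
schedule = [
    ScheduledOutage("nps", datetime(2015, 6, 20, 8, 0), datetime(2015, 6, 20, 12, 0))
]
print(next_section("nps", datetime(2015, 6, 20, 9, 30), schedule))  # 2.2.1
print(next_section("nps", datetime(2015, 6, 21, 9, 30), schedule))  # 2.2.2
```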

2.2 Identify Potential Response

2.2.1 Scheduled Power Outage

For a Scheduled Power Outage, a ticket exists that tracks the planned activity to completion. GENI Experimenters are notified (geni-users@googlegroups.com).

2.2.2 Unscheduled Power Outage

A power outage can be reported by a GENI User or can be detected via the GPO Nagios system by selecting "Services":

  • For a given site, for example NPS, the status of the "is_available" and "is_responsive" service checks will be "CRITICAL" for both Aggregate Managers (i.e. nps-ig and nps-ig-of).
  • All checks from GENI racks to the site in question (for example, gpo-ig_to_nps-ig_campus, assuming NPS is the site in question) will also be in the "CRITICAL" state.
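
The two conditions above can be combined into a single test. The sketch below works on plain dictionaries of check states and does not call any Nagios API: the function name power_outage_suspected and the input shapes are hypothetical, and the states would have to be copied or exported from the "Services" view by some other means.

```python
from typing import Dict, Iterable, Tuple

CRITICAL = "CRITICAL"
# Service checks named in this procedure for each Aggregate Manager.
AGGREGATE_SERVICES = ("is_available", "is_responsive")


def power_outage_suspected(
    aggregates: Iterable[str],
    service_states: Dict[Tuple[str, str], str],
    campus_check_states: Dict[str, str],
) -> bool:
    """Return True when every check for the site is CRITICAL.

    aggregates: the site's Aggregate Managers, e.g. ["nps-ig", "nps-ig-of"].
    service_states: maps (aggregate, service) to its reported state.
    campus_check_states: maps checks from other racks to the site,
        e.g. "gpo-ig_to_nps-ig_campus", to their reported state.
    """
    aggregates = list(aggregates)
    if not aggregates or not campus_check_states:
        return False  # no evidence either way, so do not flag an outage
    aggregates_down = all(
        service_states.get((agg, svc)) == CRITICAL
        for agg in aggregates
        for svc in AGGREGATE_SERVICES
    )
    paths_down = all(state == CRITICAL for state in campus_check_states.values())
    return aggregates_down and paths_down


nps_aggregates = ["nps-ig", "nps-ig-of"]
nps_services = {
    (agg, svc): CRITICAL for agg in nps_aggregates for svc in AGGREGATE_SERVICES
}
nps_paths = {"gpo-ig_to_nps-ig_campus": CRITICAL}
print(power_outage_suspected(nps_aggregates, nps_services, nps_paths))
```

A missing check state is treated as not CRITICAL, so incomplete data never flags an outage on its own. A True result only means an outage is suspected; the Rack team and Site Contact still confirm the cause.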

The GMOC notifies the Rack team about the outage.

3. GMOC Response

For a scheduled power outage event, the GMOC monitors progress towards completion and updates the ticket, which results in notifications to site contacts, GENI experimenters and Rack teams.

If an unscheduled power outage is suspected, the GMOC contacts the appropriate rack team.

The respective rack team will contact the Site Contact to investigate the failure. Once the ticket is dispatched to the rack team, the GMOC follows the same instructions as for a Scheduled Power Outage: it monitors progress towards completion and updates the ticket, which results in notifications to site contacts, GENI experimenters and Rack teams.

3.1 Implement Response

The GMOC executes the response steps outlined in section 3.

3.2 Procedure Updates

If the instructions in a procedure are found to be missing symptoms, required actions, or potential impacts, the GMOC must provide feedback so that the procedure can be improved for future use.

4. Resolution

GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, the GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. A scheduled event may also be postponed, in which case its ticket remains open until the next scheduled time.

4.1 Document Resolution and Close Ticket

GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.

Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket opened), the outcome should result in a notification back to the problem reporter.

For a scheduled event, the ticket may be closed, or rescheduled when the event cannot be completed in the scheduled time.

Last modified on 06/18/15 08:10:42