
Version 2 (modified by lnevers@bbn.com, 9 years ago)


OPS-001-C Monitoring System Outage

This procedure defines the steps to detect and correct a GENI Monitoring System Outage.

A Monitoring System Outage can be reported by an experimenter, a GENI Rack Team, or an operations group. Regardless of the source of the report, a ticket must be written to track the event through completion. The ticket must copy the issue reporter, and it generates notifications to GENI users.

1. Issue Reported

GMOC gathers the technical details for a Monitoring System Outage including:

  • Requester Organization
  • Requester Name
  • Requester email
  • When the outage was first noticed.

1.1 GENI Event Type Prioritization

GMOC should classify a Monitoring System Outage as High priority.

1.2 Create Ticket

The GMOC ticketing system is used to capture the information above. GMOC may follow up to request additional information while the Monitoring System Outage is ongoing. Creating the ticket sends a ticket email to the requester.
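
As an illustration only, the details gathered in section 1 could be held in a small record before they are entered into the ticketing system. This is a sketch under stated assumptions, not the GMOC ticketing system's actual schema: the class name `OutageReport`, its field names, and the `missing_fields` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageReport:
    """Details gathered for a Monitoring System Outage (hypothetical field names)."""

    requester_organization: str
    requester_name: str
    requester_email: str
    first_noticed: str  # free-form text, exactly as given by the reporter
    priority: str = "High"  # section 1.1: outages are classified as High priority

    def missing_fields(self):
        """Return the names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not str(value).strip()]
```

A non-empty result from `missing_fields()` would signal that GMOC needs to follow up with the requester for additional information.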

2. Investigate and Identify Response

There are two main types of outages:

  1. The main visualization web site is not responding or is showing an error (an HTTP error code or an error message).
  2. A data store has been identified as unresponsive or as returning invalid responses (HTTP errors, unparsable JSON, or a JSON-formatted error message).

2.1 Investigate the Problem

GMOC will first confirm that it can reproduce the reported issue.

If the issue is with the main visualization web site, GMOC will try to log in and confirm that the site is down or responding with an error.

If the issue is with a data store, GMOC can issue the following command:

curl -k --cert <collector cert> <data store URL>

and see the type of error that ensues.

Note:

  • The collector cert is one of the cryptographic certificates issued by the Clearing House to each of the parties involved in the Ops monitoring project.
  • The data store URL should be provided by the reporter. If it is not, it can be found by logging into the main visualization web site and looking at the page for the specific aggregate. The data store URL appears on the Information panel as the SelfRef value.
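
The failure types listed in section 2 (HTTP errors, unparsable JSON, JSON-formatted error message) can be told apart mechanically once the HTTP status code and the response body are known. The following is a minimal sketch, not part of the GENI tooling. The function name and the returned labels are hypothetical, and it assumes that a JSON-formatted error message is a JSON object with a top-level "error" key, which this procedure does not specify.

```python
import json


def classify_response(status_code, body):
    """Return a short label for the kind of data store failure, or "ok"."""
    # HTTP errors: any 4xx or 5xx status code.
    if status_code >= 400:
        return "http_error"
    # Unparsable JSON: the body cannot be decoded at all.
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return "unparsable_json"
    # Assumption: an error reply is a JSON object with a top-level "error" key.
    if isinstance(payload, dict) and "error" in payload:
        return "json_error_message"
    return "ok"
```

The status code and body would come from the curl command above. Recording the resulting label in the ticket gives the responsible team a starting point for its investigation.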

2.2 Identify Potential Response

GMOC will notify the team or teams responsible for fixing the issue.

  1. In the case of an issue with the main visualization web site, the 'UKY Operations / Dev Team' is in charge of investigating the problem.
  2. In the case of an issue with a data store, GMOC will identify the team responsible for the investigation. The team in charge is:
    1. GENI Ops for an InstaGENI rack data store
    2. RENCI Ops/Dev for an ExoGENI rack data store
    3. OpenGENI Ops for an OpenGENI rack data store
    4. GMOC for the AL2S data store
    5. GPO for the GPO external check store and ops config data store

The team in charge of the investigation will use log files and check running processes to determine what caused the issue.
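
The routing described in this section amounts to a lookup from the affected component to the responsible team. A minimal sketch follows. The team names are taken from the list above, while the component keys (such as "instageni" or "gpo_ops_config") and the function name are hypothetical labels chosen for illustration.

```python
# Component keys are hypothetical labels; team names come from section 2.2.
RESPONSIBLE_TEAM = {
    "visualization_site": "UKY Operations / Dev Team",
    "instageni": "GENI Ops",
    "exogeni": "RENCI Ops/Dev",
    "opengeni": "OpenGENI Ops",
    "al2s": "GMOC",
    "gpo_external_check": "GPO",
    "gpo_ops_config": "GPO",
}


def responsible_team(component):
    """Return the team that investigates an outage of the given component."""
    try:
        return RESPONSIBLE_TEAM[component.lower()]
    except KeyError:
        # Unknown components need a manual decision by GMOC.
        raise ValueError(f"no responsible team known for {component!r}") from None
```

For example, `responsible_team("ExoGENI")` returns "RENCI Ops/Dev", and an unrecognized component raises `ValueError` so that it is not silently misrouted.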

3. GMOC Response

GMOC will dispatch the ticket to the team or teams responsible for fixing the issue, identified above.

4. Resolution

GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool or log that originally signaled the problem. For a scheduled event, GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. A scheduled event ticket may also be postponed and remain open until the next scheduled time.

4.1 Document Resolution and Close Ticket

When notified by the team to which the ticket was dispatched, GMOC will verify that the problem is resolved by following the same steps as in section 2.1 above. GMOC records how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.

Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket opened), the outcome should be reported back to the problem reporter.