Changes between Initial Version and Version 1 of GENIOperationsTrial/MonitoringSystemOutage


Timestamp:
06/18/15 08:05:56 (9 years ago)
Author:
lnevers@bbn.com

  • GENIOperationsTrial/MonitoringSystemOutage

    v1 v1  
     1[[PageOutline(1-2)]]
     2
     3= OPS-001-B Monitoring System Outage =
     4
     5This procedure defines the steps to detect and correct a GENI Monitoring System Outage.
     6
     7A Monitoring System Outage can be reported by an experimenter, a GENI Rack Team or by an operations group.
     8Regardless of the source of the reported event, a ticket must be written to track the outage through to resolution. The ticket must copy the issue reporter and __does__ generate notifications to GENI users.
     9
     10= 1. Issue Reported =
     11
     12GMOC gathers the technical details for a Monitoring System Outage including:
     13
     14 - Requester Organization
     15 - Requester Name
     16 - Requester email
     17 - When the outage was first noticed.
     18
     19
     20== 1.1 GENI Event Type Prioritization ==
     21
     22
     23GMOC should classify a Monitoring System Outage as `High` priority.
     24
     25
     26== 1.2 Create Ticket ==
     27
     28The GMOC ticketing system is used to capture the information above. GMOC may follow up to request additional information while the Monitoring System Outage is ongoing. Creating the ticket results in the requester receiving a ticket email.
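
As an illustration only, the details gathered in section 1 and the priority from section 1.1 could be held in a small record before they are entered into the ticketing system. The class name, field names, and the `missing_fields` helper below are hypothetical; the actual GMOC ticketing system's schema is not described here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OutageReport:
    """Details GMOC gathers for a Monitoring System Outage (hypothetical record)."""

    requester_organization: str
    requester_name: str
    requester_email: str
    first_noticed: datetime
    # Section 1.1 classifies this event type as High priority.
    priority: str = "High"

    def missing_fields(self) -> list:
        """Return the names of text fields that were left blank."""
        text_fields = {
            "requester_organization": self.requester_organization,
            "requester_name": self.requester_name,
            "requester_email": self.requester_email,
        }
        return [name for name, value in text_fields.items() if not value.strip()]
```

A report with a non-empty `missing_fields()` result is one where GMOC would likely need to follow up with the requester.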
     29
     30
     31= 2. Investigate and Identify Response =
     32
     33There are two main types of outages:
     34
     35 1. The [http://genimon.uky.edu/ main visualization web site] is not responding or is showing an error (an HTTP error code or an error message).
     36 1. A data store has been identified as unresponsive or as returning invalid responses (HTTP errors, unparsable JSON, or a JSON-formatted error message).
     37
     38
     39== 2.1 Investigate the Problem ==
     40
     41GMOC will first confirm that it can reproduce the reported issue.
     42
     43If the issue is with the main visualization web site, GMOC will try to log in and confirm that the site is down or responding with an error.
     44
     45If the issue is with a data store, GMOC can issue the following command:
     46
     47`curl -k --cert <collector cert> <data store URL>`
     48
     49and observe the type of error that is returned.
     50
     51''Note:''
     52 * The collector cert is one of the certificates issued by the Clearing House to each of the parties involved in the Ops monitoring project.
     53 * The data store URL should be provided by the reporter. If it is not, a data store URL can be found by logging into the [http://genimon.uky.edu/ main visualization web site] and looking at the page for the specific aggregate. The data store URL appears on the Information panel as the !SelfRef value.
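
The failure types listed in section 2 can be told apart mechanically once the HTTP status and response body from the command above are in hand. The following is a minimal sketch, not part of the monitoring software: the function name and returned labels are hypothetical, and it assumes that a JSON-formatted error message is a JSON object with a top-level `error` key, which the procedure does not specify.

```python
import json


def classify_response(status: int, body: str) -> str:
    """Classify a data store reply into one of the failure types in section 2.

    The labels returned here are illustrative only.
    """
    if status >= 400:
        return "http_error"
    try:
        payload = json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return "unparsable_json"
    # Assumption: an error reply is a JSON object with a top-level "error" key.
    if isinstance(payload, dict) and "error" in payload:
        return "json_error_message"
    return "ok"
```

For example, `classify_response(200, "<html>")` returns `"unparsable_json"`, which is the case where a data store answers with something other than JSON.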
     54 
     55== 2.2 Identify Potential Response ==
     56
     57GMOC will notify the team or teams responsible for fixing the issue.
     58
     59 1. In the case of an issue with the [http://genimon.uky.edu/ main visualization web site], the 'UKY Operations / Dev Team' is in charge of investigating the problem.
     60 1. In the case of an issue with a data store, GMOC will identify the team responsible for the investigation. The team in charge is:
     61   a. GENI Ops for an InstaGENI rack data store
     62   a. RENCI !Ops/Dev for an ExoGENI rack data store
     63   a. OpenGENI Ops for an OpenGENI rack data store
     64   a. GMOC for the AL2S data store
     65   a. GPO for the GPO external check store and ops config data store.
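
The assignments above amount to a lookup table from the affected component to the responsible team. A minimal sketch of that table follows; the dictionary keys and the `responsible_team` function are hypothetical names chosen for illustration, while the team names are taken from the list above.

```python
# Keys are illustrative component identifiers; values are the teams named above.
RESPONSIBLE_TEAM = {
    "main_site": "UKY Operations / Dev Team",
    "instageni": "GENI Ops",
    "exogeni": "RENCI Ops/Dev",
    "opengeni": "OpenGENI Ops",
    "al2s": "GMOC",
    "gpo_external_check": "GPO",
    "gpo_ops_config": "GPO",
}


def responsible_team(component: str) -> str:
    """Return the team in charge of investigating the given component."""
    try:
        return RESPONSIBLE_TEAM[component.lower()]
    except KeyError:
        # An unknown component needs a human decision rather than a default.
        raise ValueError(f"unknown component: {component!r}") from None
```

Raising on an unknown component, rather than falling back to a default team, keeps a ticket from being silently dispatched to the wrong group.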
     66
     67
     68The team in charge of the investigation will use log files and process checks to determine what caused the issue.
     69
     70= 3. GMOC Response =
     71
     72GMOC will dispatch the ticket to the team or teams responsible for fixing the issue, identified above.
     73
     74= 4. Resolution =
     75
     76
     77GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool or log that originally signaled the problem. For a scheduled event, GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully. Scheduled event tickets may also be postponed and remain open until the next scheduled time.
     78
     79== 4.1 Document Resolution and Close Ticket ==
     80
     81When notified by the team to which the ticket was dispatched, GMOC will verify that the problem is resolved by following the same steps as in section 2.1 above.
     82GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.
     83
     84Whether the problem is fully resolved (ticket closed) or partially resolved (new ticket opened), the outcome should be reported back to the problem reporter.
     85