Changes between Version 2 and Version 3 of GENIOperationsTrial/MonitoringSystemOutage


Timestamp:
07/01/15 12:47:54 (9 years ago)
Author:
lnevers@bbn.com
Comment:

--

  • GENIOperationsTrial/MonitoringSystemOutage

    v2 v3  
    33= OPS-001-C Monitoring System Outage =
    44
    5 This procedure defines the steps to detect and correct a GENI Monitoring System Outage
    6 
    7 A Monitoring System Outage can be reported by an experimenter, a GENI Rack Team or by an operations group.
    8 Regardless of the source for the reported event, a ticket must be written to track the scheduled maintenance completion. Ticket must copy the issue reporter and __does__ generate notifications to GENI users.
     5This procedure defines the steps to detect and correct a GENI Monitoring System Outage. A Monitoring System Outage can be reported by an experimenter, a GENI Rack Team or by an operations group. Regardless of the source for the reported event, a ticket must be written to track the scheduled maintenance completion. Ticket must copy the issue reporter and __does__ generate notifications to GENI users.
    96
    107= 1. Issue Reported =
    118
    129GMOC gathers the technical details for a Monitoring System Outage including:
    13 
    1410 - Requester Organization
    1511 - Requester Name
     
    2016== 1.1 GENI Event Type Prioritization ==
    2117
    22 
    2318GMOC should classify a Monitoring System Outage as `High` priority.
    24 
    2519
    2620== 1.2 Create Ticket ==
     
    3428
    3529 1. The [http://genimon.uky.edu/ main visualization web site] is not responding or showing an error (HTTP error code, or displaying some error message).
    36  1. A data store has been identified as unresponsive or answering with invalid answers (HTTP errors, unparsable JSON or JSON-formatted error message)
     30 2. A data store has been identified as unresponsive or answering with invalid answers (HTTP errors, unparsable JSON or JSON-formatted error message).
    3731
    3832
     
    4135GMOC will make sure they see the reported issue.
    4236
    43 If the issue is with the main visualization web site, GMOC will try to log in and confirm that the site is down or responding with an error.
     37If the issue is with the [http://genimon.uky.edu/ main visualization web site], GMOC will try to log in and confirm that the site is down or responding with an error.
    4438
    4539If the issue is with a data store, GMOC can issue the following command:
    4640
    47 ''curl -k --cert <collector cert> <data store URL>''
     41 ''curl -k --cert <collector cert> <data store URL>''
    4842
    4943and see the type of error that ensues.
    5044
    5145''Note:''
    52  * the collector cert is one of the crypto certificates issued by the Clearing House to each of the different interested parties of the Ops monitoring project.
    53  * the data store URL should be provided by the reporter. If not, one can find a data store URL by logging into the [http://genimon.uky.edu/ main visualization web site], and look at the specific aggregate page. The data store URL appears on the Information panel for the !SelfRef value.
     46 * The collector certificate is one of the crypto certificates issued by the GENI Clearing House to each of the different interested parties of the GENI Operational Monitoring project.
     47 * The data store URL should be provided by the reporter. If not, one can find a data store URL by logging into the [http://genimon.uky.edu/ main visualization web site], and look at the specific aggregate page. The data store URL appears on the `Information` panel for the !SelfRef value.
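The failure types named in this procedure (unresponsive, HTTP error, unparsable JSON, JSON-formatted error message) can be told apart once the response to the `curl` command is in hand. The following is a minimal sketch, not part of the procedure itself. The function name and the returned labels are hypothetical, and it assumes that a JSON-formatted error message is a JSON object with a top-level `error` key, which the procedure does not specify.

```python
import json


def classify_data_store_response(status_code, body):
    """Return a short label for the kind of data store failure, or 'ok'.

    status_code: HTTP status as an int, or None if no response arrived.
    body: response body as a string (may be None or empty).
    """
    if status_code is None:
        # No HTTP response at all: the data store is unresponsive.
        return "unresponsive"
    if status_code >= 400:
        # The server answered, but with an HTTP error code.
        return "http_error"
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return "unparsable_json"
    # Assumption: an error message is a JSON object with an "error" key.
    if isinstance(payload, dict) and "error" in payload:
        return "json_error_message"
    return "ok"
```

The label returned here could be copied into the ticket so that the team receiving it knows which kind of invalid answer was observed.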
    5448 
    5549== 2.2 Identify Potential Response ==
    5650
    5751GMOC will notify the team or teams responsible for fixing the issue.
    58 
    5952 1. In the case of an issue with the [http://genimon.uky.edu/ main visualization web site], the 'UKY Operations / Dev Team' team is in charge of investigating the problem.
    60  1. In the case of an issue with a data store, GMOC will identify the team responsible for the investigation. The team in charge is:
    61    a. GENI Ops for an InstaGENI rack data store
     53 2. In the case of an issue with a data store, GMOC will identify the team responsible for the investigation. Responsibility is as follows:
     54   a. GENI Operations for an InstaGENI rack data store
    6255   a. RENCI !Ops/Dev for an ExoGENI rack data store
    6356   a. OpenGENI Ops for an OpenGENI rack data store.
    6457   a. GMOC for the AL2S data store
    6558   a. GPO for the GPO external check store and ops config data store.
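The responsibility list above amounts to a lookup from the affected component to a team. A minimal sketch of that lookup follows. The dictionary keys and the function name are hypothetical labels chosen for illustration; only the team names are taken from the procedure.

```python
# Hypothetical keys; team names are those listed in the procedure.
DATA_STORE_TEAMS = {
    "instageni": "GENI Operations",
    "exogeni": "RENCI Ops/Dev",
    "opengeni": "OpenGENI Ops",
    "al2s": "GMOC",
    "gpo_external_check": "GPO",
    "gpo_ops_config": "GPO",
}

MAIN_SITE_TEAM = "UKY Operations / Dev Team"


def team_for_issue(component, data_store_type=None):
    """Return the team that investigates, or None if it is not listed."""
    if component == "main_site":
        return MAIN_SITE_TEAM
    if component == "data_store":
        return DATA_STORE_TEAMS.get(data_store_type)
    return None
```

A `None` result means the procedure names no owner for that component, so GMOC would need to identify the team manually.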
    66 
    6759
    6860The team in charge of investigation will use log files and check processes to determine what caused the issue.
     
    7466= 4. Resolution =
    7567
    76 
    7768GMOC verifies that the problem is no longer happening by coordinating with the problem reporter or by checking the tool/log that originally signaled the problem. For a scheduled event, GMOC coordinates with the person who originally scheduled the event to make sure that it was completed successfully.  Scheduled event tickets may also be postponed and remain open until the next scheduled time.
    7869
    7970== 4.1 Document Resolution and Close Ticket ==
    8071
    81 When notified by the team that was dispatched the ticket, GMOC will verify that the problem is resolved by following the same steps as in section 2.1 above.
     72When notified by the team that was dispatched for the ticket, GMOC will verify that the problem is resolved by following the same steps as in section 2.1 above.
     73
    8274GMOC captures how the problem was resolved in the ticket and closes the ticket. If the solution does not fully resolve the problem, a new ticket may be created to track the remaining issue.
    8375