Version 3 (modified by sblais@bbn.com, 9 years ago)

CHK-002: GENI Experiment Resources Checks

The GENI Experiment Resources Checks procedures define how to verify that:

  • GENI aggregates are operating as expected in the GENI Environment.
  • Monitoring data stores are operating as expected in the GENI Environment.

1.0 GENI Aggregates Availability Check

1.1 Goals of GENI Aggregates Availability Check

All GENI aggregates provide management of GENI resources (nodes and links) through an Aggregate Manager (AM) API. This check verifies that the AM APIs at the various GENI aggregates are behaving as expected.

1.2 Steps for GENI Aggregates Availability Check

  1. Log on to the alerting system.
  2. Select Service Group / Summary in the left pane.
  3. In the "GENI aggregates and services availability" group row, check for the presence of CRITICAL or PENDING services under the "Service Status Summary" column.
  4. Click on the OK link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the OK state.
  5. Sort the services by the "Last Check" column values (click on the up (ascending) orange arrow). Make sure that the time stamps are all within the last 15 minutes or so.
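
The time stamp review in step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names are hypothetical, the "Last Check" values are assumed to have been copied out of the alerting system by hand or by some export, and the 15-minute window is taken from the step above.

```python
from datetime import datetime, timedelta

# Freshness window from step 5 ("within the last 15 minutes or so").
MAX_AGE = timedelta(minutes=15)

def stale_services(last_checks, now, max_age=MAX_AGE):
    """Return (name, last_check) pairs older than max_age, oldest first.

    last_checks maps a service name to its "Last Check" time stamp.
    """
    # Sorting ascending mirrors clicking the "up" arrow on the column.
    ordered = sorted(last_checks.items(), key=lambda item: item[1])
    return [(name, ts) for name, ts in ordered if now - ts > max_age]

# Hypothetical sample data, not real GENI service names.
now = datetime(2015, 6, 1, 12, 0)
last_checks = {
    "am-example-1": datetime(2015, 6, 1, 11, 55),
    "am-example-2": datetime(2015, 6, 1, 10, 30),
}
print(stale_services(last_checks, now))
```

An empty result means every listed service was checked within the window.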

1.3 GENI Aggregates Availability Check - Pass Criteria

This check passes if there are no CRITICAL or PENDING services in step 3 of the steps above,

AND

if the time stamps of the OK services are recent (within the last 15 minutes or so) in step 5 of the steps above.
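
The two pass conditions combine as a logical AND, which the following Python sketch makes explicit. It assumes each service is represented as a hypothetical (state, last_check) pair; the procedure does not mention other states such as WARNING, so the sketch ignores them.

```python
from datetime import datetime, timedelta

def check_passes(services, now, max_age=timedelta(minutes=15)):
    """Return True when the pass criteria hold for one service group.

    services is a list of (state, last_check) pairs; last_check may be
    None for a PENDING service, which has never been checked.
    """
    states = [state for state, _ in services]
    # Condition 1: no CRITICAL or PENDING services (step 3).
    if "CRITICAL" in states or "PENDING" in states:
        return False
    # Condition 2: every OK service has a recent time stamp (step 5).
    return all(
        now - last_check <= max_age
        for state, last_check in services
        if state == "OK"
    )

now = datetime(2015, 6, 1, 12, 0)
print(check_passes([("OK", now - timedelta(minutes=5))], now))
```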

1.4 GENI Aggregates Availability Check - Fail Criteria and Escalation

If there are CRITICAL services in step 3 above:

  1. Click on the CRITICAL link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the CRITICAL state.
  2. Sort the services by the "Last Check" column values (click on the up (ascending) orange arrow). Make sure that the time stamps are all within the last 15 minutes or so.

If the time stamps are within the accepted range, the services are genuinely in the CRITICAL state. If the time stamps are not within the accepted range, something is amiss in the monitoring system and is preventing timely status updates.

If there are PENDING services in step 3 above:

  1. Click on the PENDING link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the PENDING state.

A PENDING state means that the monitoring system has never reported on the availability status of a particular aggregate.

Escalation: If there are availability services in CRITICAL states: Report to ??? GMOC team - gmoc@grnoc.iu.edu
Escalation: If there are availability services in PENDING states: Report to UKY team - ???
Escalation: If there are availability services with stale time stamps: Report to UKY team - ???
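
The fail criteria amount to sorting each problem service into one of three escalation cases. A minimal Python sketch of that decision follows, assuming the same hypothetical (state, last_check) representation and a 15-minute window. A stale time stamp is evaluated before CRITICAL, since a stale reading points at the monitoring system rather than at the service itself. The labels returned are illustrative; the contacts remain those listed in the escalation lines above.

```python
from datetime import datetime, timedelta

def triage(state, last_check, now, max_age=timedelta(minutes=15)):
    """Classify one service as 'pending', 'stale', 'critical', or 'ok'."""
    # PENDING: the monitoring system has never reported on this service.
    if state == "PENDING":
        return "pending"
    # Stale time stamp: status updates are not arriving in time.
    if last_check is None or now - last_check > max_age:
        return "stale"
    # Recent time stamp and CRITICAL: the service really is failing.
    if state == "CRITICAL":
        return "critical"
    return "ok"

now = datetime(2015, 6, 1, 12, 0)
print(triage("CRITICAL", now - timedelta(minutes=3), now))
```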

2.0 GENI Data Stores Responsiveness Check

2.1 Goals of GENI Data Stores Responsiveness Check

There are various monitoring data stores throughout the GENI Environment. This check verifies that the data stores are responsive, i.e. reporting data and generally behaving as expected.

2.2 Steps for GENI Data Stores Responsiveness Check

  1. Log on to the alerting system.
  2. Select Service Group / Summary in the left pane.
  3. In the "GENI data stores responsiveness" group row, check for the presence of CRITICAL or PENDING services under the "Service Status Summary" column.
  4. Click on the OK link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the OK state.
  5. Sort the services by the "Last Check" column values (click on the up (ascending) orange arrow). Make sure that the time stamps are all within the last 15 minutes or so.
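
The "Service Status Summary" column read in step 3 is a per-group count of service states. The Python sketch below shows that tally and the step 3 test, assuming the states have been collected as hypothetical (group, state) pairs; it does not query the alerting system itself.

```python
from collections import Counter

def status_summary(services):
    """Tally states per service group from (group, state) pairs."""
    summary = {}
    for group, state in services:
        summary.setdefault(group, Counter())[state] += 1
    return summary

def groups_needing_attention(summary):
    """Return groups with at least one CRITICAL or PENDING service."""
    # Counter returns 0 for states that never occurred in a group.
    return sorted(
        group for group, counts in summary.items()
        if counts["CRITICAL"] or counts["PENDING"]
    )

# Group names are taken from the procedure; the states are made up.
sample = [
    ("GENI data stores responsiveness", "OK"),
    ("GENI data stores responsiveness", "PENDING"),
    ("GENI aggregates and services availability", "OK"),
]
print(groups_needing_attention(status_summary(sample)))
```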

2.3 GENI Data Stores Responsiveness Check - Pass Criteria

This check passes if there are no CRITICAL or PENDING services in step 3 of the steps above,

AND

if the time stamps of the OK services are recent (within the last 15 minutes or so) in step 5 of the steps above.

2.4 GENI Data Stores Responsiveness Check - Fail Criteria and Escalation

If there are CRITICAL services in step 3 above:

  1. Click on the CRITICAL link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the CRITICAL state.
  2. Sort the services by the "Last Check" column values (click on the up (ascending) orange arrow). Make sure that the time stamps are all within the last 15 minutes or so.

If the time stamps are within the accepted range, the services are genuinely in the CRITICAL state. If the time stamps are not within the accepted range, something is amiss in the monitoring system and is preventing timely status updates.

If there are PENDING services in step 3 above:

  1. Click on the PENDING link under the "Service Status Summary" column, which will bring you to the "Service Status Details" for all the services in the PENDING state.

A PENDING state means that the monitoring system has never reported on the responsiveness status of a particular data store.

Escalation: If there are responsiveness services in CRITICAL states: Report to ??? rack teams
Escalation: If there are responsiveness services in PENDING states: Report to UKY team - ???
Escalation: If there are responsiveness services with stale time stamps: Report to UKY team - ???

Note: The response to a CRITICAL is_responsive service check is detailed in the Monitoring System Outage procedure.