[[PageOutline]]

''This is a draft of the page which will become [ggw:PlasticSlices/MonitoringRecommendations].''

= Plastic Slices Monitoring Recommendations =

== Introduction ==

The goal of monitoring for the Plastic Slices project is to collect status data that can be used to assess the health and utilization of the resources involved in the project.  Data will be collected in a central location, in a format which is consistent across aggregates.  Once this is done:
 * Experimenters will be able to go to a single place to see comparable information about all of their resources, regardless of which site contains those resources.
 * Resource admins will be able to write simple tools to detect and alert on error conditions at all participating aggregates.

== Site monitoring architecture ==

The goal of this architecture is to make it simple for sites participating in Plastic Slices to monitor relatively homogeneous resources --- [wiki:MyPlc MyPLC] hosts, MyPLC PlanetLab nodes, and OpenFlow !FlowVisors (including Expedient servers) --- and submit a common set of data about these resources to a central database hosted by [wiki:GENIMetaOps the GENI Meta-Operations Center].

To keep things simple, there is no central monitoring server at each campus.  Instead, the following things are installed on each node (a minimal sketch of the collection side appears after this list):
 * A metric-collecting cron job or simple daemon which runs once a minute, polls the node for a set of metrics, and stores numeric values for these metrics in RRD files on the system.
 * A reporting cron job which runs every five minutes, aggregates the recently collected data into a [http://gmoc.grnoc.iu.edu/gmoc/index/documents/gmoc-data-exchange-format--initial-document-describing-native-gmoc-data-exchange-format data exchange format], and submits it to the GMOC database via HTTP.
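
The procedures linked below install their own collection and reporting scripts; the following is only a minimal sketch of the collection half of that pattern, written against Python 2 (the version nodes of this vintage typically run).  The RRD path and the metric (a count of logged-in users) are hypothetical stand-ins, and it assumes the `rrdtool` command-line utility is installed.

{{{
#!python
# Minimal sketch of a once-a-minute collection job.  The RRD path and the
# metric (logged-in users) are hypothetical; the real scripts collect the
# metrics listed later on this page.  Assumes the rrdtool CLI is installed.

import os
import subprocess
import time

RRD = "/var/lib/monitoring/example_metric.rrd"   # hypothetical location

def ensure_rrd():
    """Create the RRD on first run: 60-second step, keep one day of samples."""
    if not os.path.exists(RRD):
        subprocess.check_call([
            "rrdtool", "create", RRD, "--step", "60",
            "DS:value:GAUGE:120:0:U",
            "RRA:AVERAGE:0.5:1:1440",
        ])

def collect():
    """Poll the node for one numeric metric: logged-in users, via 'who'."""
    output = subprocess.check_output(["who"]).decode()
    return len([line for line in output.splitlines() if line.strip()])

if __name__ == "__main__":
    ensure_rrd()
    # Store the sample against the current timestamp.
    subprocess.check_call(
        ["rrdtool", "update", RRD, "%d:%d" % (int(time.time()), collect())])
}}}

The reporting job then reads the last five minutes out of each RRD (for example with `rrdtool fetch`), wraps the values in the data exchange format, and submits them to the GMOC over HTTP.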

For the exact steps to add monitoring to your site resources:
 * For MyPLC server hosts, see [wiki:ChaosSandbox/PlasticSlicesMonitoringRecommendations/MyplcConfiguration]
 * For MyPLC !PlanetLab nodes, see [wiki:ChaosSandbox/PlasticSlicesMonitoringRecommendations/PlnodeConfiguration]
 * For !OpenFlow !FlowVisor, see [wiki:ChaosSandbox/PlasticSlicesMonitoringRecommendations/FlowvisorConfiguration]

== What is monitored ==

Each type of node reports the following metrics to the GMOC database.  Other metrics may be collected locally, but are not reported to the central database.

On MyPLC server hosts, the reporting cron job sends the following (a collection sketch follows this list):
 * `plc_node_count`: How many nodes are attached to the MyPLC instance, and how many are in each state (boot, failboot, reinstall, etc.)
 * `plc_site_count`: How many sites are affiliated with the MyPLC instance (including both local sites you have defined, and remote sites representing slice authorities trusted by this MyPLC instance)
 * `plc_session_count`: How many web sessions are active in the MyPLC server's UI
 * `plc_site_state`: ''For each site'', how many nodes, slices, slivers, and users are defined at that site
 * `plc_node_state`: ''For each node'', how many interfaces and slivers are defined on that node
 * `plc_slice_state`: ''For each slice'', how many nodes and users are included in the slice
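
To make the server-host metrics concrete, here is a sketch of how a count like `plc_node_count` could be derived from a MyPLC instance's XML-RPC API (PLCAPI).  The URL, credentials, and use of password authentication are assumptions that must be adapted to your installation; the installed reporting scripts are the authoritative implementation.

{{{
#!python
# Sketch (Python 2): count nodes per boot state via the MyPLC XML-RPC API.
# The URL and credentials below are hypothetical placeholders.

import xmlrpclib
from collections import defaultdict

PLC_URL = "https://myplc.example.net/PLCAPI/"     # hypothetical
auth = {"AuthMethod": "password",
        "Username": "admin@example.net",          # hypothetical
        "AuthString": "secret"}                   # hypothetical

plc = xmlrpclib.ServerProxy(PLC_URL, allow_none=True)

# Fetch only the boot_state field for every node known to this MyPLC.
counts = defaultdict(int)
for node in plc.GetNodes(auth, {}, ["boot_state"]):
    counts[node["boot_state"]] += 1

print("plc_node_count total=%d per_state=%s"
      % (sum(counts.values()), dict(counts)))
}}}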

On MyPLC !PlanetLab nodes, the reporting cron job sends the following (a rough collection sketch follows this list):
 * `plnode_state`: How many users are logged into the node
 * `plnode_sliver_state`: ''For each sliver on the node'':
   * What percentage of allowed CPU, disk, and memory is that sliver using
   * How much total disk space and how many processes is the sliver using
   * What is the sliver's memory footprint
   * How many minutes has the sliver been alive
 * `plnode_sliver_network`: ''For each sliver on the node'': How much IPv4 data is the sliver sending and receiving, in packets/s and bytes/s (this includes both control and dataplane traffic)
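
As a rough illustration of per-sliver collection, the sketch below counts processes per local account, on the assumption that each sliver corresponds to an account named after its slice and that sliver processes are visible from the root context (whether they are depends on the node's vserver configuration).  The installed collector gathers the fuller set of metrics listed above.

{{{
#!python
# Sketch (Python 2): approximate per-sliver process counts by grouping
# processes by owning account.  The list of system accounts to skip is
# incomplete, and the visibility of sliver processes is an assumption.

import subprocess
from collections import defaultdict

SYSTEM_USERS = set(["root", "daemon", "nobody", "apache", "postfix"])

output = subprocess.check_output(["ps", "-eo", "user="]).decode()

procs_per_sliver = defaultdict(int)
for user in output.split():
    if user not in SYSTEM_USERS:
        procs_per_sliver[user] += 1

for sliver, count in sorted(procs_per_sliver.items()):
    print("plnode_sliver_state %s processes=%d" % (sliver, count))
}}}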

On !FlowVisor nodes, the reporting cron job sends the following (a rough collection sketch follows this list):
 * `flowvisor_state`:
   * How many read-only and read/write flowspace rules are defined in total
   * How many DPIDs (virtual switches) are reporting to the flowvisor
   * How many slices are defined on the flowvisor (including the `root` slice, so there will always be at least one)
 * `flowvisor_dpid_state`: ''For each reporting DPID'':
   * On how many switch ports is the DPID reporting to this flowvisor
   * How many read-only and read/write flowspace rules are defined which refer to this DPID (including catchall rules that match any DPID)
   * How many control messages in total the flowvisor is receiving and sending for this DPID (in messages/s)
 * `flowvisor_slice_state`: ''For each defined slice'':
   * How many read-only and read/write flowspace rules are defined for this slice
   * How many control messages in total the flowvisor is receiving and sending for this slice (in messages/s)
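
The sketch below shows one way the top-level `flowvisor_state` counts could be approximated by parsing `fvctl` output.  The subcommand names (`listDevices`, `listSlices`, `listFlowSpace`) and the password-file option vary between !FlowVisor versions, and counting non-blank output lines is only an approximation, so treat this purely as an illustration of the idea.

{{{
#!python
# Sketch (Python 2): rough flowvisor_state counts from fvctl output.
# Subcommand names, the --passwd-file option, and the file path are
# assumptions; real fvctl output may include headers that skew the counts.

import subprocess

FVCTL = ["fvctl", "--passwd-file=/etc/flowvisor/fvpasswd"]   # hypothetical

def fvctl_lines(subcommand):
    """Run one fvctl subcommand and return its non-blank output lines."""
    out = subprocess.check_output(FVCTL + [subcommand]).decode()
    return [line for line in out.splitlines() if line.strip()]

dpids     = fvctl_lines("listDevices")
slices    = fvctl_lines("listSlices")
flowspace = fvctl_lines("listFlowSpace")

print("flowvisor_state dpids=%d slices=%d flowspace_rules=%d"
      % (len(dpids), len(slices), len(flowspace)))
}}}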

== How to look at the data ==

Currently, the data is available (a download sketch follows this list):
 * For real-time viewing via the GMOC's web-based UI at [http://gmoc-dev.grnoc.iu.edu/api-demo/data/]
 * For download in the GMOC data exchange format at [http://gmoc-dev.grnoc.iu.edu/api-demo/gen_api.pl] (contains the past 10 minutes of data)
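
For example, the download URL can be fetched from a script or a cron job; the sketch below saves the most recent data to a local file.  The output filename (and its extension) is only a guess about the document type.

{{{
#!python
# Sketch (Python 2): fetch the last 10 minutes of data in the GMOC data
# exchange format and save it locally.

import urllib2

URL = "http://gmoc-dev.grnoc.iu.edu/api-demo/gen_api.pl"

response = urllib2.urlopen(URL, timeout=30)
data = response.read()

# The exchange format is assumed to be an XML document; adjust as needed.
out = open("gmoc-recent-data.xml", "w")
out.write(data)
out.close()
}}}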

We will edit this section over time to add usage documentation and links to other interfaces.

== Optional things a site can do with these tools ==

The install processes linked above describe how to use these tools to report exactly the data set described on this page to the GMOC.  However, there are additional things you can do with these tools.  As time permits, we will document how to do all of these things; in the meantime, if you want to do any of them, GPO infra can provide pointers and/or additional scripts.

=== Other ways to use the collected data ===

A simple interface is used to report data into local RRD files stored on the system.  You can use this to:
 * Look at these RRDs by hand (see the sketch after this list)
 * Modify the collecting script to report the data to ganglia, or to some other monitoring service at your site, in addition to storing them in the RRDs.
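
Here is a minimal sketch of looking at one of the local RRDs by hand, using the same hypothetical RRD path as the collection sketch earlier on this page; it assumes the `rrdtool` command-line utility is installed.

{{{
#!python
# Sketch (Python 2): dump the last ten minutes of samples from a local RRD.
# The path is hypothetical; use the files your collector actually creates.

import subprocess

RRD = "/var/lib/monitoring/example_metric.rrd"   # hypothetical path

# AVERAGE is the consolidation function used in the collection sketch above.
print(subprocess.check_output(
    ["rrdtool", "fetch", RRD, "AVERAGE", "--start", "-600"]).decode())
}}}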

At GPO, we report our data into ganglia and can share details and example code.

=== Reporting additional data ===

If your site is collecting additional data that is of interest, you should be able to:
 * Use the provided python interface to store your data in RRDs
 * Write or generate a config file to send your data to the GMOC's collection

If you are interested in doing these things, please:
 * Coordinate with the GPO about naming your metrics.
 * Check with the GMOC if you plan to report a substantial amount of additional data, or for advice on working with their API to generate useful SNAPP output.