Plastic Slices Monitoring Recommendations

Introduction

The goal of monitoring for the Plastic Slices project is to collect status data that can be used to assess the health and utilization of the resources involved in the project. Data will be collected in a central location, in a format that is consistent across aggregates. Once this is done:

  • Experimenters will be able to go to a single place to see comparable information about all of their resources, regardless of which site hosts those resources.
  • Resource admins will be able to write simple tools to detect and alert on error conditions at all participating aggregates.

Site monitoring architecture

The goal of this architecture is to make it simple for sites participating in Plastic Slices to monitor relatively homogeneous resources (MyPLC hosts, MyPLC PlanetLab nodes, and OpenFlow FlowVisors, including Expedient servers) and to submit a common set of data about these resources to a central database hosted by the GENI Meta-Operations Center (GMOC).

To keep things simple, there is no central monitoring server at each campus. Instead, the following things are installed on each node:

  • A metric-collecting cron job or simple daemon that runs once a minute, polls the node for a set of metrics, and stores numeric values for those metrics in RRD files on the system.
  • A reporting cron job that runs every five minutes, aggregates the recently collected data into a data exchange format, and submits it to the GMOC database via HTTP.
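
For example, the collector half of this pair might look like the minimal sketch below, which polls one illustrative metric (the one-minute load average) and appends it to an RRD by shelling out to the rrdtool command-line tool. The metric name, RRD path, and RRD layout here are placeholders for illustration; the actual install scripts define their own.

    #!/usr/bin/env python
    # Minimal sketch of a per-minute metric collector (illustrative only).
    # Assumptions: rrdtool is installed, and the metric name, RRD path, and
    # RRD layout below are examples rather than what the install scripts use.

    import os
    import subprocess

    RRD_PATH = "/var/lib/monitoring/load_avg.rrd"   # hypothetical location

    def ensure_rrd():
        """Create the RRD once: 60-second step, about a week of samples."""
        if not os.path.exists(RRD_PATH):
            subprocess.check_call([
                "rrdtool", "create", RRD_PATH,
                "--step", "60",
                "DS:load_avg:GAUGE:120:0:U",    # one gauge data source
                "RRA:AVERAGE:0.5:1:10080",      # 1-minute averages for 7 days
            ])

    def collect():
        """Poll the node for one numeric value and append it to the RRD."""
        one_minute_load = os.getloadavg()[0]
        subprocess.check_call(["rrdtool", "update", RRD_PATH,
                               "N:%f" % one_minute_load])   # "N" means now

    if __name__ == "__main__":
        ensure_rrd()
        collect()

A crontab entry such as * * * * * /usr/local/bin/collect_metrics.py would run this once a minute; the separate five-minute reporting job then reads the local RRDs and submits the values to the GMOC over HTTP.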

For the exact steps to add monitoring to your site resources, please see:

What is monitored

Each type of node reports the following metrics to the GMOC database. Other metrics may be collected locally, but are not reported to the central database.

On MyPLC server hosts, the reporting cron job sends (one example query is sketched after this list):

  • plc_node_count: How many nodes are attached to the MyPLC instance, and how many are in each state (boot, failboot, reinstall, etc)
  • plc_site_count: How many sites are affiliated with the MyPLC instance (including both local sites you have defined, and remote sites representing slice authorities trusted by this MyPLC instance)
  • plc_session_count: How many web sessions are active in the MyPLC server's UI
  • plc_site_state: For each site, how many nodes, slices, slivers, and users are defined at that site
  • plc_node_state: For each node, how many interfaces and slivers are defined on that node
  • plc_slice_state: For each slice, how many nodes and users are included in the slice
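
For example, the node counts behind plc_node_count can be gathered by querying the MyPLC's PLCAPI XML-RPC interface, roughly as in the sketch below. The URL, account, and password are placeholders, and the real reporting scripts may use different methods or fields.

    #!/usr/bin/env python
    # Sketch: count MyPLC nodes by boot state via the PLCAPI XML-RPC interface.
    # The URL, username, and password are placeholders; the actual reporting
    # scripts may query different methods or fields.

    import xmlrpclib
    from collections import defaultdict

    PLC_API_URL = "https://myplc.example.org/PLCAPI/"    # hypothetical MyPLC
    auth = {
        "AuthMethod": "password",
        "Username": "monitor@example.org",               # hypothetical account
        "AuthString": "secret",                          # hypothetical password
    }

    api = xmlrpclib.ServerProxy(PLC_API_URL, allow_none=True)

    # Fetch only the boot_state field for every node known to this MyPLC.
    nodes = api.GetNodes(auth, {}, ["boot_state"])

    counts = defaultdict(int)
    for node in nodes:
        counts[node["boot_state"]] += 1

    for state, count in sorted(counts.items()):
        print "%s: %d" % (state, count)
    print "total nodes: %d" % len(nodes)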

On MyPLC PlanetLab nodes, the reporting cron job sends (one example is sketched after this list):

  • plnode_state: How many users are logged into the node
  • plnode_sliver_state: For each sliver on the node:
    • What percentage of its allowed CPU, disk, and memory the sliver is using
    • How much total disk space and how many processes the sliver is using
    • The sliver's memory footprint
    • How many minutes the sliver has been alive
  • plnode_sliver_network: For each sliver on the node: How much IPv4 data is the sliver sending and receiving, in packets/s and bytes/s (this includes both control and dataplane traffic)
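
As a concrete example of the simplest of these, the sketch below counts distinct logged-in users, the quantity behind plnode_state. Using the who command for this is an assumption; the per-sliver CPU, disk, memory, and network figures come from node-specific sources handled by the install scripts.

    #!/usr/bin/env python
    # Sketch: count users logged into a PlanetLab node (the plnode_state idea).
    # Parsing the output of "who" is an assumption about how a site might
    # measure this; the real collection scripts may use another source.

    import subprocess

    def logged_in_users():
        """Return the number of distinct users with an active login session."""
        output = subprocess.Popen(["who"],
                                  stdout=subprocess.PIPE).communicate()[0]
        users = set()
        for line in output.splitlines():
            if line.strip():
                users.add(line.split()[0])   # first column is the username
        return len(users)

    if __name__ == "__main__":
        print "users logged in: %d" % logged_in_users()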

On FlowVisor nodes, the reporting cron job sends (one example is sketched after this list):

  • flowvisor_state:
    • How many read-only and read/write flowspace rules are defined in total
    • How many DPIDs (virtual switches) are reporting to the flowvisor
    • How many slices are defined on the flowvisor (including the root slice, so there will always be at least one)
  • flowvisor_dpid_state: For each reporting DPID:
    • On how many switch ports the DPID is reporting to this flowvisor
    • How many read-only and read/write flowspace rules are defined which refer to this DPID (including catchall rules that match any DPID)
    • How many control messages in total are being received and sent by the flowvisor for this DPID (in messages/s)
  • flowvisor_slice_state: For each defined slice:
    • How many read-only and read/write flowspace rules are defined for this slice
    • How many control messages in total are being received and sent by the flowvisor for this slice (in messages/s)
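
Several of the totals above can be derived by querying the FlowVisor itself. The sketch below shells out to fvctl and simply counts output lines; it assumes a FlowVisor of this era whose fvctl provides listDevices, listSlices, and listFlowSpace subcommands and prints one device, slice, or rule per line, which may not match your installation exactly.

    #!/usr/bin/env python
    # Sketch: derive a few flowvisor_state totals by shelling out to fvctl.
    # Assumptions: fvctl offers listDevices, listSlices, and listFlowSpace
    # subcommands, each printing one item per line, and the password file
    # path below is a placeholder.

    import subprocess

    FVCTL = ["fvctl", "--passwd-file=/etc/flowvisor/fvpasswd"]   # hypothetical

    def fvctl_lines(subcommand):
        """Run an fvctl subcommand and return its non-empty output lines."""
        output = subprocess.Popen(FVCTL + [subcommand],
                                  stdout=subprocess.PIPE).communicate()[0]
        return [line for line in output.splitlines() if line.strip()]

    if __name__ == "__main__":
        print "DPIDs reporting:  %d" % len(fvctl_lines("listDevices"))
        print "slices defined:   %d" % len(fvctl_lines("listSlices"))
        print "flowspace rules:  %d" % len(fvctl_lines("listFlowSpace"))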

How to look at the data

Currently, the data is available:

We will edit this section over time, to add usage documentation and links to other interfaces.

Optional things a site can do with these tools

The install processes linked above describe how to use these tools to report exactly the data set described on this page to the GMOC. However, there are additional things you can do with these tools. As time permits, we will document how to do all of them; in the meantime, if you want to try any of them, GPO infra can provide pointers and additional scripts.

Other ways to use the collected data

A simple interface is used to store the collected data in local RRD files on the system. You can use this to:

  • Look at these RRDs by hand
  • Modify the collecting script to report the data to ganglia, or to some other monitoring service at your site, in addition to storing it in the RRDs.

At the GPO, we report our data to ganglia, and can share details and example code.
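
For instance, the snippet below prints the most recent value in one of the local RRDs (looking at the data by hand) and also forwards a value to ganglia with gmetric; the RRD path, metric name, and value are again placeholders.

    #!/usr/bin/env python
    # Sketch: two optional uses of the locally collected data.
    # The RRD path, metric name, and value are placeholders for illustration.

    import subprocess

    RRD_PATH = "/var/lib/monitoring/load_avg.rrd"   # hypothetical, as above

    # 1. Look at the RRD by hand: print the time and value of the last update.
    print subprocess.Popen(["rrdtool", "lastupdate", RRD_PATH],
                           stdout=subprocess.PIPE).communicate()[0]

    # 2. Also report a freshly collected value to ganglia via gmetric,
    #    in addition to storing it in the RRD.
    subprocess.check_call([
        "gmetric",
        "--name", "load_avg",
        "--value", "0.42",      # in practice, the value just collected
        "--type", "float",
        "--units", "load",
    ])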

Reporting additional data

If your site collects additional data that is of interest, you should be able to:

  • Use the provided Python interface to store your data in RRDs
  • Write or generate a config file to send your data to the GMOC's collection
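
As a purely hypothetical illustration of these two steps: a site-specific value could be stored with the same rrdtool pattern shown earlier and then described to the reporting job. The metric name, RRD path, and the shape of the configuration below are all invented for this example; the real config format is defined by the install documentation, and metric names should be coordinated with the GPO first.

    #!/usr/bin/env python
    # Hypothetical example of handling one additional, site-specific metric.
    # Neither the metric name nor the config structure is the real GMOC
    # format; they only illustrate the two steps listed above.

    import subprocess

    RRD_PATH = "/var/lib/monitoring/mysite_custom_temp.rrd"   # hypothetical

    # Step 1: store the value in a local RRD (created ahead of time, as in
    # the collector sketch earlier on this page).
    subprocess.check_call(["rrdtool", "update", RRD_PATH, "N:27.5"])

    # Step 2: describe the metric so the reporting job can submit it to the
    # GMOC; this dict stands in for whatever config file the install docs
    # actually specify.
    custom_metric_config = {
        "metric": "mysite_custom_temp",   # name to coordinate with the GPO
        "rrd": RRD_PATH,
        "type": "GAUGE",
    }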

If you are interested in doing these things, please:

  • Coordinate with the GPO about naming your metrics.
  • Check with the GMOC if you plan to report a substantial amount of additional data, or for advice on working with their API to generate useful SNAPP output.