Changes between Version 7 and Version 8 of OperationalMonitoring/DataSchema


Timestamp:
01/21/14 16:50:32 (10 years ago)
Author:
chaos@bbn.com
Comment:

--

  • OperationalMonitoring/DataSchema

    v7 v8  
    66
    77This page will eventually document the schema or schemas which the [wiki:OperationalMonitoring/DatastorePollingApi datastore polling API] will use to request data from datastores.  Right now, it simply lists the pieces of data which are needed in order to meet the short list of use cases on which we are focusing for GEC19 and GEC20.  As such, this page is closely tied to the use case component details pages at [http://www.gpolab.bbn.com/monitoring/components/].
     8
     9== Table of all data ==
     10
     11Types of data:
     12 * measurement: a time-series value which is collected frequently
     13 * state: an existing set of relations which may change as a result of an event
     14 * config: data which is unlikely to change frequently, but should be polled occasionally in case it has changed
     15
     16|| '''Subject'''       || '''Metric'''         || '''Type'''  || '''Units''' || '''Description'''                                                                      || '''Use Cases''' ||
     17|| shared compute node || CPU utilization      || measurement || percent     ||                                                                                        || 3               ||
     18|| shared compute node || swap free            || measurement || percent     || percent of total swap which is free                                                    || 3               ||
     19|| shared compute node || memory total         || config      || bytes       || total physical memory on the node                                                      || 3               ||
     20|| shared compute node || memory used          || measurement || bytes       || total memory in active use on the node                                                 || 3               ||
     21|| shared compute node || disk part max used   || measurement || percent     || highest percent utilization of any local partition                                     || 3               ||
     22|| shared compute node || ctrl net max bytes   || config      || integer     || sum of maximum bytes per second available on all control interfaces                    || 3               ||
     23|| shared compute node || ctrl net RX bytes    || measurement || integer     || sum of bytes received on all control interfaces since last reset                       || 3               ||
     24|| shared compute node || ctrl net TX bytes    || measurement || integer     || sum of bytes transmitted on all control interfaces since last reset                    || 3               ||
     25|| shared compute node || ctrl net max packets || config      || integer     || sum of maximum packets per second available on all control interfaces                  || 3               ||
     26|| shared compute node || ctrl net RX packets  || measurement || integer     || sum of packets received on all control interfaces since last reset                     || 3               ||
     27|| shared compute node || ctrl net TX packets  || measurement || integer     || sum of packets transmitted on all control interfaces since last reset                  || 3               ||
     28|| shared compute node || ctrl net RX errs     || measurement || integer     || sum of receive errors on all control interfaces since last reset                       || 3               ||
     29|| shared compute node || ctrl net TX errs     || measurement || integer     || sum of transmit errors on all control interfaces since last reset                      || 3               ||
     30|| shared compute node || ctrl net RX drops    || measurement || integer     || sum of receive drops (how does it know?) on all control interfaces since last reset    || 3               ||
     31|| shared compute node || ctrl net TX drops    || measurement || integer     || sum of transmit drops on all control interfaces since last reset                       || 3               ||
     32|| shared compute node || data net max bytes   || config      || integer     || sum of maximum bytes per second available on all dataplane interfaces                  || 3               ||
     33|| shared compute node || data net RX bytes    || measurement || integer     || sum of bytes received on all dataplane interfaces since last reset                     || 3               ||
     34|| shared compute node || data net TX bytes    || measurement || integer     || sum of bytes transmitted on all dataplane interfaces since last reset                  || 3               ||
     35|| shared compute node || data net max packets || config      || integer     || sum of maximum packets per second available on all dataplane interfaces                || 3               ||
     36|| shared compute node || data net RX packets  || measurement || integer     || sum of packets received on all dataplane interfaces since last reset                   || 3               ||
     37|| shared compute node || data net TX packets  || measurement || integer     || sum of packets transmitted on all dataplane interfaces since last reset                || 3               ||
     38|| shared compute node || data net RX errs     || measurement || integer     || sum of receive errors on all dataplane interfaces since last reset                     || 3               ||
     39|| shared compute node || data net TX errs     || measurement || integer     || sum of transmit errors on all dataplane interfaces since last reset                    || 3               ||
     40|| shared compute node || data net RX drops    || measurement || integer     || sum of receive drops (how does it know?) on all dataplane interfaces since last reset  || 3               ||
     41|| shared compute node || data net TX drops    || measurement || integer     || sum of transmit drops on all dataplane interfaces since last reset                     || 3               ||
     42|| shared compute node || is available         || measurement || boolean     || is the node considered to be online as the result of a simple check at the given time? || 3               ||
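
As a rough illustration of how the table above might be carried around in code, here is a minimal sketch in Python. It is not the polling API's actual schema (that has not been defined yet). The names `DataType`, `MetricDef`, `METRICS` and `metrics_of_type` are hypothetical, and only a handful of rows are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    """The three types of data listed above the table."""
    MEASUREMENT = "measurement"  # time-series value collected frequently
    STATE = "state"              # set of relations which may change on an event
    CONFIG = "config"            # rarely changes, polled occasionally


@dataclass(frozen=True)
class MetricDef:
    """One table row (hypothetical structure, not a defined schema)."""
    subject: str
    metric: str
    data_type: DataType
    units: str
    use_cases: tuple


# A few rows copied from the table; the remaining rows follow the same pattern.
METRICS = [
    MetricDef("shared compute node", "CPU utilization", DataType.MEASUREMENT, "percent", (3,)),
    MetricDef("shared compute node", "memory total", DataType.CONFIG, "bytes", (3,)),
    MetricDef("shared compute node", "ctrl net max bytes", DataType.CONFIG, "integer", (3,)),
    MetricDef("shared compute node", "ctrl net RX bytes", DataType.MEASUREMENT, "integer", (3,)),
    MetricDef("shared compute node", "is available", DataType.MEASUREMENT, "boolean", (3,)),
]


def metrics_of_type(data_type):
    """Return the metric names of the given data type, in table order."""
    return [m.metric for m in METRICS if m.data_type is data_type]
```

For example, `metrics_of_type(DataType.CONFIG)` picks out the rows that only need occasional polling, which is one reason to record the type alongside each metric.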
    843
    944== Data needed to meet use case 3 ==
     
    2055 * '''Node health metrics''': people suggested we might want to alert on RAID failures and on NTP sync issues.  I'd like to keep track of those requests, but they're not part of the initial thin thread, so they won't be included here.
    2156 * We probably also need some form of metadata about each node: not collected all the time, but available for periodic query.  For instance, we probably need to know what type of VM server it is (for general information), and what the maximum values are for any metrics we're reporting as rates or counters (e.g. network utilization) rather than as percentages, because we can't tell if we're hitting the maximum if we don't know what the maximum is.
    22 
    23 Restating that in tabular form:
    24 
    25 || '''Metric type''' || '''Metric'''    || '''Units''' || '''Comments'''                                                                         ||
    26 || CPU               || CPU utilization || percent     ||                                                                                        ||
    27 || Memory            || swap free       || percent     || percent of total swap which is free                                                    ||
    28 || Memory            || memory used     || bytes       || total memory in active use on the node                                                 ||
    29 || Disk              || part max used   || percent     || highest percent utilization of any local partition                                     ||
    30 || Network           || ctrl RX bytes   || integer     || sum of bytes received on all control interfaces since last reset                       ||
    31 || Network           || ctrl TX bytes   || integer     || sum of bytes transmitted on all control interfaces since last reset                    ||
    32 || Network           || ctrl RX packets || integer     || sum of packets received on all control interfaces since last reset                     ||
    33 || Network           || ctrl TX packets || integer     || sum of packets transmitted on all control interfaces since last reset                  ||
    34 || Network           || ctrl RX errs    || integer     || sum of receive errors on all control interfaces since last reset                       ||
    35 || Network           || ctrl TX errs    || integer     || sum of transmit errors on all control interfaces since last reset                      ||
    36 || Network           || ctrl RX drops   || integer     || sum of receive drops (how does it know?) on all control interfaces since last reset    ||
    37 || Network           || ctrl TX drops   || integer     || sum of transmit drops on all control interfaces since last reset                       ||
    38 || Network           || data RX bytes   || integer     || sum of bytes received on all dataplane interfaces since last reset                     ||
    39 || Network           || data TX bytes   || integer     || sum of bytes transmitted on all dataplane interfaces since last reset                  ||
    40 || Network           || data RX packets || integer     || sum of packets received on all dataplane interfaces since last reset                   ||
    41 || Network           || data TX packets || integer     || sum of packets transmitted on all dataplane interfaces since last reset                ||
    42 || Network           || data RX errs    || integer     || sum of receive errors on all dataplane interfaces since last reset                     ||
    43 || Network           || data TX errs    || integer     || sum of transmit errors on all dataplane interfaces since last reset                    ||
    44 || Network           || data RX drops   || integer     || sum of receive drops (how does it know?) on all dataplane interfaces since last reset  ||
    45 || Network           || data TX drops   || integer     || sum of transmit drops on all dataplane interfaces since last reset                     ||
    46 || Availability      || online          || boolean     || is the node considered to be online as the result of a simple check at the given time? ||
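
To illustrate the point above about needing maximum values for counter metrics, here is a minimal sketch in Python of turning two samples of a "since last reset" counter into a percent of the configured maximum. The function name `utilization_percent` and its parameters are hypothetical, and treating a decreasing counter as a reset is an assumption, not something this page specifies.

```python
def utilization_percent(prev_count, curr_count, interval_seconds, max_per_second):
    """Percent of the maximum rate used between two counter samples.

    prev_count and curr_count are two readings of a cumulative counter
    (e.g. bytes received since last reset), taken interval_seconds apart.
    max_per_second is the configured maximum rate for the same counter.
    Returns None if the counter went backwards, which this sketch assumes
    means the counter was reset between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    if curr_count < prev_count:
        return None  # assumed reset; no meaningful rate for this interval
    rate = (curr_count - prev_count) / interval_seconds
    return 100.0 * rate / max_per_second
```

Without the maximum, the same two samples only yield a raw rate, which can't be compared against an alerting threshold expressed as a percentage.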
    4757
    4858== Data needed to meet use case 6 ==