Changes between Version 2 and Version 3 of OperationalMonitoring/DataForUseCases20


Timestamp: 08/29/14 10:51:22
Author: lnevers@bbn.com

[[PageOutline]]

= Data needed to meet operational monitoring use cases =

''This is a working page for the [wiki:OperationalMonitoring operational monitoring project]. It is a draft representing work in progress.''

This page will eventually document the schema or schemas with which the [wiki:OperationalMonitoring/DatastorePolling datastore polling] will request data from datastores.  Right now, it simply lists the pieces of data which are needed in order to meet the short list of use cases on which we are currently focusing.  As such, this page is closely tied to the use case component details pages at [http://www.gpolab.bbn.com/monitoring/components/].

== Table of all data ==

Types of data:
 * measurement: a time-series value which is collected frequently
 * state: an existing set of relations which may change as a result of an event
 * config: data which is unlikely to change frequently, but should be polled occasionally in case it has changed
|| '''Subject'''                                        || '''Metric'''           || '''Type'''  || '''Units'''    || '''Description'''                                                                      || '''Use Cases''' ||
|| shared compute node                                  || CPU utilization        || measurement || percent        ||                                                                                        || 3               ||
|| shared compute node                                  || swap free              || measurement || percent        || percent of total swap which is free                                                    || 3               ||
|| shared compute node                                  || memory total           || config      || bytes          || total physical memory on the node                                                      || 3               ||
|| shared compute node                                  || memory used            || measurement || bytes          || total memory in active use on the node                                                 || 3               ||
|| shared compute node                                  || disk part max used     || measurement || percent        || highest percent utilization of any local partition                                     || 3               ||
|| shared compute node interface (control or dataplane) || max bytes              || config      || bytes/second   || bytes per second available on the interface                                            || 3               ||
|| shared compute node interface (control or dataplane) || RX bytes               || measurement || bytes/second   || bytes per second received on the interface                                             || 3               ||
|| shared compute node interface (control or dataplane) || TX bytes               || measurement || bytes/second   || bytes per second transmitted on the interface                                          || 3               ||
|| shared compute node interface (control or dataplane) || max packets            || config      || packets/second || packets per second available on the interface                                          || 3               ||
|| shared compute node interface (control or dataplane) || RX packets             || measurement || packets/second || packets per second received on the interface                                           || 3               ||
|| shared compute node interface (control or dataplane) || TX packets             || measurement || packets/second || packets per second transmitted on the interface                                        || 3               ||
|| shared compute node interface (control or dataplane) || RX errs                || measurement || errors/second  || receive errors per second on the interface                                             || 3               ||
|| shared compute node interface (control or dataplane) || TX errs                || measurement || errors/second  || transmit errors per second on the interface                                            || 3               ||
|| shared compute node interface (control or dataplane) || RX drops               || measurement || drops/second   || receive drops per second (how does it know?) on the interface                          || 3               ||
|| shared compute node interface (control or dataplane) || TX drops               || measurement || drops/second   || transmit drops per second on the interface                                             || 3               ||
|| shared compute node                                  || is available           || measurement || boolean        || is the node considered to be online as the result of a simple check at the given time? || 3               ||
|| aggregate                                            || current sliver list    || state       || list           || list of slivers (URN + UUID) currently existing or reserved on the aggregate           || 6               ||
|| sliver                                               || slice URN/UUID         || state       || string         || unique identifier of slice mapped to sliver (URN + UUID)                               || 6               ||
|| sliver                                               || creation time          || state       || timestamp      || creation time of sliver                                                                || 6               ||
|| sliver                                               || expiration time        || state       || timestamp      || current expiration time of sliver                                                      || 6               ||
|| sliver                                               || creator URN            || state       || string         || URN of sliver creator                                                                  || 6               ||
|| sliver                                               || resources              || state       || list           || list of resource URNs on which the sliver has a current reservation                    || 6               ||
|| slice                                                || creator                || state       || string         || URN of slice creator                                                                   || 6               ||
|| slice                                                || participants           || state       || list           || list of experimenters who have privileges on a slice                                   || 6               ||
|| experimenter                                         || email                  || state       || string         || contact address for experimenter                                                       || 6               ||
|| config datastore                                     || current datastore list || config      || list           || list of local datastores to query for GENI monitoring data                             || 3, 6            ||
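
As a concrete illustration, a single ''measurement'' row from this table might be polled as something like the sketch below.  The tuple layout, metric name, and node URN are all placeholders; the actual schema is exactly what this page has yet to define.

{{{#!python
# Hypothetical shape for a single polled "measurement" data point.
# Field names and the node URN are placeholders, not a proposed schema.
from collections import namedtuple
import time

DataPoint = namedtuple("DataPoint", ["subject", "metric", "ts", "value"])

# One sample of the first table row: CPU utilization of a shared node.
sample = DataPoint(
    subject="urn:publicid:IDN+example.net+node+pc1",  # made-up node URN
    metric="cpu_util",            # type: measurement, units: percent
    ts=int(time.time()),          # collection time
    value=12.5,
)
print(sample)
}}}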

== Details of data needed to meet all use cases ==

A source of configuration data about operational monitoring is needed to tell collectors where to look for local datastores.  For now, we think the data needed is (a sketch follows the list):
 * For aggregates: the URN of the aggregate, the URL of the datastore about the aggregate, and the type of the aggregate
 * For authorities: the URN of the authority and the URL of the datastore about the authority
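
For illustration only, a collector's view of that configuration data might look like the sketch below.  Every URN, URL, field name, and aggregate type here is made up; this is not a proposed format.

{{{#!python
# Illustrative only: the configuration data a collector needs in order
# to find local datastores.  Every value here is a placeholder.
AGGREGATES = [
    {"urn": "urn:publicid:IDN+example.net+authority+am",       # aggregate URN
     "href": "https://datastore.example.net/info/aggregate/x",  # its datastore
     "type": "example-rack"},                                   # aggregate type
]
AUTHORITIES = [
    {"urn": "urn:publicid:IDN+example.net+authority+sa",        # authority URN
     "href": "https://datastore.example.net/info/authority/x"},
]

def datastores_to_poll():
    """Every local datastore URL a collector should query."""
    return [entry["href"] for entry in AGGREGATES + AUTHORITIES]

print(datastores_to_poll())
}}}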

=== Data needed to meet use case 3 ===

Use case description: Track node compute utilization, interface, and health statistics for shared rack nodes, and allow operators to get notifications when they are out of bounds.
 * [http://www.gpolab.bbn.com/monitoring/components/use_case_03.html Proposed components for this use case]

In general, for this use case, we want:
 * '''CPU utilization''': it's pretty standard for this to be a percentage, so we'll do that too.
 * '''Memory utilization''': there's not as much of a standard for this.  Purely as an alert metric, "swap free" is a good indication of when the node is too busy, but it doesn't tell you much about whether the node's memory is active over time.  I believe ganglia reports the difference of two stats from /proc/meminfo, "Active" - "Cached", and calls that "Memory Used".  Is that a good/well-understood metric?
 * '''Disk utilization''': I am partial to ganglia's "part max used" check, which looks at the utilization of all local partitions on a node and reports the fullest (highest) utilization percent it sees.  It doesn't tell you what your problem is, but it tells you whether you have a problem, and it's a single metric regardless of the number of partitions on a node.
 * '''Network utilization''': in order to measure utilization, I think we want metrics for control traffic and dataplane traffic, each of which is the sum of counters for all control or dataplane interfaces of the node (if there is more than one of either).  Linux /proc/net/dev reports rx_bytes, rx_packets, rx_errs, and rx_drops, plus the same four items for tx, so that would be 16 pieces of data per node.  Does that seem right, or does that seem like too much?  Another open question is where in the system to translate a counter into a rate: is it correct to just report these numbers as integers upstream and make the collector responsible for generating a rate, or is it better for the rate to be computed locally?  (A sketch of this check and the disk check appears after this list.)
 * '''Node availability''': this is ''not'' intended as a detailed check of whether the node is usable for some particular experimental purpose --- that would be out of scope for this use case.  It's more like a simple "is this thing on?" check.  It would be fine for this to be reported as "OK" if any other data is received from the node at a given time and "not OK" otherwise, or for the aggregate to try to ping the node control plane and report that.  This doesn't have to be consistent, and shouldn't be complicated.
 * '''Node health metrics''': people suggested we might want to alert on RAID failures and on NTP sync issues.  I'd like to keep track of those requests, but they're not part of the initial thin thread, so they won't be included here.
 * We probably also need some form of metadata about each node: not collected all the time, but available for periodic query.  For instance, we probably need to know what type of VM server it is (for general information), and what the maximum values are for any metrics we're reporting as rates or counters (e.g. network utilization) rather than as percentages, because we can't tell whether we're hitting the maximum if we don't know what the maximum is.
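
Below is a minimal, Linux-only sketch of the disk and network checks discussed above.  It assumes counters are turned into per-second rates on the node by differencing two samples; whether that happens locally or at the collector is exactly the open question raised above, and all function and field choices here are illustrative.

{{{#!python
# Linux-only sketch of two checks discussed above.  Metric layout and
# the choice to compute rates locally are assumptions, not decisions.
import os
import time

def part_max_used(mounts=("/",)):
    """Highest percent-full value across the given local partitions,
    in the spirit of ganglia's "part max used" metric."""
    worst = 0.0
    for m in mounts:
        st = os.statvfs(m)
        if st.f_blocks:
            worst = max(worst, 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks)
    return worst

def read_counters(iface):
    """Raw (rx_bytes, rx_packets, rx_errs, rx_drops,
            tx_bytes, tx_packets, tx_errs, tx_drops) for one interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                v = [int(x) for x in rest.split()]
                # /proc/net/dev: fields 0-3 are rx bytes/packets/errs/drop;
                # fields 8-11 are the same four for tx.
                return v[0:4] + v[8:12]
    raise ValueError("no such interface: %s" % iface)

def rates(iface, interval=5.0):
    """Per-second rates for the eight counters, by differencing samples."""
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return [(b - a) / interval for a, b in zip(before, after)]

print(part_max_used())
print(rates("eth0"))  # 8 of the 16 per-node values (repeat per interface)
}}}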

=== Data needed to meet use case 6 ===

Use case description: Find out what slivers will be affected by a maintenance or outage of some resource, and get contact information for the owners of those slivers so targeted notifications can be sent.
 * [http://www.gpolab.bbn.com/monitoring/components/use_case_06.html Proposed components for this use case]

In general, for this use case, we want:
 * '''Sliver data''':
   * '''What slivers exist on a GENI aggregate right now:''' I think we always want "right now" even if the outage isn't going to happen right now --- if reservations are implemented, and thus there's a notion of known slivers that will exist in the future but don't exist yet, we'll want those too.  A reporting tool might choose to omit slivers which expire before the time of interest, or might keep them on the grounds that slivers often get renewed --- that should be up to the tool, so the aggregate should always report every sliver the AM knows about, now or in the future.
   * '''Information about each sliver''':
     * Sliver URN and UUID
     * Slice URN and UUID
     * Creation and expiration times
     * Creator (maybe this is optional because some AMs will always tell us to ask the SA?  not sure)
     * Resources this sliver has reserved:
       * URN of each named resource of types: bare-metal host, shared host, VLAN, flowspace (what else?)
 * '''Slice experimenter data''': for each relevant slice URN and UUID, find out from the authority (the combined lookup is sketched after this list):
   * Experimenters affiliated with the slice (creator, participants)
   * E-mail contact info for each of those experimenters

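The sketch below shows how a notification tool might join these pieces of data for an outage on a single resource.  The stub tables stand in for queries to the aggregate datastore and the slice authority, whose formats are not defined here; every URN and address is made up.

{{{#!python
# Hypothetical sketch of the use case 6 workflow.  The three stub
# tables stand in for datastore queries whose schemas are not yet
# defined on this page; all values are invented.
SLIVERS = [                                    # from the aggregate datastore
    {"sliver_urn": "urn:publicid:IDN+example.net+sliver+s1",
     "slice_urn": "urn:publicid:IDN+example.net+slice+demo",
     "resources": ["urn:publicid:IDN+example.net+node+pc1"]},
]
SLICE_MEMBERS = {                              # from the slice authority
    "urn:publicid:IDN+example.net+slice+demo":
        ["urn:publicid:IDN+example.net+user+alice"],
}
EMAILS = {                                     # from the slice authority
    "urn:publicid:IDN+example.net+user+alice": "alice@example.net",
}

def contacts_for_outage(resource_urn):
    """Collect contact addresses for everyone with privileges on a slice
    whose sliver holds a reservation on the affected resource."""
    emails = set()
    for sliver in SLIVERS:
        if resource_urn in sliver["resources"]:
            for member in SLICE_MEMBERS[sliver["slice_urn"]]:
                emails.add(EMAILS[member])
    return emails

print(contacts_for_outage("urn:publicid:IDN+example.net+node+pc1"))
}}}
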
=== Data about links and VLANs ===

Any VLAN usage within a rack, between a rack and a network aggregate, or traversing a network aggregate needs to be reported in order to monitor the GENI data plane.  The [wiki:OperationalMonitoring/DataSchema20#Linkcallandresponse link schema] describes each link as a collection of [wiki:OperationalMonitoring/DataSchema20#Interface-VLANcallandresponse interface-vlan] points in the GENI data plane.  Rack and network aggregates are required to provide the endpoints of these interface-vlans.
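
As a purely hypothetical illustration (the authoritative format is the link schema referenced above), an aggregate might report interface-vlan endpoints along these lines; the URNs and VLAN ID are invented.

{{{#!python
# Hypothetical shape for interface-vlan endpoint reports; the
# authoritative definitions live in the link schema referenced above.
ENDPOINTS = [
    {"link": "urn:publicid:IDN+example.net+link+rack-to-core",
     "interface": "urn:publicid:IDN+example.net+interface+sw1:port0",
     "vlan": 1750},
    {"link": "urn:publicid:IDN+example.net+link+rack-to-core",
     "interface": "urn:publicid:IDN+core.example.net+interface+sw9:port3",
     "vlan": 1750},
]

def endpoints_for_vlan(vlan_id):
    """Both ends of every reported link that carries the given VLAN."""
    return [e for e in ENDPOINTS if e["vlan"] == vlan_id]

print(endpoints_for_vlan(1750))
}}}
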
Moved to http://groups.geni.net/geni/wiki/OperationalMonitoring/DataForUseCases