Version 3 (modified by chaos@bbn.com, 10 years ago)

Data needed to meet operational monitoring use cases

This is a working page for the operational monitoring project. It is a draft representing work in progress.

This page will eventually document the schema or schemas that the datastore polling API will use to request data from datastores. For now, it simply lists the data needed to meet the short list of use cases on which we are focusing for GEC19 and GEC20. As such, this page is closely tied to the use case component details pages at http://www.gpolab.bbn.com/monitoring/components/.

Data needed to meet use case 3

Use case description: Track node compute utilization, interface, and health statistics for shared rack nodes, and allow operators to get notifications when they are out of bounds.

In general, for this use case, we want:

CPU utilization: it's pretty standard for this to be a percentage, so we'll do that too.
Memory utilization: there's not as much of a standard for this. Purely as an alert metric, "swap free" is a good indication of when the node is too busy, but it doesn't tell you much about how the node's memory is being used over time. I believe ganglia reports the difference of two stats from /proc/meminfo, "Active" - "Cached", and calls that "Memory Used". Is that a good/well-understood metric?
Disk utilization: I am partial to ganglia's "part max used" check, which looks at the utilization of every local partition on a node and reports the highest utilization percentage it sees. It doesn't tell you what your problem is, but it tells you whether you have a problem, and it's a single metric regardless of the number of partitions on a node.
Network utilization: in order to measure utilization, I think we want metrics for control traffic and dataplane traffic, each of which is the sum of counters for all control or dataplane interfaces of the node (if there is more than one of either). Linux /proc/net/dev reports rx_bytes, rx_packets, rx_errs, rx_drops, and the same four items for tx. That is 8 counters for each of the two traffic classes, so 16 pieces of data per node. Does that seem right, or does that seem like too much? Another thing I don't know is where in the system we want to translate a number into a rate: is it actually correct to just report these numbers as integers upstream and have the aggregator be responsible for generating a rate, or is it better for a rate to be created locally?
Node availability: this is not intended as a detailed check of whether the node is usable for some particular experimental purpose; that would be out of scope for this use case. It's more like a simple "is this thing on?" check. It would be fine for this to be reported as "OK" if any other data is received from the node at a given time, and "not OK" otherwise, or it would be fine for the aggregate to try to ping the node control plane and report that. This doesn't have to be consistent, and shouldn't be complicated.
Node health metrics: people suggested we might want to alert on RAID failures and on NTP sync issues. I'd like to keep track of those requests, but they're not part of the initial thin thread, so they won't be included here.
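
To make the network item above concrete, here is a minimal sketch of how the 16 counters could be derived from /proc/net/dev-style text, and of what turning two counter samples into a rate would look like if that step were done downstream. It is only an illustration, not a decision on the open questions: the function names, the "ctrl_rx_bytes"-style key names, the sample numbers, and the assignment of eth0 to control and eth1/eth2 to dataplane are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional

# Counter names follow the text above; /proc/net/dev itself labels the column "drop".
FIELDS = ("bytes", "packets", "errs", "drops")

# Illustrative sample in /proc/net/dev layout; the numbers are made up.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 0 2000 20 3 4 0 0 0 0
  eth1:  500  5 0 1 0 0 0 0  700  7 0 0 0 0 0 0
  eth2: 9000 90 0 0 0 0 0 0 8000 80 0 0 0 0 0 0
"""


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/dev-style text into {interface: {counter: value}}."""
    counters: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        # Receive columns 0-3 and transmit columns 8-11 are bytes, packets, errs, drop.
        entry: Dict[str, int] = {}
        for field, rx, tx in zip(FIELDS, values[0:4], values[8:12]):
            entry[f"rx_{field}"] = rx
            entry[f"tx_{field}"] = tx
        counters[name.strip()] = entry
    return counters


def sum_counters(counters: Dict[str, Dict[str, int]],
                 ifaces: Iterable[str], prefix: str) -> Dict[str, int]:
    """Sum each counter over the given interfaces, e.g. prefix='ctrl'."""
    totals = {f"{prefix}_{d}_{f}": 0 for d in ("rx", "tx") for f in FIELDS}
    for iface in ifaces:
        for key, value in counters.get(iface, {}).items():
            totals[f"{prefix}_{key}"] += value
    return totals


def rate(prev: int, curr: int, seconds: float) -> Optional[float]:
    """Per-second rate between two counter samples, or None if it cannot be computed."""
    if seconds <= 0 or curr < prev:
        return None  # bad interval, or the counter was reset between samples
    return (curr - prev) / seconds


# Hypothetical interface assignment: eth0 is control, eth1 and eth2 are dataplane.
parsed = parse_net_dev(SAMPLE)
metrics = {**sum_counters(parsed, ["eth0"], "ctrl"),
           **sum_counters(parsed, ["eth1", "eth2"], "data")}
print(len(metrics), metrics["data_rx_bytes"])
```

Reporting the raw integers keeps the node side trivial, but whoever computes the rate then has to handle counter resets, which is why rate() returns None when the newer sample is smaller than the older one.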

Restating that in tabular form:

Metric type	Metric	Units	Comments
CPU	CPU utilization	percent
Memory	swap free	percent	percent of total swap
Disk	part max used	percent	highest percent utilization of any local partition
Network	ctrl RX bytes	integer	sum of bytes received on all control interfaces since last reset
Network	ctrl TX bytes	integer	sum of bytes transmitted on all control interfaces since last reset
Network	ctrl RX packets	integer	sum of packets received on all control interfaces since last reset
Network	ctrl TX packets	integer	sum of packets transmitted on all control interfaces since last reset
Network	ctrl RX errs	integer	sum of receive errors on all control interfaces since last reset
Network	ctrl TX errs	integer	sum of transmit errors on all control interfaces since last reset
Network	ctrl RX drops	integer	sum of receive drops (how does it know?) on all control interfaces since last reset
Network	ctrl TX drops	integer	sum of transmit drops on all control interfaces since last reset
Network	data RX bytes	integer	sum of bytes received on all dataplane interfaces since last reset
Network	data TX bytes	integer	sum of bytes transmitted on all dataplane interfaces since last reset
Network	data RX packets	integer	sum of packets received on all dataplane interfaces since last reset
Network	data TX packets	integer	sum of packets transmitted on all dataplane interfaces since last reset
Network	data RX errs	integer	sum of receive errors on all dataplane interfaces since last reset
Network	data TX errs	integer	sum of transmit errors on all dataplane interfaces since last reset
Network	data RX drops	integer	sum of receive drops (how does it know?) on all dataplane interfaces since last reset
Network	data TX drops	integer	sum of transmit drops on all dataplane interfaces since last reset
Availability	online	boolean	is the node considered to be online as the result of a simple check at the given time?
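
As a rough illustration of the table above, the rows for one node at one point in time could be carried as a single record. The sketch below is not the schema itself, which is still to be defined. The class name NodeSample, the field names, the "ctrl_rx_bytes"-style counter keys, and the two helper functions are hypothetical and exist only for this example. The helpers show one way the "swap free" and "part max used" percentages could be computed: the first reads the SwapTotal and SwapFree lines of /proc/meminfo-style text, and the second takes used/total sizes per partition from whatever source is available.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Mapping, Optional, Tuple

# The 16 network counter names, e.g. "ctrl_rx_bytes" (this naming is an assumption).
NETWORK_KEYS = frozenset(
    f"{plane}_{direction}_{counter}"
    for plane, direction, counter in product(
        ("ctrl", "data"), ("rx", "tx"), ("bytes", "packets", "errs", "drops")
    )
)


def swap_free_percent(meminfo_text: str) -> Optional[float]:
    """Percent of total swap that is free, from /proc/meminfo-style text."""
    fields: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return None  # no swap configured, so the percentage is undefined
    return 100.0 * fields.get("SwapFree", 0) / total


def max_partition_used_percent(partitions: Mapping[str, Tuple[int, int]]) -> float:
    """Highest percent utilization among partitions given as {mount: (used, total)}."""
    return max(
        (100.0 * used / total for used, total in partitions.values() if total > 0),
        default=0.0,
    )


@dataclass
class NodeSample:
    """The table's metrics for one node at one point in time."""
    cpu_utilization: float       # percent
    swap_free: Optional[float]   # percent of total swap, None if there is no swap
    part_max_used: float         # percent, fullest local partition
    network: Dict[str, int]      # the 16 summed interface counters
    online: bool                 # result of the simple availability check

    def __post_init__(self) -> None:
        # Reject values that cannot be percentages and incomplete counter sets.
        for name in ("cpu_utilization", "swap_free", "part_max_used"):
            value = getattr(self, name)
            if value is not None and not 0.0 <= value <= 100.0:
                raise ValueError(f"{name} must be a percentage, got {value}")
        if set(self.network) != NETWORK_KEYS:
            raise ValueError("network must contain exactly the 16 expected counters")


# Made-up values, only to show the record being filled in.
sample = NodeSample(
    cpu_utilization=12.5,
    swap_free=swap_free_percent("SwapTotal: 2000 kB\nSwapFree: 1500 kB\n"),
    part_max_used=max_partition_used_percent({"/": (50, 100), "/var": (90, 100)}),
    network={key: 0 for key in NETWORK_KEYS},
    online=True,
)
print(sample.swap_free, sample.part_max_used)
```

Allowing swap_free to be None is a choice made for this sketch so that a node without swap is not reported as having 0 percent free, which would look like an alert condition.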
