Changes between Version 4 and Version 5 of OperationalMonitoring/DataForUseCases


Timestamp: 03/17/14 14:24:06
Author: rirwin@bbn.com
Comment: --

  • OperationalMonitoring/DataForUseCases

    v4 v5  
    44 44  == Details of data needed to meet all use cases ==
    45 45
    46     A source of configuration data about operational monitoring is needed to tell aggregators where to look for local datastores.  For now, we think the data needed is:
       46  A source of configuration data about operational monitoring is needed to tell collectors where to look for local datastores.  For now, we think the data needed is:
    47 47  * For aggregates, URN of the aggregate, URL of datastore about the aggregate, type of aggregate
    48 48  * For authorities, URN of the authority, URL of datastore about the authority
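A minimal sketch of how a collector might represent the two configuration record types listed above, assuming a simple key/value layout; the field names, URNs, and URLs below are illustrative placeholders, not a defined schema:

{{{#!python
# Hypothetical configuration records telling a collector where to find
# local datastores.  All names and URLs below are made-up examples.
CONFIG = {
    "aggregates": [
        {"urn": "urn:publicid:IDN+example-am+authority+am",    # URN of the aggregate
         "datastore_url": "https://datastore.example.net/am",  # URL of its datastore
         "type": "example-aggregate-type"},                    # type of aggregate
    ],
    "authorities": [
        {"urn": "urn:publicid:IDN+example-ch+authority+sa",    # URN of the authority
         "datastore_url": "https://datastore.example.net/sa"},
    ],
}

for record in CONFIG["aggregates"] + CONFIG["authorities"]:
    print("poll %s for data about %s" % (record["datastore_url"], record["urn"]))
}}}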
     
    57 57  * '''Memory utilization''': there's not as much of a standard for this.  Purely as an alert metric, "swap free" is a good indication of when the node is too busy.  That doesn't tell you much about whether the node's memory is active over time.  I believe ganglia reports the difference of two stats from /proc/meminfo, "Active" - "Cached", and calls that "Memory Used".  Is that a good/well-understood metric?  (See the first sketch after this list.)
    58 58  * '''Disk utilization''': I am partial to ganglia's "part max used" check, which looks at the utilization of all local partitions on a node and reports the fullest (highest) utilization percent it sees.  It doesn't tell you what your problem is, but it tells you if you have a problem, and it's a single metric regardless of the number of partitions on a node.  (See the second sketch after this list.)
    59     * '''Network utilization''': in order to measure utilization, I think we want metrics for control traffic and dataplane traffic, each of which is the sum of counters for all control or dataplane interfaces of the node (if there is more than one of either).  Linux /proc/net/dev reports rx_bytes, rx_packets, rx_errs, rx_drops, and the same four items for tx.  So that would be 16 pieces of data per node.  Does that seem right, or does that seem like too much?  Another thing I don't know is where in the system we want to translate a number into a rate --- is it actually correct to just report these numbers as integers upstream, and have the aggregator be responsible for generating a rate, or is it better for a rate to be created locally?  (See the third sketch after this list.)
       59  * '''Network utilization''': in order to measure utilization, I think we want metrics for control traffic and dataplane traffic, each of which is the sum of counters for all control or dataplane interfaces of the node (if there is more than one of either).  Linux /proc/net/dev reports rx_bytes, rx_packets, rx_errs, rx_drops, and the same four items for tx.  So that would be 16 pieces of data per node.  Does that seem right, or does that seem like too much?  Another thing I don't know is where in the system we want to translate a number into a rate --- is it actually correct to just report these numbers as integers upstream, and have the collector be responsible for generating a rate, or is it better for a rate to be created locally?  (See the third sketch after this list.)
    60 60  * '''Node availability''': this is ''not'' intended as a detailed check of whether the node is usable for some particular experimental purpose --- that would be out of scope for this use case.  It's more like a simple "is this thing on?" check.  It would be fine for this to be reported as "OK" if any other data is received from the node at a given time, and "not okay" otherwise, or it would be fine for the aggregate to try to ping the node control plane and report that.  This doesn't have to be consistent, and shouldn't be complicated.
    61 61  * '''Node health metrics''': people suggested we might want to alert on RAID failures and on NTP sync issues.  I'd like to keep track of those requests, but they're not part of the initial thin thread, so they won't be included here.
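First sketch: a minimal example of the memory numbers discussed in the '''Memory utilization''' bullet, assuming the Linux /proc/meminfo field names; whether "Active" - "Cached" is the right definition is exactly the open question raised above.

{{{#!python
# Hypothetical sketch: read /proc/meminfo and report the two memory
# numbers discussed above ("Active" - "Cached", and swap free).

def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as a dict of {field name: value in kB}."""
    values = {}
    with open(path) as f:
        for line in f:
            name, rest = line.split(":", 1)
            values[name.strip()] = int(rest.split()[0])  # first token is the kB value
    return values

if __name__ == "__main__":
    mem = read_meminfo()
    mem_used_kb = mem["Active"] - mem["Cached"]  # the "Memory Used" definition described above
    swap_free_kb = mem["SwapFree"]               # simple "node too busy" alert metric
    print("mem_used_kb=%d swap_free_kb=%d" % (mem_used_kb, swap_free_kb))
}}}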
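Second sketch: a minimal "part max used"-style check as described in the '''Disk utilization''' bullet --- walk the mounted filesystems and report the fullest one.  The list of filesystem types treated as "local partitions" is an assumption.

{{{#!python
# Hypothetical "part max used" check: highest percent-full value across
# local partitions.  The LOCAL_FS_TYPES filter is an assumption about
# which mounts count as local partitions.
import os

LOCAL_FS_TYPES = {"ext2", "ext3", "ext4", "xfs", "btrfs"}

def part_max_used(mounts="/proc/mounts"):
    worst = 0.0
    with open(mounts) as f:
        for line in f:
            device, mountpoint, fstype = line.split()[:3]
            if fstype not in LOCAL_FS_TYPES:
                continue
            st = os.statvfs(mountpoint)
            used = st.f_blocks - st.f_bfree  # blocks in use
            usable = used + st.f_bavail      # blocks visible to users
            if usable > 0:
                worst = max(worst, 100.0 * used / usable)
    return worst

if __name__ == "__main__":
    print("part_max_used=%.1f%%" % part_max_used())
}}}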
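Third sketch: a minimal example of the per-node network counters described in the '''Network utilization''' bullet --- sum the eight rx/tx counters from /proc/net/dev over a group of interfaces, plus the counter-to-rate arithmetic that could live either on the node or at the collector.  The interface names used for the control and dataplane groups are assumptions; they vary per aggregate.

{{{#!python
# Hypothetical sketch of the 8 counters per traffic class described above.
# Which interfaces are "control" vs. "dataplane" is site-specific; the
# names below are placeholders.
import time

CONTROL_IFACES = {"eth0"}
DATAPLANE_IFACES = {"eth1"}

FIELDS = ("rx_bytes", "rx_packets", "rx_errs", "rx_drops",
          "tx_bytes", "tx_packets", "tx_errs", "tx_drops")

def read_counters(ifaces):
    """Sum rx/tx bytes, packets, errs, and drops over the named interfaces."""
    totals = dict.fromkeys(FIELDS, 0)
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:   # skip the two header lines
            name, data = line.split(":", 1)
            if name.strip() not in ifaces:
                continue
            v = [int(x) for x in data.split()]
            rx, tx = v[0:4], v[8:12]     # bytes, packets, errs, drop
            for key, val in zip(FIELDS, rx + tx):
                totals[key] += val
    return totals

if __name__ == "__main__":
    # Rate generation could equally happen at the collector from two raw
    # counter reports; this just shows the arithmetic.
    before = read_counters(CONTROL_IFACES)
    time.sleep(5)
    after = read_counters(CONTROL_IFACES)
    for key in FIELDS:
        print("control_%s_per_sec=%.1f" % (key, (after[key] - before[key]) / 5.0))
}}}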