Version 8 (modified by sedwards@bbn.com, 7 years ago)

Get involved in monitoring in GENI

If you are interested in the topic of monitoring, you are welcome to participate in a bi-weekly phone call (typically held on Fridays at 2PM Eastern). Meeting information and minutes are posted to the monitoring mailing list.

To get involved, sign up for the monitoring mailing list or e-mail Sarah Edwards (sedwards at bbn dot com).

Monitoring & Management

Monitoring is the act of collecting data and measuring what is happening.

Management is the act of fixing problems and responding to requests.

Monitoring focuses on areas of shared interest across GENI:

  1. Tools to aid people with responsibilities (e.g. campuses who have signed the aggregate agreement)
  2. Lower the burden of monitoring that is done in multiple places (e.g. MyPLC is monitored at many campuses)
  3. Coordination when we rely on each other (e.g. debugging a VLAN path across multiple networks)
  4. Monitoring that is truly GENI-wide (e.g. GMOC DB is currently acting as a shared repo of data)
  5. Monitoring of GENI racks

Management focuses on the ability to control GENI resources (in particular GENI racks).

Schedule

As GENI scales during Spiral 4, monitoring and management will mature in a manner that scales to include new campuses and new racks.

The Spiral 4 schedule leverages existing software to define long-term requirements for monitoring/mgmt software:

  • At GEC12: Agree on long-term requirements for monitoring/mgmt software
  • At GEC13: Agree on APIs & Architecture
  • At GEC14:
    • Design and implement relational data API
    • Rack monitoring is operational (threshold: collecting time series data; objective: collecting relational data)
    • There exist public, documented web interfaces suitable for operators and experimenters to locate and use monitoring data
    • Flesh out issues such as: event interface design (design only, not implementation), privacy of monitoring data, and consistent naming
  • At GEC15:
    • Rack monitoring is operational (threshold: collecting time series and relational data; objective: collecting event data)
    • Rack data has been collected continuously since GEC14

Related Efforts:

Related Policies:

Monitoring data:

Graphs and visualizations of monitoring data:

Monitoring Requirements

Monitoring requirements were discussed at GEC12.

GENI monitoring and management requirements include ...

  1. GENI Monitoring Requirements
    1. Information must be shareable
    2. Information must be collected
    3. Information must be available when needed
    4. Cross-GENI operational statistics must be collected and synthesized to indicate whether GENI as a whole is working
    5. Preserve the privacy of users (opt-in users, experimenters, and other users of resources)
  2. GENI Management Requirements
    1. For both debugging and security problems:
      1. Must be possible to escalate events
      2. Meta-operations and aggregate operators must work together to resolve problems in a timely manner
    2. Must be possible to do an emergency stop in case of a problem
    3. Organizations must manage GENI resources consistent with local policy and best practices
      • e.g. security procedures, logging, backups, etc.
    4. Develop policies for monitoring
    5. All parties should implement agreed-upon policies
    6. Secure GENI as a whole and secure the pieces of GENI

Monitoring Architecture Overview

The monitoring architecture was discussed at GEC13.

The monitoring architecture comprises the major actors, the interfaces between them, use cases, and the data to be collected (see slides 4-22).

The major actors are:

  • Meta-operations, which collects and makes available operational data and also generates cross-GENI monitoring data
  • The future GENI Clearinghouse, which is the authoritative repository for project, user, and slice information
  • Aggregates, which contain resources
  • Campuses, which host resources
  • Experimenters, who under limited circumstances might provide some monitoring information to meta-operations
  • Regional Networks and Backbone Networks (I2, NLR)

The major monitoring interfaces allow:

  • Aggregates/Campuses/Experimenters/Meta-operations to submit data to Meta-operations
  • Anyone to query data from Meta-operations
  • Out-of-band communication to experimenters, aggregates and campuses
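The submit and query interfaces listed above can be pictured with a small in-memory sketch. This is not the GMOC API: the class name `MonitoringStore`, its method signatures, the metric name `am_responding`, and the use of integer timestamps are all assumptions made for illustration.

```python
from collections import defaultdict


class MonitoringStore:
    """In-memory stand-in for a meta-operations data store (hypothetical)."""

    def __init__(self):
        # Maps (source, metric) to a list of (timestamp, value) pairs.
        self._samples = defaultdict(list)

    def submit(self, source, metric, timestamp, value):
        """Record one measurement submitted by an aggregate, campus, etc."""
        self._samples[(source, metric)].append((timestamp, value))

    def query(self, source, metric, start, end):
        """Return measurements with start <= timestamp <= end, oldest first."""
        points = self._samples.get((source, metric), [])
        return sorted((ts, v) for ts, v in points if start <= ts <= end)


# Submissions may arrive out of order; queries return them sorted by time.
store = MonitoringStore()
store.submit("example-aggregate", "am_responding", 300, 1)
store.submit("example-aggregate", "am_responding", 0, 1)
store.submit("example-aggregate", "am_responding", 600, 0)
recent = store.query("example-aggregate", "am_responding", 200, 600)
```

A real service would also need authentication for submitters and access control on queries, which this sketch leaves out.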

In addition, GENI monitoring relies on other interfaces which allow:

  • Access to definitive data about slices, projects, and users from the future GENI Clearinghouse (via interfaces defined by the GENI Architecture group)
  • Resolution of problems via out-of-band access between Campuses and their Regional Networks
  • Resolution of problems via out-of-band access between GMOC and GENI backbone networks (I2, NLR)

The structure of the monitoring data falls into one of three categories:

  • Time series data is a particular piece of data measured over time.
  • Relational data conveys entities and the relationship between them as observed at time T.
  • Event data denotes something happening or being noticed at a particular moment in time.
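As a rough illustration of the three categories, the sketch below models each one as a small record type. The class names, field names, and example values are hypothetical; they are not taken from any GENI or GMOC schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimeSeriesPoint:
    """One measurement of a single metric at one moment."""
    source: str          # hypothetical identifier of the reporting aggregate
    metric: str          # hypothetical metric name
    timestamp: datetime
    value: float


@dataclass(frozen=True)
class Relation:
    """Two entities and the relationship between them, as observed at time T."""
    subject: str
    relationship: str
    target: str
    observed_at: datetime


@dataclass(frozen=True)
class Event:
    """Something that happened or was noticed at a particular moment."""
    source: str
    timestamp: datetime
    description: str


def category_of(record) -> str:
    """Return which of the three categories a record belongs to."""
    if isinstance(record, TimeSeriesPoint):
        return "time series"
    if isinstance(record, Relation):
        return "relational"
    if isinstance(record, Event):
        return "event"
    raise TypeError(f"not a monitoring record: {record!r}")


now = datetime(2012, 3, 15, 14, 0, tzinfo=timezone.utc)
records = [
    TimeSeriesPoint("example-aggregate", "cpu_load", now, 0.42),
    Relation("example-sliver", "belongs_to", "example-slice", now),
    Event("example-aggregate", now, "AM stopped responding to queries"),
]
categories = [category_of(r) for r in records]
```

A time series is then simply many `TimeSeriesPoint` records that share a source and metric, while a `Relation` is only meaningful together with the time at which it was observed.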

The precision of the data can be described as either:

  • Definitive data is authoritative because it comes directly from the source that created it. The future GENI clearinghouse will contain a lot of definitive data.
  • Sampled data is collected by measuring the state of the world periodically. The Meta-operations database contains a lot of sampled data.

Monitoring use cases include:

  • An operator responds to a request from an experimenter who is unable to create a sliver at an aggregate. The operator needs to see the current and recent status of that aggregate (e.g. is the AM responding to queries?)
  • Assess the availability/utilization of GENI resources over a four-month period.
  • A core OpenFlow switch misbehaves, leading to intermittent failures. Various operators need information about whether L2 network resources are usable and the utilization of relevant network resources in order to debug the failure.
  • Assess the historic usage of network resources in order to plan for future expansion of those resources.
  • An experimenter wants to access basic information about their sliver to assess the general health and activity of their experiment.
  • An aggregate operator receives a complaint from campus IT about unusual traffic on the campus network. The aggregate operator needs to look at the current state of resources at their campus in order to demonstrate the problem is not caused by their aggregate.
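For the availability use case above, one simple approach (an assumption here, not a method prescribed by this page) is to report the fraction of periodic samples in which a resource was seen to be up. The function name and the sample layout are hypothetical.

```python
def availability(samples):
    """Fraction of samples in which the resource was reported up.

    `samples` is a sequence of (timestamp, is_up) pairs. Because the data is
    sampled, outages that fall entirely between two samples go unseen.
    """
    if not samples:
        raise ValueError("no samples in the requested period")
    up = sum(1 for _, is_up in samples if is_up)
    return up / len(samples)


# Hypothetical samples taken every 300 seconds: 8 report up, 2 report down.
samples = [(t * 300, t not in (3, 4)) for t in range(10)]
result = availability(samples)
```

The same calculation applies to a four-month window; only the number of samples changes. A shorter sampling interval narrows the gaps in which an outage can be missed.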

The data needed to support the above use cases is also part of the architecture.

In addition, several other topics, such as emergency stop, various policies (such as the Aggregate Provider Agreement and LLR), and privacy of monitoring data, are also important to monitoring in GENI.

GEC Session Summaries

GEC13 Summary

At GEC13, we had a productive discussion during the Monitoring session. The notes and slides are posted on the session wiki page.

We discussed four major items:

  1. A proposed monitoring architecture that builds on the monitoring requirements discussed at GEC12. Major components include the interfaces, use cases, and data to be collected. We reviewed these items and how they relate to the rest of GENI. No major objections to these items were raised.
  2. A proposal for privacy of GENI monitoring data. We proposed that some items, such as slice name, should always be public; some items should only be shared with GENI operators; and some items should not be shared at all. During the discussion, it was proposed that we create an operators' "code of ethics" document and an experimenters' "privacy recommendations" document to tell operators and experimenters what they need to know in order to keep information private.
  3. The status of the GMOC monitoring interfaces and web portals. In addition to improvements to the API for submitting time series monitoring data implemented during Spiral 3, work continues on an API for relational data as well as a web portal to make operational and health data easier for members of the GENI community to locate and visualize.
  4. The ongoing list of "monitoring pains". In addition to the items discussed above, progress has been made since GEC12 on packaging the monitoring software and on the release and installation of the latest OpenFlow Aggregate Manager (FOAM), which facilitates improved monitoring.
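The three privacy tiers in the privacy proposal above (always public, operators only, never shared) could be enforced with a per-field filter along these lines. Only the public status of the slice name comes from the proposal; every other field name, its tier, and the `redact` helper are hypothetical.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"        # visible to anyone
    OPERATORS = "operators"  # visible to GENI operators only
    PRIVATE = "private"      # never shared


# Only "slice_name" being public comes from the proposal; the other
# field names and their tiers are made up for illustration.
FIELD_VISIBILITY = {
    "slice_name": Visibility.PUBLIC,
    "experimenter_email": Visibility.OPERATORS,
    "packet_payload": Visibility.PRIVATE,
}


def redact(record, viewer_is_operator):
    """Return only the fields this viewer may see; unknown fields are withheld."""
    allowed = {Visibility.PUBLIC}
    if viewer_is_operator:
        allowed.add(Visibility.OPERATORS)
    return {
        key: value
        for key, value in record.items()
        if FIELD_VISIBILITY.get(key, Visibility.PRIVATE) in allowed
    }


record = {
    "slice_name": "example-slice",
    "experimenter_email": "user@example.org",
    "packet_payload": "...",
    "unlisted_field": 123,
}
public_view = redact(record, viewer_is_operator=False)
operator_view = redact(record, viewer_is_operator=True)
```

Treating unlisted fields as private means a newly added field stays hidden until someone deliberately assigns it a tier.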

GEC12 Summary

The session included a discussion of long-term GENI monitoring requirements and provided updates on several near-term topics including:

  • how the new OpenFlow aggregate manager, FOAM, makes OpenFlow management and monitoring easier;
  • a demonstration of the GMOC DB's SNAPP user interface and new Portal;
  • and a discussion of the importance of monitoring topology.

There is a summary and a link to detailed notes on the session page.

GEC11 Summary

The first half of the monitoring mini-workshop included presentations on which tools are used and what data is collected as part of monitoring today. The second half of the workshop was a discussion of issues of interest to the community, including: what OpenFlow stats are available and the difficulty of current OpenFlow Opt-In; the importance of privacy and of incorporating privacy into the design now; and concerns about how campuses could share data without necessarily giving others admin access to their resources. This community will be working to address these issues in the coming months.

Detailed notes are on the GEC11 wiki.