Version 5 (modified by sedwards@bbn.com, 12 years ago)

Monitoring

Schedule

Wednesday, 4:00pm - 5:30pm

Session Leaders

Sarah Edwards and Chaos Golubitsky, GENI Project Office

Original Session Description

At this working session, we will review and aim to approve an updated GENI monitoring architecture that satisfies the monitoring requirements discussed at the last GEC. In addition, we will discuss extensions to the design of the monitoring API for relational, event, and time series data. Finally, we will review the status of solutions to previously identified "pain points" of concern to the monitoring and operations community. Campus IT staff, aggregate operators and software developers are encouraged to attend.

Agenda / Details

  • Monitoring Architecture
    • Sarah Edwards, GPO
  • Monitoring Privacy
    • Heidi Picher Dempsey, GPO
  • Monitoring APIs + Experimenter/Operator Portal
    • Chaos Golubitsky, GPO
  • Status of Monitoring Pains
    • Sarah Edwards, GPO

Notes

Sarah Edwards described the major elements of a proposed monitoring architecture: its actors, the interfaces between them, use cases, and the data to be collected. (See slides 4-22.)

The major actors are:

  • Meta-operations, which collects and makes available operational data and also generates cross-GENI monitoring data
  • The future GENI Clearinghouse, which is the authoritative repository for project, user, and slice information
  • Aggregates, which contain resources
  • Campuses, which host resources
  • Experimenters, who under limited circumstances might provide some monitoring information to meta-operations
  • Regional Networks and Backbone Networks (I2, NLR)

The major monitoring interfaces allow:

  • Aggregates/Campuses/Experimenters/Meta-operations to submit data to Meta-operations
  • Anyone to query data from Meta-operations
  • Out-of-band communication to experimenters, aggregates and campuses

In addition, GENI monitoring relies on other interfaces which allow:

  • Access to definitive data about slices, projects, and users from the future GENI Clearinghouse (via interfaces defined by the GENI Architecture group)
  • Resolution of problems via out-of-band access between Campuses and their Regional Networks
  • Resolution of problems via out-of-band access between GMOC and GENI backbone networks (I2, NLR)

The structure of the monitoring data falls into one of three categories:

  • Time series data is a particular piece of data measured over time.
  • Relational data conveys entities and the relationships between them as observed at time T.
  • Event data denotes something happening or being noticed at a particular moment in time.
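The three categories can be illustrated with a minimal Python sketch. All class and field names below are hypothetical, chosen only for illustration; they are not taken from the GMOC APIs or from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TimeSeries:
    """One particular quantity measured repeatedly over time."""
    subject: str                       # hypothetical identifier, e.g. a node
    metric: str                        # hypothetical metric name
    points: List[Tuple[float, float]]  # (unix timestamp, measured value)


@dataclass
class Relation:
    """Two entities and how they relate, as observed at time T."""
    observed_at: float  # the time T of the observation
    source: str
    kind: str           # hypothetical relationship label
    target: str


@dataclass
class Event:
    """Something that happened or was noticed at one moment."""
    occurred_at: float
    subject: str
    description: str


def category(record: object) -> str:
    """Return which of the three structural categories a record belongs to."""
    if isinstance(record, TimeSeries):
        return "time series"
    if isinstance(record, Relation):
        return "relational"
    if isinstance(record, Event):
        return "event"
    raise TypeError(f"unknown record type: {type(record).__name__}")
```

For instance, a slice holding a sliver on an aggregate would be a `Relation`, the load on a node would be a `TimeSeries`, and an aggregate manager ceasing to respond would be an `Event`.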

The precision of the data can be described as either:

  • Definitive data is authoritative because it comes from the source that created it. The future GENI Clearinghouse will contain a lot of definitive data.
  • Sampled data is collected by measuring the state of the world periodically. The Meta-operations database contains a lot of sampled data.
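Because sampled data only sees the world at polling times, anything derived from it is an estimate rather than a definitive value. As a rough sketch, assuming a hypothetical list of `(timestamp, responded)` polls of an aggregate manager, availability could be estimated as the fraction of polls that succeeded. The function name and data layout are assumptions for illustration, not part of any GENI tool.

```python
from typing import Iterable, Tuple


def sampled_availability(samples: Iterable[Tuple[float, bool]]) -> float:
    """Estimate availability as the fraction of polls that got a response.

    Each sample is (unix timestamp, responded). Outages that start and end
    between two polls are invisible to this estimate.
    """
    results = [responded for _, responded in samples]
    if not results:
        raise ValueError("no samples to estimate from")
    return sum(results) / len(results)


# Four polls taken five minutes apart; the third one failed.
polls = [(0.0, True), (300.0, True), (600.0, False), (900.0, True)]
print(sampled_availability(polls))  # 0.75
```

A shorter polling interval narrows the gap between the sampled estimate and what actually happened, at the cost of more load on the resource being polled.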

Several use cases were reviewed with notes about the data required to support them:

  • An operator responds to a request from an experimenter who is unable to create a sliver at an aggregate. The operator needs to see the current and recent status of that aggregate (e.g., is the AM responding to queries?).
  • Assess the availability/utilization of GENI resources over a four-month period.
  • A core OpenFlow switch misbehaves, leading to intermittent failures. Various operators need information about whether L2 network resources are usable and the utilization of relevant network resources in order to debug the failure.
  • Assess the historic usage of network resources in order to plan for future expansion of those resources.
  • An experimenter wants to access basic information about their sliver to assess the general health and activity of their experiment.
  • An aggregate operator receives a complaint from campus IT about unusual traffic on the campus network. The aggregate operator needs to look at the current state of resources at their campus in order to demonstrate that the problem is not caused by their aggregate.

The data that must be collected to support the above use cases was also reviewed.

Discussion:

Heidi Picher Dempsey asked about the scope of the use cases. Sarah Edwards responded that use cases aren't meant to be exhaustive, but are based on real things we wanted to know over the fairly short time we've been monitoring GENI. If people know of data or use cases that they think are important, talk to Sarah about adding them.

Victor Orlikowsky noted that I&M overlaps with monitoring. The group discussed the distinction for a while. In particular, Heidi noted that "we don't want every experimenter to go out and collect all of the same data."

Heidi Picher Dempsey described a proposal for privacy of monitoring data in GENI. (Slides 24-31)

Heidi said that the goal of the proposal is to make some decisions. We've been collecting data, but we've given no guidance about what to do about it. This proposal is based on input from Adam Slagell.

General Recommendation:

  • Data that CAN be shared publicly: existence of slice (urn, UUID); slice is active (has resources); slice name
  • Some data CAN be shared IF you have access controls. Aggregates should make sure they are not sharing user data. LLR requests are a special case.
  • Data that should NOT be shared: opt-in user data; data that identifies experimenters by username/real name; data that identifies experimenter contact info
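The three tiers of this recommendation could be applied as a simple field filter. The sketch below is only an illustration: the field names, the `shareable_view` function, and the `has_access` flag are all hypothetical, and how access is actually granted is left out.

```python
# Hypothetical field names for the "CAN be shared publicly" tier.
PUBLIC_FIELDS = {"slice_urn", "slice_uuid", "slice_active", "slice_name"}

# Hypothetical field names for the "should NOT be shared" tier.
NEVER_SHARED_FIELDS = {"username", "real_name", "contact_email", "opt_in_user_data"}


def shareable_view(record: dict, has_access: bool = False) -> dict:
    """Return the subset of a record that may be shown to a requester.

    Public fields are always included, never-shared fields are always
    dropped, and every other field is included only when the requester
    has passed an access control check.
    """
    view = {}
    for key, value in record.items():
        if key in NEVER_SHARED_FIELDS:
            continue
        if key in PUBLIC_FIELDS or has_access:
            view[key] = value
    return view
```

Defaulting every unlisted field to the access-controlled tier is a design choice made here so that a newly added field is not exposed by accident; the proposal itself does not specify a default.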

Privacy recommendations to experimenters

  • GENI needs to warn experimenters that slice name is public
  • Slice credential will (soon) include a slice e-mail. If you don't want it shared, then don't use a personal email.
  • Your personal email is part of your experimenter credential and will be shared with the aggregate and GMOC. It should not be shared outside GMOC.

Discussion:

Victor suggested that we create two documents:

  1. An operator "Code of Ethics": Have a privacy policy of what we will or won't do with your data. (Chaos Golubitsky suggested looking at the sysadmin code of ethics for guidance).
  2. "Experimenter's privacy recommendations": Here's what you should do/not do as an experimenter.

One person expressed surprise that slice name is public and suggested that the warning about slice name being public be made front and center.

Sarah noted that basic information about an experiment (e.g., nodes used, interface stats) will be publicly available.

Chaos Golubitsky presented on monitoring interfaces and the experimenter portal, including updates done in conjunction with Mitch McCracken. (Slides 33-43)

GMOC has been collecting slice metadata, measurement data, and operational data from GENI aggregates and slice authorities. Starting in Spiral 4, they will run the Service Desk, which will detect and respond to operational problems in GENI. Chaos introduced Mitch McCracken, GMOC's primary software developer working on GENI monitoring; Mitch started working on monitoring data submission and interfaces in January.

GMOC tools support collecting, displaying, and alerting on operational slice metadata and measurement data.

Work to improve these tools has focused on:

  • Standardizing data submission to make it easy for new projects (like racks) to submit data to GMOC and to provide consistent data naming
  • Improving user interfaces so that experimenters and operators can easily find data of interest. In particular, data should be tied together in standard ways and operational health data should be easy to find.

Currently there are two APIs for submitting data: a measurement API for submitting time series data, and a relational API for submitting metadata. In Spiral 3, GMOC and GPO tested the measurement API and collected a year of data, improving reliability and tools for data submission.

Chaos demonstrated progress with the GMOC Experimenter web interface, which supports finding and viewing time series data for slices and nodes. In addition, slice data is now submitted from the GPO's SA using the relational API.

In Spiral 4, work will continue to test and improve the relational data API, refine the time-series API, and collect data from the new racks. In addition, work will continue on improving the user interfaces, including: improved documentation, improved health reports, tying together relational data with time-series data, consistent use of URNs and UUIDs (as with other GENI entities) to distinguish slices over time, and support for specific use cases.

Chaos then invited participation by others on the following items:

  • early adopters for the experimenter web portal
  • ideas for GENI health tests that people are running or would like to see run
  • feedback

Sarah Edwards reviewed the status of the ongoing list of "monitoring pains". (Slides 45-47)

Items which have made progress since the last GEC are listed on the slides with check marks:

  • Chaos and Mitch have made progress on packaging the plastic slices monitoring software
  • Nick Bastin and Josh Smift have released FOAM (the OpenFlow Aggregate Manager) and installed it throughout the meso-scale, which supports improved monitoring
  • The experimenter web interface improves the ability to find data by slice
  • Heidi (based on input from Adam Slagell) outlined a proposal regarding privacy of monitoring data at this session
  • At this session, Sarah presented some of the monitoring data that should be collected

For more information please contact one of the following people or attend the biweekly monitoring call...

Contacts:

Biweekly monitoring conference call:

  • Every other Friday at 2pm Eastern
  • Information on GENI monitoring mailing list

GENI operational monitoring list: monitoring@geni.net (Sign up at http://lists.geni.net)
