Version 1 (modified by sedwards@bbn.com, 9 years ago)

GEC 14 I&M and Monitoring Session Detailed Notes

Part I: The state of the world

Sarah Edwards introduced the purpose of the session. This is the first combined I&M and monitoring session. We are here because there are many tools using various techniques to collect various types of data. We believe folks in this room will benefit from and will want to share tools and data.

The session is broken into two parts. In the first part, a series of speakers will discuss some monitoring and I&M tools and data. In the second part, Jeanne Ohren will lead a discussion about common issues.

Kevin Bohan of GMOC demonstrated the new monitoring user interface.

GENI Meta-operations Center (GMOC) supports a cross-cutting meta-operations framework.

The user interface has two goals.

  1. For experimenters, provide confidence that their resources will work as they expect.
  2. For operations, provide information about their infrastructure.

GMOC collects data into a database and makes that information available via a web interface. There are two APIs for submitting data to the database: a relational API and a time series API. We are getting relational data from slice authorities, aggregates, and resources as well as getting time series data about resources.

Many entities are reporting data to GMOC (complete list in the slides) including:

  • the GPO slice authority (SA)
  • ExoGENI and InstaGENI racks
  • Aggregates
  • some Health Checks

The following information is provided by these items:

  • All entities report name, type, physical location, and operating organization.
  • Slices and slivers provide creator, creation/expiration time.
  • SAs provide a list of slices (identified by URN+UUID).
  • Aggregates include slivers (including state, corresponding slice, and containing resources)
  • Resources include interfaces as well as time series data such as CPU & Disk Utilization, Number of active VMs, Interface traffic counters, and OpenFlow control stats
  • Health Checks tell us whether the AM is responding.
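As a rough illustration of how the relational and time series data described above fit together, here is a minimal Python sketch. It is not the GMOC schema or API: the class names, field names, and the `record` helper are all assumptions made for this example. Only the listed attributes (name, type, location, organization, slices identified by URN+UUID, slivers with state and resources, resources with interfaces and time series) come from the notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SliceId:
    """A slice identified by URN plus UUID (hypothetical class name)."""
    urn: str
    uuid: str


@dataclass
class Entity:
    """Fields that every reporting entity provides."""
    name: str
    type: str
    location: str
    organization: str


@dataclass
class Sliver:
    """A sliver with its state, owning slice, and contained resources."""
    urn: str
    state: str
    slice_id: SliceId
    resources: List[str] = field(default_factory=list)


@dataclass
class Aggregate(Entity):
    """Relational data: an aggregate reports its slivers."""
    slivers: List[Sliver] = field(default_factory=list)


@dataclass
class Resource(Entity):
    """A resource reports interfaces plus time series samples."""
    interfaces: List[str] = field(default_factory=list)
    # metric name -> list of (unix timestamp, value) samples
    series: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)

    def record(self, metric: str, timestamp: int, value: float) -> None:
        """Append one time series sample for the given metric."""
        self.series.setdefault(metric, []).append((timestamp, value))
```

In this sketch the relational side (entities, slivers, slice identity) changes rarely, while `record` would be called repeatedly for metrics such as CPU utilization or VM count.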

The demonstration will highlight two use cases:

  • For experimenters, what's happening on my slice?
  • For operations, what's happening at my location?

Kevin then did a live demonstration of the GMOC DB user interface.

First, from the perspective of an experimenter who is interested in their slice:

  1. Login to GMOC DB.
  2. Slice page shows current state as of a few minutes ago including list of slices by URN, contact, last update, # slivers, # resources.
  3. Search for a particular slice (in this case tuptymon) in the search box at the top of the page.
  4. Click on detail link for that slice.
  5. Slivers tab shows slivers in that slice: aggregate, urn, expiration, status, last update.
  6. Resources tab shows some OpenFlow datapaths and a VM.
  7. Click on a resource.
  8. Resource measurements tab gives us a set of metrics. Seven metrics are currently collected, and three of them are shown here. See VM stats and click on VM count.
  9. Changing the resolution to 1 week gives a sense of the variation in this stat over time.

Second, from the perspective of operations, who is interested in the status of the resources they are running...

  1. Go to "aggregates".
  2. See list of all aggregates.
  3. Search for ExoGENI aggregates using the search box at the top of the page.
  4. Get three aggregates: 2 FOAMs and 1 Orca aggregate
  5. Click on rc-bn.exogeni.net. Look at tabs: Slivers, Resources, and Aggregate Measurements.
  6. This is a different list of measurements than we saw above.
  7. Click on OpenFlow statistics and get OpenFlow control statistics.

If you are interested in using the "protected" interface, please contact the GMOC service desk or the GENI Help Desk.

If you are interested in monitoring, join the monitoring@geni.net mailing list.

Anirban Mandal described GENI client authentication and authorization for an XMPP messaging service.

The overall vision is to have an XMPP server which acts as a conduit for messages flowing through measurement points. This work added features for authenticating clients.

There are different types of clients:

  1. client inside a slice
  2. client outside a slice (like a control framework entity)
  3. pubsub server subscribers

Authentication: "Can a client authenticate with the XMPP server using authentication mechanisms advertised by the XMPP server using GENI certificates?"

Did this by:

  • Added code to gcf.
  • Added SASL external authentication on XMPP server. This is mostly one-time configuration of XMPP server (clearinghouse certificate needs to be inserted in server's client truststore.)

Think of this as a jabber server and various measurement points act as clients.

This part isn't pub sub. This is just sending messages authenticated through a GENI structure.

Authorization: "Does an already authenticated client have credentials (rights) to publish and subscribe to a pubsub node?"

If you want to subscribe to a pubsub server, you need the appropriate credentials.

  • How does a client get credentials?
  • How are the credentials verified on the XMPP server?

Did this by:

  • Added xmppcred to gcf tool.
  • Takes certificate of client, certificate of clearinghouse, XMPP server certificate key-pair, and rights namespace (which is a set of pubsub node namespaces which say which part of the pubsub space that client has rights to).
  • Extended XMPP server code to enable credential verification.
  • Openfire pubsub policy code is augmented with GENI credential verification.
  • Verify credentials. If it works, then pubsub action is approved.
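The authorization step above can be sketched as a namespace check. This is only an illustration, not the Openfire or gcf code: it assumes pubsub nodes have path-style names and that a rights namespace grants access to everything beneath it. The function names are hypothetical, and signature verification of the credential itself is deliberately left out.

```python
def node_in_namespace(node, namespace):
    """True if the pubsub node equals the namespace or sits beneath it.

    Assumes path-style node names such as "/a/b/c" (an assumption of
    this sketch, not something the notes specify).
    """
    ns = namespace.rstrip("/")
    if not ns:
        # An empty namespace grants nothing in this sketch.
        return False
    return node == ns or node.startswith(ns + "/")


def is_authorized(node, rights_namespaces):
    """Approve a pubsub action if any granted namespace covers the node."""
    return any(node_in_namespace(node, ns) for ns in rights_namespaces)
```

Matching on `ns + "/"` rather than on a bare prefix keeps a right on `/slices/s1` from also covering `/slices/s10`.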

Two example use cases that are using this...

First example, Orca service manager publishes slice manifest on XMPP server.

  • Publishes once when the slice is created; when nodes are created, it publishes an update.
  • A manifest subscriber client subscribes to slices of interest. It is then notified when the manifest changes over time.

Second example, OMF components are communicating via an XMPP messaging service.

Martin Swany discussed active network monitoring with GEMINI.

GEMINI is an I&M system based on:

  • perfSONAR/LAMP which is a modification to general perfSONAR. Modified to understand GENI creds, etc.
  • INSTOOLS
  • Periscope (update to perfSONAR to make them more modern web services using JSON and REST interfaces)

GEMINI targets the complete I&M scenario.

  • Active measurements (from perspective of GEMINI) aren't special.
  • Active measurements need to have accurate timestamps. Tying in the measurements is trickier.
  • All measurements may perturb a user's experiment.
  • All measurements may perturb other measurements.
  • Active measurements affect infrastructure, perhaps over multiple hops.
  • (Consistent with passive measurements, but certainly some issues that need to be addressed.)

Active Measurement Tools

  • OWAMP
    • like ping but can handle one way delay (and parse apart two directions)
    • depends on clock synchronization
  • BWCTL
    • wraps bulk transfer tools to provide mutual exclusion for ongoing tests
  • ping
    • simple and ubiquitous
  • 802.1ag
    • defines an L2 ping (to be added soon)
  • traceroute
    • is in perfSONAR, but not GEMINI yet.
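To see why a one-way delay tool such as OWAMP depends on clock synchronization while a round-trip tool such as ping does not, consider the arithmetic below. This is a worked example with made-up numbers, not OWAMP code; the function name and the delay and offset values are assumptions chosen for illustration.

```python
def measured_one_way_delay(true_delay_ms, receiver_offset_ms):
    """Receiver timestamp minus sender timestamp.

    receiver_offset_ms is how far the receiver's clock runs ahead of
    the sender's clock (negative if it runs behind).
    """
    send_ts = 0.0
    recv_ts = send_ts + true_delay_ms + receiver_offset_ms
    return recv_ts - send_ts


# Host B's clock runs 5 ms ahead of host A's clock.
forward = measured_one_way_delay(10.0, 5.0)    # A -> B, true delay 10 ms
reverse = measured_one_way_delay(12.0, -5.0)   # B -> A, true delay 12 ms
round_trip = forward + reverse                 # clock offset cancels out
```

Each one-way measurement is off by the full clock offset (10 ms is reported as 15 ms, and 12 ms as 7 ms), while the round-trip sum still equals the true 22 ms.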

GEMINI at GEC14

To use these...

  • A user marks which nodes are active measurement points in request RSpecs. Runs on the user's nodes (should they be distinct?). Tells the instrumentize process to install active measurement tools.
  • The need to select nodes before the slice starts is due to code limitations.
  • WebUI provides central configuration and administration for active measurements.
  • Nodes run a local service which updates config, etc. using the UNIS service.

Performing Active Measurements

  • Two classes:
    • on demand -- for debugging an issue
    • regular testing -- make sure things remain unchanged, or measure changes over time
  • perfSONAR includes metadata about measurements.

Future Issues

  • Work towards a single framework for measurements. Make it easy to extend with new tools. perfSONAR protocols were intended to be extensible, but code hasn't supported that over time.
  • Interaction between intra-slice and infrastructure measurements.
  • We'd like to get high-rate, very frequent measurements. So lots of measurement activity. Lots in substrate, lots in slices.
  • Dedicated nodes might provide better info, but users need to be able to request that.
  • Coordination and sharing of active measurements is something to discuss.

Prasad Calyam described doing measurements on Layer 2 and OpenFlow paths.

Prasad provided an experimenter perspective on layer2 and OpenFlow slices.

  • Running I&M related experiments.
  • Running some OpenFlow slices.
  • Allocating thin clients across a long path.
  • Experiment slice active since GEC13.
  • I&M slice since April.

Use Case:

  • Run active measurements to check connectivity and performance as part of layer 2/OpenFlow slice monitoring.
  • Schedule experiment and active measurement traffic in a conflict-free manner, and use measurement intelligence for adaptation.
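One simple way to picture conflict-free scheduling is to treat experiment traffic as reserved time windows and place each active measurement in the first gap that fits. The sketch below is only an illustration under that assumption; it is not the scheduler used in these slices, and the function names and window values are hypothetical.

```python
def overlaps(a, b):
    """Half-open intervals (start, end) overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def first_free_start(busy, duration, earliest=0):
    """Earliest start >= earliest where [start, start + duration) avoids all busy windows.

    busy is a list of (start, end) windows reserved for experiment traffic.
    """
    start = earliest
    for window in sorted(busy):
        if overlaps((start, start + duration), window):
            # Push the measurement to the end of the conflicting window.
            start = window[1]
    return start
```

For example, with experiment traffic reserved in `[(0, 10), (12, 20)]`, a 5-unit measurement does not fit in the 2-unit gap and is placed at 20; with `[(0, 10), (15, 20)]` it fits at 10.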

He has had two slices running for some months, exercising various test scenarios (see slides for details).

Brief Results

  • traceroute results vary between IP and OpenFlow.
  • Showed impact of competing traffic in different slices on same network.

Conclusion

  • I&M has a separate platform requirement compared to others.
  • L2 connectivity troubleshooting is required.
  • Impact of I&M can be seen on the experiment if measurement conflict occurs.
  • Sample OpenFlow slice RSpecs from GPO are helpful.

Next steps

  • orchestrate measurements
  • easier methods to integrate app metrics
  • provide OpenFlow slice traffic visibility -- would help experimenters a lot.

Part II: Discussion

Jeanne Ohren of the GPO introduced the discussion with some overarching questions for us to discuss.

Some issues...

  1. Consistent naming of resources and devices
    • example - consistent names
      • two aggregates share a link. Endpoint names need to be consistent.
    • example - globally unique
      • 3 ways of identifying same slice: URN, UUID, and slice name
      • consumer of this data might need to determine if two slivers belong to the same slice.
      • Growing consensus to identify slices by a combination of slice URN and slice UUID because the combination is unique over time and space.
      • GENI AM API v3 adopted this. Monitoring and one I&M group adopting this as well.
    • Question: How does this affect other projects? What other types of naming examples do we need to worry about?
  2. Data transport example
    • Example:
      • An aggregate collects data about slivers and the resources in slivers, etc., and reports to GMOC.
      • Experimenter interested in resources available at aggregates.
      • Operator is interested in statistics on the slivers that have been created/deleted over a period of time.
    • How does each of these parties access the data?
      • Aggregates push data to GMOC (using GMOC API)
      • The future GENI CH will provide an API to pull data on slices, users, and projects.
      • IMF and others provide a pub/sub interface
      • I&M tools provide the ability for users to push data to an archive with metadata
      • iRODS account holders can control who has access to this data.
    • Currently we're transporting data.
      • Consider: Access control. How do we make sure the right people can access the data? How do we keep the wrong people from accessing the data?
      • Reliability: How do we ensure data is recorded properly?
    • Question: Can we work together to get access to good, reliable data?
  3. Some data sources, quickly... (full list in slides)
    • Relational data collected by GMOC
    • Time-series data collected by GMOC
    • Active network measurement data collected by I&M tools
    • Passive host measurement data collected by I&M tools
    • Measurement Data Object Descriptor
    • Other independent monitoring tools
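The naming issue in the list above (deciding whether two slivers belong to the same slice) can be illustrated with a short sketch that keys slices on the combination of URN and UUID. The record layout, field names, and URN strings here are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def group_slivers_by_slice(slivers):
    """Group sliver URNs by (slice URN, slice UUID).

    Each sliver is a dict with the hypothetical keys
    "sliver_urn", "slice_urn", and "slice_uuid".
    """
    groups = defaultdict(list)
    for s in slivers:
        groups[(s["slice_urn"], s["slice_uuid"])].append(s["sliver_urn"])
    return dict(groups)


# A slice name reused over time yields the same URN but a new UUID.
records = [
    {"sliver_urn": "sliver-1", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-a"},
    {"sliver_urn": "sliver-2", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-a"},
    {"sliver_urn": "sliver-3", "slice_urn": "urn:example:slice+demo", "slice_uuid": "uuid-b"},
]
grouped = group_slivers_by_slice(records)
```

Keying on the URN alone would merge all three slivers into one slice; the combined key keeps the later slice (`uuid-b`) separate, which is the "unique over time and space" property mentioned above.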

Discussion

Data naming

  • How has the lack of globally unique and consistent naming affected other projects?
  • What are some other data naming examples?

Prasad: Experiment itself has a lot of data it is generating. Lots of app specific measurements that are really critical. Processed measurement in addition to active measurements.

Justin Cappos: Seattle nodes don't have consistent IPs. Use public keys to identify nodes. Do lookups based on keys, but nodes will put in information that helps build a hierarchy of data so we can find things.

Data transport

  • What are you using that others might find useful?
  • How can we all walk away from the table with access to good, reliable data?

Sarah Edwards: Control framework folks have lots of experience with authentication and authorization.

Chaos: Could GENI credentials be used to access data? How much more could we be doing with GENI credentials we already have? Are there APIs that would make it easier to plug in GENI credentials?

Justin Cappos: (Stole from Amazon) API key interface that is easily regeneratable. Don't want this to be a private key or a password for a user. Could use a less privileged mechanism which allows you to regenerate keys.
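A regenerable API key of the kind described here could look roughly like the following sketch. It is an illustration only, not Seattle's or Amazon's implementation: the `ApiKeyStore` class and its methods are hypothetical, and it assumes the service stores only a hash of the current key so that regenerating a key invalidates the old one.

```python
import hashlib
import hmac
import secrets


class ApiKeyStore:
    """Holds one regenerable API key per user (hypothetical sketch)."""

    def __init__(self):
        # user -> SHA-256 hex digest of that user's current key
        self._hashes = {}

    def regenerate(self, user):
        """Issue a fresh random key, replacing any previous key."""
        key = secrets.token_urlsafe(32)
        self._hashes[user] = hashlib.sha256(key.encode()).hexdigest()
        return key

    def verify(self, user, key):
        """Check a presented key against the stored hash."""
        expected = self._hashes.get(user)
        if expected is None:
            return False
        actual = hashlib.sha256(key.encode()).hexdigest()
        # Constant-time comparison of the two digests.
        return hmac.compare_digest(expected, actual)
```

Because the key is separate from the user's private key or password, a leaked key can be replaced with a single `regenerate` call without touching the user's main identity.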

Sarah: prototype CH has some ideas with InCommon

Chaos: machines authenticating in a privileged way.

Martin Swany: in GEMINI using proxy certificates to allow shorter-lived sub-identities

General point: Two authentication issues: users logging in (well understood) and machines sharing data.

Machine-to-machine transactions may involve a smaller set of users.

What other issues have you encountered?

Clock synchronization?

Justin: Lots of issues especially related to crypto. In places we need to do this. We use NTP data. Lots of places have NTP blocked. We run our own NTP nodes and tunnel out with public IP addresses.

Chaos: Are you able to easily detect machines that are off?

Justin: We control the nodes. We refresh NTP every day. We have 20 sec expiry time on crypto ops.
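The link between clock synchronization and a short expiry time can be sketched as a timestamp freshness check. Only the 20-second window comes from the discussion; the function name, the use of Unix timestamps, and the choice to reject timestamps from the future are assumptions of this example.

```python
import time

EXPIRY_SECONDS = 20  # expiry window mentioned in the discussion


def is_fresh(signed_at, now=None, expiry=EXPIRY_SECONDS):
    """Accept an operation only if its timestamp falls within the expiry window.

    signed_at and now are Unix timestamps in seconds. Timestamps from
    the future are rejected too (an assumption of this sketch).
    """
    if now is None:
        now = time.time()
    age = now - signed_at
    return 0 <= age <= expiry
```

With a window this short, a node whose clock has drifted by 30 seconds in either direction would have every operation rejected, which is why regular NTP refreshes matter.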