Monitoring mini workshop notes

Sarah Edwards of GPO introduced the session. Monitoring fits in the context of a number of other GEC sessions, including most of the campus track sessions. It is related to, but distinct from, the instrumentation and measurement session.

Why are we here?

* We're trying to operate aggregates as if they're production, so we need monitoring tools now.
* We should share information about what tools we have and what tools we need, and use those capabilities and requirements to set priorities.
* Sites which have signed aggregate provider agreements need to meet operations requirements. Shared monitoring information may make their jobs easier.

Sharing data on resources that multiple campuses use keeps campuses from reinventing the wheel, and it also makes joint debugging of shared infrastructure more feasible.

Camilo Viecco of GMOC spoke about GMOC data monitoring options.

GMOC has three functions:

1. To provide a unified view of GENI.
2. To be an initial point of contact for operations.
3. To provide monitoring visualization and monitoring APIs.

They try to be "the place for unified data" in general, with a focus on infrastructure data. GMOC has measurement and alerting tools. The measurement tools are "production-ready", but notifications need more work to become targeted (fewer false negatives and false positives) and actionable (people are correctly notified about things they actually control) before they can be deployed in production.

GMOC's measurement tools focus on long-term trend analysis. They use SNAPP, Ganglia, and Measurement Manager. SNAPP is a highly scalable SNMP collection package; it currently monitors Internet2, NLR, and ProtoGENI backbone switches, as well as Indiana's own switches. SNAPP includes a visualization interface with support for data tagging: not just the data itself, but metadata about it. Having data submitted with meaningful units has been a big issue for GMOC.
GMOC collects data actively, and also accepts data submission, using a simple XML file format for time-series data. They have a generator which uses a configuration file to submit data from a set of RRDs (for integration with Ganglia or other RRD-based mechanisms), and uses REST for HTTP or HTTPS submission. To submit data, either use this API and submit data as often as you like (submitting data multiple times a minute shouldn't stress their interface, though this has not yet been fully tested), or provide SNMP access to GMOC so they can monitor devices directly.

Q: How much recently-submitted data can you get via the remote API? Can you access the RRDs directly?
A: You can get the most recent 10 minutes of data using the public API, but there is no direct access to the RRD files, due to issues with compatibility and RRD internals.

Q: How much load does data submission add to a client?
A: In GMOC's experience, it probably doesn't increase CPU utilization by more than 5%.

Nick Bastin of Stanford University spoke about monitoring in OpenFlow.

Nick advocated the use of SNMP for monitoring, stressing that it is a proven interface which can be extended to support new things, and that receiving traps when things change is preferable to polling for changes. However, OpenFlow relies heavily on custom monitoring at present, because MIBs don't exist for many things which need them. Dataplane monitoring should probably use SNMP, and they hope to add SNMP support to the FlowVisor soon. That said, there are limits to exposing information to experimenters via direct SNMP, because the ACLs may not be granular enough for GENI. Still, SNMP should be used and encouraged by researchers, network operators, and protocol designers whenever possible.

In response to a question, Nick stressed that OpenFlow shouldn't change your network monitoring situation: since the gear continues to work the same way at the hardware level, existing tools should continue to work.
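As a rough illustration of GMOC's submission interface described earlier, a client might render time-series samples as a small XML document and POST it over HTTPS. The XML schema, element names, and endpoint URL below are invented for illustration (the actual GMOC format was not specified in the session); note the explicit units, since GMOC reported that data submitted without meaningful units has been a recurring problem.

```python
# Hypothetical sketch of a GMOC-style time-series submission over REST.
# Schema, field names, and endpoint are assumptions, not GMOC's real API.
import urllib.request
import xml.etree.ElementTree as ET

def build_timeseries_xml(metric, unit, samples):
    """Render (timestamp, value) samples as a small XML document.

    Units are carried explicitly so the collector can interpret the data.
    """
    root = ET.Element("timeseries", metric=metric, unit=unit)
    for ts, value in samples:
        ET.SubElement(root, "sample", t=str(int(ts)), v=str(value))
    return ET.tostring(root, encoding="unicode")

def submit(xml_doc, endpoint="https://gmoc.example.org/submit"):
    """POST the XML document to a (hypothetical) collection endpoint."""
    req = urllib.request.Request(
        endpoint, data=xml_doc.encode(), method="POST",
        headers={"Content-Type": "application/xml"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

doc = build_timeseries_xml("in_octets", "bytes", [(0, 1200), (60, 3400)])
print(doc)
```

A generator like GMOC's RRD integration would simply loop over RRD exports, call `build_timeseries_xml` per metric, and `submit` the results on a timer.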
In addition, gathering statistics via OpenFlow has approximately the same effect on switch performance as native SNMP does: it doesn't cause a performance problem, but it doesn't solve one either.

The group discussed how to package SNMP-gathered data so that experimenters and remote operators can use it. Per-slice ACLs could be applied at the switch level (using SNMP directly) or at the display level (using an SNMP proxy to aggregate the data per slice). At present, FlowVisor provides reasonable statistics for the OpenFlow control plane via its XMLRPC API, but there is still a big dataplane monitoring problem. Thus, the first step for SNMP support in FlowVisor would be to add MIB-2 data collected and reported by the FlowVisor about the switches in a slice. Since FlowVisor virtualizes your topology, it needs to expose that in monitoring. They are actively working on this reporting, but can't promise it by GEC12.

Sarah Edwards of GPO spoke about monitoring requirements reported by campuses.

Sarah spoke with a number of GENI participants and asked what monitoring tools they have, and what they want. A variety of requests from operators were raised at GEC10, including a "slice top" utility to identify the slices using the most resources. Operators were also keen to get data which could prove that a problem was *not* originating at their site.

She interviewed operators Russ Clark at Georgia Tech, Chaos Golubitsky at GPO, and Chris Small and John Meylor at Indiana University. Russ Clark mentioned that, while live or recent statistics from remote switches would be helpful, remote SNMP access might be a non-starter at his site; motivated sites could publish and link to local data sources as one way around this. The Indiana admins said they would like visualizations to include per-campus or per-aggregate views of collected data.

Sarah also interviewed Mark Berman and Niky Riga at GPO to get an experimenter perspective.
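The "display level" option discussed above, an SNMP proxy that aggregates already-gathered data per slice, can be sketched in miniature. The data layout and the slice-to-port mapping here are invented for illustration; a real proxy would draw its ACLs from the aggregate's reservation records.

```python
# Minimal sketch of a per-slice filtering proxy over SNMP-gathered data.
# All names and structures are hypothetical, for illustration only.

# Raw interface counters as gathered from switches, keyed by (switch, port).
raw_counters = {
    ("switch-a", 1): {"in_octets": 1200, "out_octets": 800},
    ("switch-a", 2): {"in_octets": 50, "out_octets": 75},
    ("switch-b", 1): {"in_octets": 9000, "out_octets": 100},
}

# Which slice owns which (switch, port): the proxy's per-slice ACL.
slice_ports = {
    "slice-red": {("switch-a", 1), ("switch-b", 1)},
    "slice-blue": {("switch-a", 2)},
}

def view_for_slice(slice_name):
    """Return only the counters this slice is allowed to see."""
    allowed = slice_ports.get(slice_name, set())
    return {key: stats for key, stats in raw_counters.items() if key in allowed}

print(view_for_slice("slice-blue"))
```

The same filtering logic would work whether the ACL is enforced at the switch (SNMP views) or at the proxy; the proxy variant avoids the granularity limits of switch ACLs that Nick raised.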
They want to see information about what resources exist, and which are currently available, for GENI use. They also want standardized information across sites, and, as a troubleshooting aid, a per-slice view into data. In addition, more OpenFlow topology data would be helpful for debugging and analysis.

Rob Ricci of ProtoGENI and Tony Mack of PlanetLab gave some input on the monitoring tools their aggregates provide.

ProtoGENI does not provide a central monitoring solution, counting on slice owners to select their own monitoring options for their experiments. However, the graphical experimenter tool Flack has recently been integrated with Instools, a measurement and collection service, giving experimenters some easy measurement access.

PlanetLab provides three monitoring interfaces: CoMon reports per-node (and per-sliver-per-node) statistics; PlanetFlow reports on all traffic flows into and out of a node, and is useful for diagnosing abuse problems; and the MyOps tool reports on node availability as seen by PLC, and is targeted at site operators.

The session concluded with a broad discussion of GENI monitoring topics, with a goal of identifying next steps.

Srini Seetharaman and Masa Kobayashi of Stanford discussed the tool they have deployed for measuring OpenFlow performance internals. It is an active testing tool which uses ping data between nodes, together with OpenFlow topology information, to detect flow setup problems. They encourage interested campuses to help test and deploy it.

There was a lot of discussion about what information FlowVisor currently provides for monitoring. FlowVisor developer Rob Sherwood enumerated some features which exist today. You can get OpenFlow message counts, i.e. per-slice/per-dpid/per-type information about the OpenFlow control messages being sent and received. FlowVisor uses modified LLDP to contact its neighbors and look for nearby FlowVisors, so you can get information about that topology.
However, if there are non-OpenFlow-controlled devices between FlowVisors, they will not be visible in this topology.

Stanford is currently debugging a capability to provide per-slice and per-switch information about what is in the flowtable, so that the translation between what the slice requested and what the switch can provide is visible. If you want to check the health of a FlowVisor, you can register for traps (callbacks) when links go up and down. In addition, the API "ping" call reports general FlowVisor health, as well as the FlowVisor version. CLI tools currently wrap all of this functionality, but there is no known GUI.

Many people were interested in how privacy and monitoring will interact in GENI. During the GEC10 monitoring breakfast, we proposed a short-term agreement: for the next 18 months (until around November 2012/GEC15), anyone who uses the mesoscale infrastructure should assume their use is public. The goal of this agreement was to take the heat off, so we could implement and debug things without worrying about short-term privacy breaches. However, even if we don't have to get privacy right immediately, we still need to build it in and debug it now, while we can.

Nick Bastin commented that if you know what's available in OpenFlow, then you know what's not available (i.e. what traffic types someone has reserved); knowing that, and having direct switch access, you can break someone else's experiment. Scott Karlin of Princeton commented that it's important to think about security and privacy now: protecting information by default might be a good idea, since it's currently prohibitively difficult to figure out whether each piece of information is security-related. Karen Sollins of MIT commented that security needs to be designed in, because users and sites will be added over time.
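The per-slice/per-dpid/per-type message counts that Rob Sherwood described amount to a three-key tally over control traffic. FlowVisor itself is written in Java; this small Python structure only illustrates the reported breakdown, with slice names, dpids, and message types invented for the example.

```python
# Illustrative bookkeeping behind FlowVisor-style message counts:
# OpenFlow control messages tallied per slice, per datapath ID (dpid),
# and per message type. Not FlowVisor's actual implementation.
from collections import Counter

counts = Counter()

def record(slice_name, dpid, msg_type):
    """Tally one OpenFlow control message."""
    counts[(slice_name, dpid, msg_type)] += 1

# Simulated control traffic.
record("slice-red", "00:00:00:00:00:01", "PACKET_IN")
record("slice-red", "00:00:00:00:00:01", "PACKET_IN")
record("slice-red", "00:00:00:00:00:01", "FLOW_MOD")
record("slice-blue", "00:00:00:00:00:01", "PACKET_IN")

print(counts[("slice-red", "00:00:00:00:00:01", "PACKET_IN")])  # -> 2
```

A "slice top" utility of the kind operators requested at GEC10 could be built directly on such counters, by sorting slices by their total message volume.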
Many campuses have strong restrictions on what data can be collected, and may not allow deployments which aren't compatible with those restrictions.

Matt Zekauskas of Internet2 said that he'd find it useful to have some publicly available data characterizing OpenFlow use. Even if the specifics have to be elided, it's helpful to know usage details such as which ethertypes experimenters tend to reserve.

The question of direct (non-OpenFlow) access to switch data obtained via SNMP and reported by campuses was raised, but attendees were unsure whether it would be useful; a robust SNMP proxy might serve the same function.

K.C. Wang of Clemson University noted that many operators find OpenFlow experimenter opt-in decisions very difficult, so monitoring tools which make it easier to determine whether an opt-in is safe would be useful. Ivan Seskar of Rutgers requested notifications: he uses local monitoring for performance data relevant to his testbed, but wants centralized monitoring to tell him when GENI things are not working.
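As a closing illustration, the active-testing approach Srini and Masa described, combining pairwise ping results with OpenFlow topology information to detect flow setup problems, might reduce to a check like the following. The data structures and logic are guesses for illustration; the Stanford tool's actual inputs were not described in detail.

```python
# Hypothetical sketch of active flow-setup testing: flag topology links
# whose endpoints cannot ping each other. Structures are invented.

# Links known from OpenFlow topology discovery, as (node, node) pairs.
topology_links = [("h1", "h2"), ("h2", "h3"), ("h1", "h3")]

# Result of an active ping sweep between endpoint pairs.
ping_ok = {("h1", "h2"): True, ("h2", "h3"): False, ("h1", "h3"): True}

def suspect_links(links, results):
    """Flag links the topology says should work but pings say don't.

    A failed ping across such a link suggests a flow setup problem,
    e.g. the controller never installed the needed flow entries.
    """
    return [link for link in links if not results.get(link, False)]

print(suspect_links(topology_links, ping_ok))  # -> [('h2', 'h3')]
```

Pairs missing from the ping results are conservatively flagged too, so an incomplete sweep surfaces as suspect links rather than silent gaps.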