GEC11MonitoringMiniWorkshop: MonitoringNotes.txt

File MonitoringNotes.txt, 10.7 KB (added by sedwards@bbn.com, 13 years ago)

Notes from meeting.

Monitoring mini workshop notes:

Sarah Edwards of GPO introduced the session.

Monitoring fits in the context of a number of other GEC sessions,
including most of the campus track sessions.  It is related to, but
distinct from, the instrumentation and measurement session.

Why are we here?
 * We're trying to operate aggregates as if they're production, so
   we need monitoring tools now.
 * We should share information about what tools we have and what
   tools we need, and use those capabilities and requirements to
   set priorities.
 * Sites which have signed aggregate provider agreements need to
   meet operations requirements.  Shared monitoring information may
   make their jobs easier.  Sharing data on resources that multiple
   campuses use prevents campuses from needing to reinvent the
   wheel.  Sharing data also makes joint debugging of shared
   infrastructure more feasible.


Camilo Viecco of GMOC spoke about GMOC data monitoring options:

GMOC has three functions:
 1. To provide a unified view of GENI
 2. To be an initial point of contact for operations
 3. To provide monitoring visualization and monitoring APIs
They try to be "the place for unified data" in general, with a focus
on infrastructure data.  GMOC has measurement and alerting tools.
The measurement tools are "production-ready", but notifications
need more work to be targeted (fewer false negatives and false
positives) and actionable (people are correctly notified about
things they actually control) before they can be deployed in
production.

The focus of GMOC's measurement tools is on long-term trend analysis.
They use SNAPP, ganglia, and Measurement Manager.  SNAPP is an SNMP
collection package which is very scalable.  It is currently monitoring
Internet2, NLR, and ProtoGENI backbone switches, as well as Indiana's
own switches.  SNAPP contains a visualization interface which has
support for data tagging --- not just what data is, but metadata
about it.  Having data submitted with meaningful units has been a
big issue for GMOC.

GMOC collects data actively, and also allows data submission, using
a simple XML file for time-series data.  They have a generator which
uses a configuration file to submit data from a set of RRDs (for
integration with ganglia or other RRD-based mechanisms), and uses
REST for HTTP or HTTPS submission.  To submit data, either use this
API and submit data as fast as you like (submitting data multiple
times a minute shouldn't stress their interface, though it has not
yet been fully tested), or provide SNMP access to GMOC so they can
monitor devices directly.
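As a rough illustration of the "simple XML file for time-series
data" idea, here is a minimal Python sketch.  The element and
attribute names (timeseries, sample, node, metric, units) and the
example host name are assumptions made up for this sketch; the notes
do not record GMOC's actual schema or submission URL, so the HTTP
POST step is left out.  Units are carried explicitly, since
meaningful units were called out as an issue.

```python
import xml.etree.ElementTree as ET


def build_timeseries_xml(node, metric, units, samples):
    """Build a small time-series XML document (hypothetical schema).

    samples is a list of (unix_timestamp, value) pairs.
    """
    root = ET.Element(
        "timeseries", {"node": node, "metric": metric, "units": units}
    )
    for timestamp, value in samples:
        ET.SubElement(
            root,
            "sample",
            {"time": str(int(timestamp)), "value": repr(float(value))},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical data: two one-minute samples of an interface counter.
payload = build_timeseries_xml(
    "switch1.example.net",
    "if_in_octets",
    "bytes",
    [(1310000000, 1200), (1310000060, 1850)],
)
print(payload)
```

A real submitter would send a document like this to the GMOC REST
interface over HTTP or HTTPS, using whatever format and endpoint GMOC
actually publishes.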

Q: How much recently-submitted data can you get via the remote API?
   Can you access the RRDs directly?
A: You can get the most recent 10 minutes of data using the public
   API, but there is no direct access to the RRD files due to issues
   with compatibility and RRD internals.

Q: How much load does data submission add to a client?
A: It probably doesn't increase CPU utilization by more than 5%,
   in GMOC's experience.


Nick Bastin of Stanford University spoke about monitoring in OpenFlow:

Nick advocated the use of SNMP for monitoring, stressing that it
is a proven interface which can be extended to support new things,
and that traps in case of changes are preferable to polling for
changes.  However, OpenFlow uses custom monitoring heavily at
present, because there aren't MIBs for many things which need them.

Dataplane monitoring should probably use SNMP, and they hope to add
SNMP support to the FlowVisor soon.  That said, there are limitations
to allowing direct SNMP to expose information to experimenters,
because the ACLs may not be granular enough for GENI.  However,
SNMP should be used and encouraged by researchers, network operators,
and protocol designers whenever possible.

In response to a question, Nick stressed that OpenFlow shouldn't
change your network monitoring situation: since the gear continues
to work the same way at the hardware level, existing tools should
continue to work.  In addition, gathering statistics via OpenFlow
has approximately the same effect on switch performance as native
SNMP does --- it doesn't cause a performance problem, but it doesn't
solve one either.

The group discussed how to package SNMP-gathered data so that
experimenters and remote operators can use it.  Per-slice ACLs could
be applied at the switch level (using SNMP directly) or at the
display level (using an SNMP proxy to aggregate the data per-slice).
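The display-level option can be sketched as a small aggregation
step.  This is only an illustration under a simplifying assumption:
each (switch, port) pair belongs to at most one slice, which is not
generally true in OpenFlow, where slices are defined by flowspace.
The counter names and the port_to_slice mapping are hypothetical, and
the counters are assumed to have been collected already (for example
by an SNMP poller); no SNMP library is used here.

```python
from collections import defaultdict


def aggregate_by_slice(port_counters, port_to_slice):
    """Sum per-port counters into per-slice totals.

    port_counters: {(switch, port): {"in_octets": int, "out_octets": int}}
    port_to_slice: {(switch, port): slice_name}  (hypothetical mapping)
    Ports with no slice assignment are left out of the result, so a
    slice-level view never exposes them.
    """
    totals = defaultdict(lambda: {"in_octets": 0, "out_octets": 0})
    for key, counters in port_counters.items():
        slice_name = port_to_slice.get(key)
        if slice_name is None:
            continue
        totals[slice_name]["in_octets"] += counters["in_octets"]
        totals[slice_name]["out_octets"] += counters["out_octets"]
    return dict(totals)


# Hypothetical sample data.
port_counters = {
    ("sw1", 1): {"in_octets": 100, "out_octets": 50},
    ("sw1", 2): {"in_octets": 10, "out_octets": 5},
    ("sw2", 1): {"in_octets": 7, "out_octets": 3},
    ("sw2", 9): {"in_octets": 999, "out_octets": 999},  # unassigned port
}
port_to_slice = {
    ("sw1", 1): "sliceA",
    ("sw2", 1): "sliceA",
    ("sw1", 2): "sliceB",
}

totals = aggregate_by_slice(port_counters, port_to_slice)
print(totals)
```

A proxy built this way would hand each experimenter only the entry
for their own slice, which is where the per-slice ACL is enforced.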

At present, FlowVisor provides reasonable statistics for the OpenFlow
control plane via the XMLRPC API, but there is still a big dataplane
monitoring problem.  Thus, the first step for SNMP support in
FlowVisor would be to add MIB-2 data collected and reported by the
FlowVisor about the switches in a slice.  Since FlowVisor is
virtualizing your topology, it needs to expose that in monitoring.
They are actively working on this reporting, but can't promise it
by GEC12.


Sarah Edwards of GPO spoke about monitoring requirements reported by
campuses:

Sarah spoke with a number of GENI participants and asked questions
about what monitoring tools they have, and what they want.  A variety
of requests from operators were raised at GEC10, including a "slice
top" utility to identify slices which are using the most resources.
Operators were also keen to get data which could prove that a problem
was *not* originating at their site.
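The "slice top" request amounts to ranking slices by some resource
measure.  A minimal sketch, assuming per-slice usage figures are
already available from some collector; the slice names, the metric
names (bytes, flows), and the numbers are all made up for
illustration.

```python
def slice_top(usage, key="bytes", limit=3):
    """Return the `limit` slices with the highest value of `key`.

    usage: {slice_name: {metric_name: number}}
    Result: list of (slice_name, metrics) pairs, largest first.
    """
    ranked = sorted(
        usage.items(), key=lambda item: item[1][key], reverse=True
    )
    return ranked[:limit]


# Hypothetical usage snapshot.
usage = {
    "sliceA": {"bytes": 5000, "flows": 12},
    "sliceB": {"bytes": 120000, "flows": 3},
    "sliceC": {"bytes": 800, "flows": 40},
}

for name, metrics in slice_top(usage, key="bytes", limit=2):
    print(name, metrics["bytes"])
```

Ranking by a different key (for example flows) gives a different
ordering, so a real tool would likely let the operator pick the
metric, much as top(1) does for processes.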

She interviewed operators Russ Clark at Georgia Tech, Chaos Golubitsky
at GPO, and Chris Small and John Meylor at Indiana University.  Russ
Clark mentioned that, while live or recent statistics from remote
switches would be helpful, remote SNMP access might be a non-starter
at his site.  Motivated sites could publish and link to local data
sources as one way around this.  The Indiana admins mentioned that
they would like visualization to contain per-campus or per-aggregate
views of collected data.

Sarah interviewed Mark Berman and Niky Riga at GPO to get an
experimenter perspective.  They want to see information about what
resources exist, and which are currently available, for GENI use.
They also want standardized information across sites, and, as a
troubleshooting aid, they want a per-slice view into data.  In
addition, more OpenFlow topology data would be helpful for debugging
and analysis.

Rob Ricci of ProtoGENI and Tony Mack of PlanetLab gave some input
on monitoring tools provided by their aggregates.  ProtoGENI does
not provide a central monitoring solution, counting on slice owners
to select their own monitoring options for their experiments.
However, the graphical experimenter tool Flack has recently been
integrated with Instools, a measurement and collection service,
allowing some easy measurement access for experimenters.  PlanetLab
provides three monitoring interfaces.  CoMon reports per-node (and
per-sliver-per-node) statistics.  PlanetFlow reports on all traffic
flows into and out of a node, and is useful for diagnosing abuse
problems.  The MyOps tool reports on node availability as seen by
PLC, and is targeted at site operators.


The session concluded with a broad discussion about GENI monitoring
topics, with a goal of identifying some next steps.

Srini Seetharaman and Masa Kobayashi of Stanford discussed the tool
they have deployed for measuring OpenFlow performance internals.
It is an active testing tool which uses ping data between nodes and
OpenFlow topology information to detect flow setup problems.  They
encourage interested campuses to help test and deploy it.
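The notes do not describe how the Stanford tool works internally.
As one plausible way ping data could reveal flow setup problems, the
sketch below compares the first ping of a new flow (which may have to
wait for the controller to install a flow entry) against the median
of the following pings.  The heuristic, the 100 ms threshold, and the
node names are assumptions for illustration, and topology information
is not used here.

```python
from statistics import median


def flow_setup_delay_ms(rtts_ms):
    """Estimate flow setup delay from one ping run.

    rtts_ms: list of round-trip times in ms, None for a lost ping.
    Returns first RTT minus the median of the later RTTs, or None if
    the first ping was lost or there is nothing to compare against.
    """
    if len(rtts_ms) < 2 or rtts_ms[0] is None:
        return None
    later = [rtt for rtt in rtts_ms[1:] if rtt is not None]
    if not later:
        return None
    return rtts_ms[0] - median(later)


def find_suspect_pairs(results, threshold_ms=100.0):
    """Return node pairs whose first ping was lost or unusually slow."""
    suspects = []
    for pair, rtts in results.items():
        delay = flow_setup_delay_ms(rtts)
        if delay is None or delay > threshold_ms:
            suspects.append(pair)
    return sorted(suspects)


# Hypothetical measurements between three nodes.
results = {
    ("a", "b"): [250.0, 1.0, 1.2, 0.9],  # slow first packet
    ("a", "c"): [2.0, 1.0, 1.1, 1.0],    # normal
    ("b", "c"): [None, 1.0, 1.0],        # first packet lost
}
print(find_suspect_pairs(results))
```

Combining the suspect pairs with topology data would then help narrow
down which switch or controller path is responsible.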

There was a lot of discussion about what information FlowVisor
currently provides for monitoring.  FlowVisor developer Rob Sherwood
enumerated some features which currently exist.  You can get OpenFlow
message counts, i.e. per-slice/per-dpid/per-type information about
the OpenFlow control messages being sent and received.  FlowVisor
uses modified LLDP to contact its neighbors and look for nearby
FlowVisors, so you can get information about that topology.  However,
if there are non-OpenFlow-controlled devices between FlowVisors,
they will not be visible in this topology.  Stanford is currently
debugging a capability to provide per-slice and per-switch information
about what is in the flowtable, so that the translation between
what the slice requested and what the switch can provide is visible.
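To make the per-slice/per-dpid/per-type message counts concrete,
here is a small sketch of how such counts could be kept and rolled
up.  It does not call the FlowVisor API; the event tuples, slice
names, and dpid strings are hypothetical.  PACKET_IN and FLOW_MOD are
real OpenFlow message types, used here only as sample values.

```python
from collections import Counter


def count_messages(events):
    """Count control messages keyed by (slice, dpid, message type).

    events: iterable of (slice_name, dpid, msg_type) tuples.
    """
    counts = Counter()
    for slice_name, dpid, msg_type in events:
        counts[(slice_name, dpid, msg_type)] += 1
    return counts


def totals_per_slice(counts):
    """Roll the three-level counts up to one total per slice."""
    totals = Counter()
    for (slice_name, _dpid, _msg_type), n in counts.items():
        totals[slice_name] += n
    return totals


# Hypothetical stream of observed control messages.
events = [
    ("sliceA", "00:01", "PACKET_IN"),
    ("sliceA", "00:01", "PACKET_IN"),
    ("sliceA", "00:02", "FLOW_MOD"),
    ("sliceB", "00:01", "FLOW_MOD"),
]

counts = count_messages(events)
print(counts[("sliceA", "00:01", "PACKET_IN")])
print(totals_per_slice(counts))
```

Keeping the full three-part key and aggregating on demand allows the
same data to answer per-switch or per-type questions as well.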

If you want to check the health of a FlowVisor, you can register
for traps (callbacks) when links go up and down.  Also, the API
"ping" call provides general FlowVisor health information, as well
as the version of the FlowVisor.  There are CLI tools which currently
wrap all of this functionality, but no known GUI.

Many people were interested in the question of how privacy and
monitoring will interact in GENI.  During the GEC10 monitoring
breakfast, we proposed a short-term agreement: for the next 18
months (until around November 2012/GEC15), anyone who uses the
mesoscale infrastructure should assume their use is public.  The
goal of this agreement was to take the heat off, so we could implement
and debug things without worrying about short-term privacy breaches.
However, even if we don't have to immediately get privacy right,
we still need to try to build it in and debug it now, while we can.

Nick Bastin commented that if you know what's available in OpenFlow,
then you know what's not available (i.e. what traffic types someone
has reserved).  If you know that, and have direct switch access,
you can break someone else's experiment.  Scott Karlin of Princeton
commented that it's important to think about security and privacy
now --- protecting information by default might be a good idea,
since it's currently prohibitively difficult to figure out whether
each piece of information is security-related.  Karen Sollins of
MIT commented that security needs to be designed in, because users
and sites will be added over time.  Many campuses have strong
restrictions on what data can be collected, and may not allow
deployments which aren't compatible with those.

Matt Zekauskas of Internet2 said that he'd find it useful to have
some publicly available data characterizing OpenFlow use.  Even
if we have to elide the specifics, it's helpful to know usage details
like what ethertypes experimenters tend to reserve.

The question of direct (non-OpenFlow) access to switch data obtained
via SNMP and reported by campuses was raised, but attendees were
unsure whether it would be useful.  A robust SNMP proxy might serve
the same function.  K.C. Wang of Clemson University noted that a
lot of operators have agreed that OpenFlow experimenter opt-in
decisions are very difficult, so having monitoring tools which would
make it easier to determine whether an opt-in was safe would be
useful.  Ivan Seskar of Rutgers made a request for notifications
--- he uses local monitoring for performance data relevant to his
testbed, but wants centralized monitoring to tell him when GENI
things are not working.