GEC11MonitoringMiniWorkshop: MonitoringNotes.txt

File MonitoringNotes.txt, 10.7 KB (added by sedwards@bbn.com, 13 years ago)
Notes from meeting.

1	Monitoring mini workshop notes:
2
3	Sarah Edwards of GPO introduced the session.
4
5	Monitoring fits in the context of a number of other GEC sessions,
6	including most of the campus track sessions. It is related to, but
7	distinct from, the instrumentation and measurement session.
8
9	Why are we here?
10	* We're trying to operate aggregates as if they're production, so
11	we need monitoring tools now.
12	* We should share information about what tools we have and what
13	tools we need, and use those capabilities and requirements to
14	set priorities.
15	* Sites which have signed aggregate provider agreements need to
16	meet operations requirements. Shared monitoring information may
17	make their jobs easier. Sharing data on resources that multiple
18	campuses use prevents campuses from needing to reinvent the
19	wheel. Sharing data also makes joint debugging of shared
20	infrastructure more feasible.
21
22
23	Camilo Viecco of GMOC spoke about GMOC data monitoring options:
24
25	GMOC has three functions:
26	1. To provide a unified view of GENI
27	2. To be an initial point of contact for operations
28	3. To provide monitoring visualization and monitoring APIs
29	They try to be "the place for unified data" in general, with a focus
30	on infrastructure data. GMOC has measurement and alerting tools.
31	The measurement tools are "production-ready", but notifications
32	need more work to be targeted (fewer false negatives and false
33	positives) and actionable (people are correctly notified about
34	things they actually control) before they can be deployed in
35	production.
36
37	The focus of GMOC's measurement tools is on long-term trend analysis.
38	They use SNAPP, ganglia, and Measurement Manager. SNAPP is an SNMP
39	collection package which is very scalable. It is currently monitoring
40	Internet2, NLR, and ProtoGENI backbone switches, as well as Indiana's
41	own switches. SNAPP contains a visualization interface which has
42	support for data tagging --- not just what data is, but metadata
43	about it. Having data submitted with meaningful units has been a
44	big issue for GMOC.
45
46	GMOC collects data actively, and also allows data submission, using
47	a simple XML file for time-series data. They have a generator which
48	uses a configuration file to submit data from a set of RRDs (for
49	integration with ganglia or other RRD-based mechanisms), and uses
50	REST for HTTP or HTTPS submission. To submit data, either use this
51	API and submit data as fast as you like (submitting data multiple
52	times a minute shouldn't stress their interface, though it has not
53	yet been fully tested), or provide SNMP access to GMOC so they can
54	monitor devices directly.
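
As an illustration of the "simple XML file for time-series data"
mentioned above, here is a minimal Python sketch of how a client might
package samples before submitting them. The element and attribute names
(timeseries, sample, source, metric, units, time, value) and the host
name are assumptions made for this example; the notes do not record
GMOC's actual schema or submission URL, so the HTTP/HTTPS step is left
out. Units are carried explicitly because the notes say meaningful
units have been a big issue.

```python
import xml.etree.ElementTree as ET


def build_timeseries_xml(source, metric, units, samples):
    """Serialize (unix_timestamp, value) pairs into a small XML document.

    All element and attribute names here are hypothetical, not GMOC's schema.
    """
    root = ET.Element(
        "timeseries",
        {"source": source, "metric": metric, "units": units},
    )
    for timestamp, value in samples:
        # One <sample> element per data point, with explicit time and value.
        ET.SubElement(
            root,
            "sample",
            {"time": str(int(timestamp)), "value": repr(float(value))},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical host and metric names, used only for illustration.
doc = build_timeseries_xml(
    "switch1.example.edu",
    "if_in_octets",
    "bytes",
    [(1310000000, 1024), (1310000060, 2048)],
)
```

A real submitter would then send the resulting document to whatever
REST endpoint GMOC publishes.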
55
56	Q: How much recently-submitted data can you get via the remote API?
57	Can you access the RRDs directly?
58	A: You can get the most recent 10 minutes of data using the public
59	API, but there is no direct access to the RRD files due to issues
60	with compatibility and RRD internals.
61
62	Q: How much load does data submission add to a client?
63	A: It probably doesn't increase CPU utilization by more than 5%,
64	in GMOC's experience.
65
66
67	Nick Bastin of Stanford University spoke about monitoring in OpenFlow:
68
69	Nick advocated the use of SNMP for monitoring, stressing that it
70	is a proven interface which can be extended to support new things,
71	and that traps in case of changes are preferable to polling for
72	changes. However, OpenFlow uses custom monitoring heavily at
73	present, because there aren't MIBs for many things which need them.
74
75	Dataplane monitoring should probably use SNMP, and they hope to add
76	SNMP support to the FlowVisor soon. That said, there are limitations
77	to allowing direct SNMP to expose information to experimenters,
78	because the ACLs may not be granular enough for GENI. However,
79	SNMP should be used and encouraged by researchers, network operators,
80	and protocol designers whenever possible.
81
82	In response to a question, Nick stressed that OpenFlow shouldn't
83	change your network monitoring situation: since the gear continues
84	to work the same way at the hardware level, existing tools should
85	continue to work. In addition, gathering statistics via OpenFlow
86	has approximately the same effect on switch performance as native
87	SNMP does --- it doesn't cause a performance problem, but it doesn't
88	solve one either.
89
90	The group discussed how to package SNMP-gathered data so that
91	experimenters and remote operators can use it. Per-slice ACLs could
92	be applied at the switch level (using SNMP directly) or at the
93	display level (using an SNMP proxy to aggregate the data per-slice).
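
The display-level option (a proxy that aggregates SNMP data per slice)
can be sketched roughly as follows. This is a simplified illustration,
not a design the group agreed on: it assumes each switch port belongs
to at most one slice, which is not generally true for OpenFlow
flowspace, and the function names and counter field names are invented
for the example.

```python
from collections import defaultdict


def aggregate_per_slice(port_counters, port_to_slice):
    """Sum per-port octet counters into per-slice totals.

    port_counters: {(switch, port): {"in_octets": int, "out_octets": int}}
    port_to_slice: {(switch, port): slice_name}  (hypothetical mapping)
    Ports that are not mapped to a slice are left out of the result.
    """
    totals = defaultdict(lambda: {"in_octets": 0, "out_octets": 0})
    for key, counters in port_counters.items():
        slice_name = port_to_slice.get(key)
        if slice_name is None:
            continue  # unmapped port: never exposed through the proxy
        totals[slice_name]["in_octets"] += counters["in_octets"]
        totals[slice_name]["out_octets"] += counters["out_octets"]
    return dict(totals)


def view_for(slice_name, totals):
    """Return only the requesting slice's totals (the per-slice ACL)."""
    return totals.get(slice_name, {"in_octets": 0, "out_octets": 0})
```

The point of the sketch is that the ACL is enforced by the proxy's
view function, so the switches' own SNMP ACLs do not need to be
slice-aware.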
94
95	At present, FlowVisor provides reasonable statistics for the OpenFlow
96	control plane via the XMLRPC API, but there is still a big dataplane
97	monitoring problem. Thus, the first step for SNMP support in
98	FlowVisor would be to add MIB-2 data collected and reported by the
99	FlowVisor about the switches in a slice. Since FlowVisor is
100	virtualizing your topology, it needs to expose that in monitoring.
101	They are actively working on this reporting, but can't promise it
102	by GEC12.
103
104
105	Sarah Edwards of GPO spoke about monitoring requirements reported by campuses:
106
107	Sarah spoke with a number of GENI participants and asked questions
108	about what monitoring tools they have, and what they want. A variety
109	of requests from operators were raised at GEC10, including a "slice
110	top" utility to identify slices which are using the most resources.
111	Operators were also keen to get data which could prove that a problem
112	was not originating at their site.
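
A "slice top" utility of the kind requested could be as simple as
ranking slices by a usage counter. The sketch below assumes per-slice
usage has already been collected into a dictionary; the field names
(bytes, flows) and slice names are hypothetical.

```python
def slice_top(usage, key="bytes", limit=3):
    """Return the `limit` slices with the highest value of `key`.

    usage: {slice_name: {"bytes": int, "flows": int, ...}}
    Result is a list of (slice_name, stats) pairs, largest first.
    """
    ranked = sorted(usage.items(), key=lambda item: item[1][key], reverse=True)
    return ranked[:limit]


# Hypothetical usage data, for illustration only.
usage = {
    "slice-a": {"bytes": 5000, "flows": 12},
    "slice-b": {"bytes": 90000, "flows": 3},
    "slice-c": {"bytes": 700, "flows": 40},
}
top_by_bytes = slice_top(usage, key="bytes", limit=2)
```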
113
114	She interviewed operators Russ Clark at Georgia Tech, Chaos Golubitsky
115	at GPO, and Chris Small and John Meylor at Indiana University. Russ
116	Clark mentioned that, while live or recent statistics from remote
117	switches would be helpful, remote SNMP access might be a non-starter
118	at his site. Motivated sites could publish and link to local data
119	sources as one way around this. The Indiana admins mentioned that
120	they would like visualization to contain per-campus or per-aggregate
121	views of collected data.
122
123	Sarah interviewed Mark Berman and Niky Riga at GPO to get an
124	experimenter perspective. They want to see information about what
125	resources exist, and which are currently available, for GENI use.
126	They also want standardized information across sites, and, as a
127	troubleshooting aid, they want a per-slice view into data. In
128	addition, more OpenFlow topology data would be helpful for debugging
129	and analysis.
130
131	Rob Ricci of ProtoGENI and Tony Mack of PlanetLab gave some input
132	on monitoring tools provided by their aggregates. ProtoGENI does
133	not provide a central monitoring solution, counting on slice owners
134	to select their own monitoring options for their experiments.
135	However, the graphical experimenter tool Flack has recently been
136	integrated with Instools, a measurement and collection service,
137	allowing some easy measurement access for experimenters. PlanetLab
138	provides three monitoring interfaces: CoMon reports per-node (and
139	per-sliver-per-node) statistics. PlanetFlow reports on all traffic
140	flows into and out of a node, and is useful for diagnosing abuse
141	problems. The MyOps tool reports on node availability as seen by
142	PLC, and is targeted at site operators.
143
144
145	The session concluded with a broad discussion about GENI monitoring
146	topics, with a goal of identifying some next steps.
147
148	Srini Seetharaman and Masa Kobayashi of Stanford discussed the tool
149	they have deployed for measuring OpenFlow performance internals.
150	It is an active testing tool which uses ping data between nodes and
151	OpenFlow topology information to detect flow setup problems. They
152	encourage interested campuses to help test and deploy it.
153
154	There was a lot of discussion about what information FlowVisor
155	currently provides for monitoring. FlowVisor developer Rob Sherwood
156	enumerated some features which currently exist. You can get OpenFlow
157	message counts, i.e. per-slice/per-dpid/per-type information about
158	the OpenFlow control messages being sent and received. FlowVisor
159	uses modified LLDP to contact its neighbors and look for nearby
160	FlowVisors, so you can get information about that topology. However,
161	if there are non-OpenFlow-controlled devices between FlowVisors,
162	they will not be visible in this topology. Stanford is currently
163	debugging a capability to provide per-slice and per-switch information
164	about what is in the flowtable, so that the translation between
165	what the slice requested and what the switch can provide is visible.
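
The per-slice/per-dpid/per-type message counts described above map
naturally onto a counter keyed by a three-part tuple. The sketch below
is not FlowVisor's API or data model; it only illustrates how such
counts could be stored and rolled up along any one dimension. The
slice names and dpids are made up for the example.

```python
from collections import Counter


def count_messages(events):
    """Count control messages keyed by (slice, dpid, message type).

    events: iterable of (slice_name, dpid, msg_type) tuples.
    """
    counts = Counter()
    for slice_name, dpid, msg_type in events:
        counts[(slice_name, dpid, msg_type)] += 1
    return counts


def totals_by(counts, index):
    """Roll counts up along one key position: 0=slice, 1=dpid, 2=type."""
    rolled = Counter()
    for key, n in counts.items():
        rolled[key[index]] += n
    return rolled


# Hypothetical events, for illustration only.
events = [
    ("slice-a", "00:01", "PACKET_IN"),
    ("slice-a", "00:01", "PACKET_IN"),
    ("slice-a", "00:02", "FLOW_MOD"),
    ("slice-b", "00:01", "FLOW_MOD"),
]
counts = count_messages(events)
per_slice = totals_by(counts, 0)
per_type = totals_by(counts, 2)
```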
166
167	If you want to check the health of a FlowVisor, you can register
168	for traps (callbacks) when links go up and down. Also, the API
169	"ping" call provides general FlowVisor health information, as well
170	as the version of the FlowVisor. There are CLI tools which currently
171	wrap all of this functionality, but no known GUI.
172
173	Many people were interested in the question of how privacy and
174	monitoring will interact in GENI. During the GEC10 monitoring
175	breakfast, we proposed a short-term agreement: for the next 18
176	months (until around November 2012/GEC15), anyone who uses the
177	mesoscale infrastructure should assume their use is public. The
178	goal of this agreement was to take the heat off, so we could implement
179	and debug things without worrying about short-term privacy breaches.
180	However, even if we don't have to immediately get privacy right,
181	we still need to try to build it in and debug it now, while we can.
182
183	Nick Bastin commented that if you know what's available in OpenFlow,
184	then you know what's not available (i.e. what traffic types someone
185	has reserved). If you know that, and have direct switch access,
186	you can break someone else's experiment. Scott Karlin of Princeton
187	commented that it's important to think about security and privacy
188	now --- protecting information by default might be a good idea,
189	since it's currently prohibitively difficult to figure out whether
190	each piece of information is security-related. Karen Sollins of
191	MIT commented that security needs to be designed in, because users
192	and sites will be added over time. Many campuses have strong
193	restrictions on what data can be collected, and may not allow
194	deployments which aren't compatible with those.
195
196	Matt Zekauskas of Internet2 said that he'd find it useful to have
197	some publicly available data characterizing OpenFlow use. Even
198	if we have to elide the specifics, it's helpful to know usage details
199	like what ethertypes experimenters tend to reserve.
200
201	The question of direct (non-OpenFlow) access to switch data obtained
202	via SNMP and reported by campuses was raised, but attendees were
203	unsure whether it would be useful. A robust SNMP proxy might serve
204	the same function. K.C. Wang of Clemson University noted that a
205	lot of operators have agreed that OpenFlow experimenter opt-in
206	decisions are very difficult, so having monitoring tools which would
207	make it easier to determine whether an opt-in was safe would be
208	useful. Ivan Seskar of Rutgers made a request for notifications
209	--- he uses local monitoring for performance data relevant to his
210	testbed, but wants centralized monitoring to tell him when GENI
211	things are not working.
212
