Monitoring mini workshop notes:

Sarah Edwards of GPO introduced the session.

Monitoring fits in the context of a number of other GEC sessions,
including most of the campus track sessions. It is related to, but
distinct from, the instrumentation and measurement session.

Why are we here?
  * We're trying to operate aggregates as if they're production, so
    we need monitoring tools now.
  * We should share information about what tools we have and what
    tools we need, and use those capabilities and requirements to
    set priorities.
  * Sites which have signed aggregate provider agreements need to
    meet operations requirements. Shared monitoring information may
    make their jobs easier. Sharing data on resources that multiple
    campuses use prevents campuses from needing to reinvent the
    wheel. Sharing data also makes joint debugging of shared
    infrastructure more feasible.


Camilo Viecco of GMOC spoke about GMOC data monitoring options:

GMOC has three functions:
  1. To provide a unified view of GENI
  2. To be an initial point of contact for operations
  3. To provide monitoring visualization and monitoring APIs
They try to be "the place for unified data" in general, with a focus
on infrastructure data. GMOC has measurement and alerting tools.
The measurement tools are "production-ready", but notifications
need more work to be targeted (fewer false negatives and false
positives) and actionable (people are correctly notified about
things they actually control) before they can be deployed in
production.

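As a rough illustration of what "actionable" could mean here, the
sketch below routes each alert only to the contact who controls the
affected device, and sets aside alerts nobody is known to own. The
device names, contact addresses, and the ownership table are all
hypothetical; this is not a description of GMOC's alerting tools.

```python
def route_alerts(alerts, owners):
    """Group alerts by the contact who controls the affected device.

    Alerts for devices with no known owner are returned separately
    instead of being sent to someone who cannot act on them.
    """
    routed, unowned = {}, []
    for device, message in alerts:
        contact = owners.get(device)
        if contact is None:
            unowned.append((device, message))
        else:
            routed.setdefault(contact, []).append((device, message))
    return routed, unowned


# Hypothetical ownership table and alerts, for illustration only.
owners = {
    "sw1.campus-a.example": "noc@campus-a.example",
    "sw2.campus-b.example": "noc@campus-b.example",
}
alerts = [
    ("sw1.campus-a.example", "port 3 down"),
    ("sw2.campus-b.example", "high CPU"),
    ("sw9.unknown.example", "unreachable"),
]
routed, unowned = route_alerts(alerts, owners)
print(routed)
print(unowned)
```
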
The focus of GMOC's measurement tools is on long-term trend analysis.
They use SNAPP, ganglia, and Measurement Manager. SNAPP is an SNMP
collection package which is very scalable. It is currently monitoring
Internet2, NLR, and ProtoGENI backbone switches, as well as Indiana's
own switches. SNAPP contains a visualization interface which has
support for data tagging --- not just what data is, but metadata
about it. Having data submitted with meaningful units has been a
big issue for GMOC.

GMOC collects data actively, and also allows data submission, using
a simple XML file for time-series data. They have a generator which
uses a configuration file to submit data from a set of RRDs (for
integration with ganglia or other RRD-based mechanisms), and uses
REST for HTTP or HTTPS submission. To submit data, either use this
API and submit data as fast as you like (submitting data multiple
times a minute shouldn't stress their interface, though it has not
yet been fully tested), or provide SNMP access to GMOC so they can
monitor devices directly.

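A minimal sketch of what submitting time-series data as XML over
HTTPS might look like. The XML element and attribute names, the use
of POST, and the URL are assumptions made for illustration; the
notes do not give GMOC's actual schema or endpoint. Units are
carried explicitly, since data submitted without meaningful units
was noted as a problem.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_timeseries_xml(node, metric, units, samples):
    """Serialize (unix_time, value) samples into a small XML document.

    Element and attribute names are invented for this sketch; they
    are not GMOC's real format.
    """
    root = ET.Element("timeseries", {"node": node, "metric": metric, "units": units})
    for timestamp, value in samples:
        ET.SubElement(root, "sample", {"time": str(timestamp), "value": str(value)})
    return ET.tostring(root, encoding="utf-8")


def build_submission(url, payload):
    """Prepare (but do not send) an HTTP POST carrying the XML payload."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


payload = build_timeseries_xml(
    "switch1.example.net",
    "ifInOctets",
    "bytes",
    [(1300000000, 1024), (1300000060, 2048)],
)
# Placeholder URL, not a real GMOC endpoint.
request = build_submission("https://gmoc.example.net/submit", payload)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(payload.decode("utf-8"))
```
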
Q: How much recently-submitted data can you get via the remote API?
   Can you access the RRDs directly?
A: You can get the most recent 10 minutes of data using the public
   API, but there is no direct access to the RRD files due to issues
   with compatibility and RRD internals.

Q: How much load does data submission add to a client?
A: It probably doesn't increase CPU utilization by more than 5%,
   in GMOC's experience.


Nick Bastin of Stanford University spoke about monitoring in OpenFlow:

Nick advocated the use of SNMP for monitoring, stressing that it
is a proven interface which can be extended to support new things,
and that traps in case of changes are preferable to polling for
changes. However, OpenFlow uses custom monitoring heavily at
present, because there aren't MIBs for many things which need them.

Dataplane monitoring should probably use SNMP, and they hope to add
SNMP support to the FlowVisor soon. That said, there are limitations
to allowing direct SNMP to expose information to experimenters,
because the ACLs may not be granular enough for GENI. However,
SNMP should be used and encouraged by researchers, network operators,
and protocol designers whenever possible.

In response to a question, Nick stressed that OpenFlow shouldn't
change your network monitoring situation: since the gear continues
to work the same way at the hardware level, existing tools should
continue to work. In addition, gathering statistics via OpenFlow
has approximately the same effect on switch performance as native
SNMP does --- it doesn't cause a performance problem, but it doesn't
solve one either.

The group discussed how to package SNMP-gathered data so that
experimenters and remote operators can use it. Per-slice ACLs could
be applied at the switch level (using SNMP directly) or at the
display level (using an SNMP proxy to aggregate the data per-slice).

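The display-level option can be sketched as follows: a proxy holds
counters already polled from the switches, sums them per slice, and
shows each requester only its own slice. The port-to-slice table is
a hypothetical simplification (real OpenFlow slices are defined by
flowspace and can share ports), and the counter values are made up.

```python
from collections import defaultdict

# Hypothetical mapping of (switch, port) to the slice that owns it.
PORT_OWNER = {
    ("sw1", 1): "slice-a",
    ("sw1", 2): "slice-b",
    ("sw2", 1): "slice-a",
}


def aggregate_per_slice(counters, port_owner):
    """Sum per-port byte counters into per-slice totals."""
    totals = defaultdict(int)
    for key, octets in counters.items():
        owner = port_owner.get(key)
        if owner is not None:  # ports outside any slice are never exposed
            totals[owner] += octets
    return dict(totals)


def view_for(slice_name, counters, port_owner):
    """Return only the total that the given slice is allowed to see."""
    totals = aggregate_per_slice(counters, port_owner)
    return {slice_name: totals.get(slice_name, 0)}


# Counters as the proxy might hold them after polling (values invented).
polled = {("sw1", 1): 500, ("sw1", 2): 70, ("sw2", 1): 250, ("sw2", 9): 999}
print(view_for("slice-a", polled, PORT_OWNER))
```

Filtering in the proxy keeps the switch's own SNMP ACLs coarse,
which matters if those ACLs are not granular enough for GENI.
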
At present, FlowVisor provides reasonable statistics for the OpenFlow
control plane via the XMLRPC API, but there is still a big dataplane
monitoring problem. Thus, the first step for SNMP support in
FlowVisor would be to add MIB-2 data collected and reported by the
FlowVisor about the switches in a slice. Since FlowVisor is
virtualizing your topology, it needs to expose that in monitoring.
They are actively working on this reporting, but can't promise it
by GEC12.


Sarah Edwards of GPO spoke about monitoring requirements reported by
campuses:

Sarah spoke with a number of GENI participants and asked questions
about what monitoring tools they have, and what they want. A variety
of requests from operators were raised at GEC10, including a "slice
top" utility to identify slices which are using the most resources.
Operators were also keen to get data which could prove that a problem
was *not* originating at their site.

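The "slice top" request amounts to ranking slices by some usage
metric. A minimal sketch, assuming per-slice usage has already been
collected into a dictionary; the slice names, metrics, and numbers
are invented for illustration.

```python
def slice_top(usage, metric="bytes", limit=3):
    """Return up to `limit` (slice, value) pairs, highest `metric` first."""
    ranked = sorted(usage.items(), key=lambda item: item[1][metric], reverse=True)
    return [(name, stats[metric]) for name, stats in ranked[:limit]]


# Hypothetical per-slice usage figures.
usage = {
    "slice-a": {"bytes": 1200, "flows": 14},
    "slice-b": {"bytes": 98000, "flows": 3},
    "slice-c": {"bytes": 450, "flows": 40},
}
for name, value in slice_top(usage, metric="bytes", limit=2):
    print(f"{name:10s} {value:>8d}")
```
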
She interviewed operators Russ Clark at Georgia Tech, Chaos Golubitsky
at GPO, and Chris Small and John Meylor at Indiana University. Russ
Clark mentioned that, while live or recent statistics from remote
switches would be helpful, remote SNMP access might be a non-starter
at his site. Motivated sites could publish and link to local data
sources as one way around this. The Indiana admins mentioned that
they would like visualization to contain per-campus or per-aggregate
views of collected data.

Sarah interviewed Mark Berman and Niky Riga at GPO to get an
experimenter perspective. They want to see information about what
resources exist, and which are currently available, for GENI use.
They also want standardized information across sites, and, as a
troubleshooting aid, they want a per-slice view into data. In
addition, more OpenFlow topology data would be helpful for debugging
and analysis.

Rob Ricci of ProtoGENI and Tony Mack of PlanetLab gave some input
on monitoring tools provided by their aggregates. ProtoGENI does
not provide a central monitoring solution, counting on slice owners
to select their own monitoring options for their experiments.
However, the graphical experimenter tool Flack has recently been
integrated with Instools, a measurement and collection service,
allowing some easy measurement access for experimenters. PlanetLab
provides three monitoring interfaces. CoMon reports per-node (and
per-sliver-per-node) statistics. PlanetFlow reports on all traffic
flows into and out of a node, and is useful for diagnosing abuse
problems. MyOps reports on node availability as seen by PLC, and is
targeted at site operators.


The session concluded with a broad discussion about GENI monitoring
topics, with a goal of identifying some next steps.

Srini Seetharaman and Masa Kobayashi of Stanford discussed the tool
they have deployed for measuring OpenFlow performance internals.
It is an active testing tool which uses ping data between nodes and
OpenFlow topology information to detect flow setup problems. They
encourage interested campuses to help test and deploy it.

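The notes do not describe how the Stanford tool combines its two
inputs. One plausible approach, sketched here purely as an
assumption, is to map each ping to the switches on its path and
flag switches that appear only on failing paths. The hosts,
switches, and results below are invented.

```python
def suspect_switches(paths, ping_ok):
    """Rank switches by how many failing pings cross them.

    A switch that also carries at least one successful ping is treated
    as cleared. This is a simple heuristic, not the Stanford tool's
    actual algorithm.
    """
    cleared = set()
    for pair, ok in ping_ok.items():
        if ok:
            cleared.update(paths[pair])
    suspects = {}
    for pair, ok in ping_ok.items():
        if not ok:
            for switch in paths[pair]:
                if switch not in cleared:
                    suspects[switch] = suspects.get(switch, 0) + 1
    return sorted(suspects.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical topology: the switches each host pair's traffic crosses.
paths = {
    ("h1", "h2"): ["s1", "s2"],
    ("h1", "h3"): ["s1", "s3"],
    ("h2", "h3"): ["s2", "s3"],
}
# Hypothetical ping outcomes for the same host pairs.
ping_ok = {("h1", "h2"): True, ("h1", "h3"): False, ("h2", "h3"): False}
print(suspect_switches(paths, ping_ok))
```
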
There was a lot of discussion about what information FlowVisor
currently provides for monitoring. FlowVisor developer Rob Sherwood
enumerated some features which currently exist. You can get OpenFlow
message counts, i.e. per-slice/per-dpid/per-type information about
the OpenFlow control messages being sent and received. FlowVisor
uses modified LLDP to contact its neighbors and look for nearby
FlowVisors, so you can get information about that topology. However,
if there are non-OpenFlow-controlled devices between FlowVisors,
they will not be visible in this topology. Stanford is currently
debugging a capability to provide per-slice and per-switch information
about what is in the flowtable, so that the translation between
what the slice requested and what the switch can provide is visible.

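To make the per-slice/per-dpid/per-type breakdown concrete, here is
a small sketch of how such counts can be stored and rolled up along
one dimension. It illustrates the shape of the data only; it does
not reproduce FlowVisor's internal structures or its API output,
and the slice names and dpids are made up.

```python
from collections import Counter


def tally(messages):
    """Count control messages keyed by (slice, dpid, message type)."""
    counts = Counter()
    for slice_name, dpid, msg_type in messages:
        counts[(slice_name, dpid, msg_type)] += 1
    return counts


def rollup(counts, index):
    """Collapse counts onto one dimension: 0=slice, 1=dpid, 2=type."""
    out = Counter()
    for key, n in counts.items():
        out[key[index]] += n
    return out


# Invented sample of observed control messages.
messages = [
    ("slice-a", "00:01", "PACKET_IN"),
    ("slice-a", "00:01", "PACKET_IN"),
    ("slice-a", "00:02", "FLOW_MOD"),
    ("slice-b", "00:01", "FLOW_MOD"),
]
counts = tally(messages)
print(rollup(counts, 0))  # totals per slice
print(rollup(counts, 2))  # totals per message type
```
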
If you want to check the health of a FlowVisor, you can register
for traps (callbacks) when links go up and down. Also, the API
"ping" call provides general FlowVisor health information, as well
as the version of the FlowVisor. There are CLI tools which currently
wrap all of this functionality, but no known GUI.

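A hedged sketch of a health check built on the "ping" call. The
XMLRPC method name `api.ping`, its single string argument, and the
reply text are assumptions; the notes only say that a "ping" call
exists. A stand-in object is used so the sketch runs without a
FlowVisor, and the commented-out URL and credentials are
placeholders.

```python
import xmlrpc.client


def flowvisor_health(proxy, token="health-check"):
    """Call the ping method and report whether FlowVisor answered.

    `proxy` is any object exposing `api.ping(str)`. The method name
    and reply format are assumptions, not confirmed by the notes.
    """
    try:
        reply = proxy.api.ping(token)
    except (OSError, xmlrpc.client.Error) as exc:
        return {"healthy": False, "detail": str(exc)}
    return {"healthy": True, "detail": str(reply)}


class _FakeApi:
    def ping(self, token):
        # Invented reply text; a real FlowVisor may answer differently.
        return "PONG: version=example::" + token


class FakeFlowVisor:
    """Stand-in used so the sketch runs without a real FlowVisor."""
    api = _FakeApi()


result = flowvisor_health(FakeFlowVisor())
print(result)
# Against a real server it might look like this (placeholder URL):
# proxy = xmlrpc.client.ServerProxy("https://user:password@flowvisor.example.net:8080/xmlrpc")
# print(flowvisor_health(proxy))
```
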
Many people were interested in the question of how privacy and
monitoring will interact in GENI. During the GEC10 monitoring
breakfast, we proposed a short-term agreement: for the next 18
months (until around November 2012/GEC15), anyone who uses the
mesoscale infrastructure should assume their use is public. The
goal of this agreement was to take the heat off, so we could implement
and debug things without worrying about short-term privacy breaches.
However, even if we don't have to immediately get privacy right,
we still need to try to build it in and debug it now, while we can.

Nick Bastin commented that if you know what's available in OpenFlow,
then you know what's not available (i.e. what traffic types someone
has reserved). If you know that, and have direct switch access,
you can break someone else's experiment. Scott Karlin of Princeton
commented that it's important to think about security and privacy
now --- protecting information by default might be a good idea,
since it's currently prohibitively difficult to figure out whether
each piece of information is security-related. Karen Sollins of
MIT commented that security needs to be designed in, because users
and sites will be added over time. Many campuses have strong
restrictions on what data can be collected, and may not allow
deployments which aren't compatible with those.

Matt Zekauskas of Internet2 said that he'd find it useful to have
some publicly available data characterizing OpenFlow use. Even
if we have to elide the specifics, it's helpful to know usage details
like what ethertypes experimenters tend to reserve.

The question of direct (non-OpenFlow) access to switch data obtained
via SNMP and reported by campuses was raised, but attendees were
unsure whether it would be useful. A robust SNMP proxy might serve
the same function. K.C. Wang of Clemson University noted that a
lot of operators have agreed that OpenFlow experimenter opt-in
decisions are very difficult, so having monitoring tools which would
make it easier to determine whether an opt-in was safe would be
useful. Ivan Seskar of Rutgers made a request for notifications
--- he uses local monitoring for performance data relevant to his
testbed, but wants centralized monitoring to tell him when GENI
things are not working.