Monitoring mini workshop notes

Sarah Edwards of GPO introduced the session. Monitoring fits in the context of a number of other GEC sessions, including most of the campus track sessions. It is related to, but distinct from, the instrumentation and measurement session.

Why are we here?

* We're trying to operate aggregates as if they're production, so we need monitoring tools now.
* We should share information about what tools we have and what tools we need, and use those capabilities and requirements to set priorities.
* Sites which have signed aggregate provider agreements need to meet operations requirements. Shared monitoring information may make their jobs easier.

Sharing data on resources that multiple campuses use keeps campuses from reinventing the wheel, and it also makes joint debugging of shared infrastructure more feasible.

Camilo Viecco of GMOC spoke about GMOC data monitoring options.

GMOC has three functions:

1. To provide a unified view of GENI.
2. To be an initial point of contact for operations.
3. To provide monitoring visualization and monitoring APIs.

They try to be "the place for unified data" in general, with a focus on infrastructure data. GMOC has measurement and alerting tools. The measurement tools are "production-ready", but notifications need more work to become targeted (fewer false negatives and false positives) and actionable (people are correctly notified about things they actually control) before they can be deployed in production.

GMOC's measurement tools focus on long-term trend analysis. They use SNAPP, Ganglia, and Measurement Manager. SNAPP is a highly scalable SNMP collection package; it currently monitors Internet2, NLR, and ProtoGENI backbone switches, as well as Indiana's own switches. SNAPP includes a visualization interface with support for data tagging: not just the data itself, but metadata about it. Having data submitted with meaningful units has been a big issue for GMOC.
GMOC collects data actively, and also accepts data submission, using a simple XML file format for time-series data. They have a generator which uses a configuration file to submit data from a set of RRDs (for integration with Ganglia or other RRD-based mechanisms), and uses REST for HTTP or HTTPS submission. To submit data, either use this API and submit data as often as you like (submitting data multiple times a minute shouldn't stress their interface, though this has not yet been fully tested), or provide SNMP access to GMOC so they can monitor devices directly.

Q: How much recently-submitted data can you get via the remote API? Can you access the RRDs directly?
A: You can get the most recent 10 minutes of data using the public API, but there is no direct access to the RRD files, due to issues with compatibility and RRD internals.

Q: How much load does data submission add to a client?
A: In GMOC's experience, it probably doesn't increase CPU utilization by more than 5%.

Nick Bastin of Stanford University spoke about monitoring in OpenFlow.

Nick advocated the use of SNMP for monitoring, stressing that it is a proven interface which can be extended to support new things, and that receiving traps when things change is preferable to polling for changes. However, OpenFlow relies heavily on custom monitoring at present, because MIBs don't exist for many things which need them. Dataplane monitoring should probably use SNMP, and they hope to add SNMP support to the FlowVisor soon. That said, there are limits to exposing information to experimenters via direct SNMP, because the ACLs may not be granular enough for GENI. Still, SNMP should be used and encouraged by researchers, network operators, and protocol designers whenever possible.

In response to a question, Nick stressed that OpenFlow shouldn't change your network monitoring situation: since the gear continues to work the same way at the hardware level, existing tools should continue to work.
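As a rough illustration of GMOC's submission interface described earlier, a client might render time-series samples as a small XML document and POST it over HTTPS. The XML schema, element names, and endpoint URL below are invented for illustration (the actual GMOC format was not specified in the session); note the explicit units, since GMOC reported that data submitted without meaningful units has been a recurring problem.

```python
# Hypothetical sketch of a GMOC-style time-series submission over REST.
# Schema, field names, and endpoint are assumptions, not GMOC's real API.
import urllib.request
import xml.etree.ElementTree as ET

def build_timeseries_xml(metric, unit, samples):
    """Render (timestamp, value) samples as a small XML document.

    Units are carried explicitly so the collector can interpret the data.
    """
    root = ET.Element("timeseries", metric=metric, unit=unit)
    for ts, value in samples:
        ET.SubElement(root, "sample", t=str(int(ts)), v=str(value))
    return ET.tostring(root, encoding="unicode")

def submit(xml_doc, endpoint="https://gmoc.example.org/submit"):
    """POST the XML document to a (hypothetical) collection endpoint."""
    req = urllib.request.Request(
        endpoint, data=xml_doc.encode(), method="POST",
        headers={"Content-Type": "application/xml"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

doc = build_timeseries_xml("in_octets", "bytes", [(0, 1200), (60, 3400)])
print(doc)
```

A generator like GMOC's RRD integration would simply loop over RRD exports, call `build_timeseries_xml` per metric, and `submit` the results on a timer.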
In addition, gathering statistics via OpenFlow has approximately the same effect on switch performance as native SNMP does: it doesn't cause a performance problem, but it doesn't solve one either.

The group discussed how to package SNMP-gathered data so that experimenters and remote operators can use it. Per-slice ACLs could be applied at the switch level (using SNMP directly) or at the display level (using an SNMP proxy to aggregate the data per slice). At present, FlowVisor provides reasonable statistics for the OpenFlow control plane via its XMLRPC API, but there is still a big dataplane monitoring problem. Thus, the first step for SNMP support in FlowVisor would be to add MIB-2 data collected and reported by the FlowVisor about the switches in a slice. Since FlowVisor virtualizes your topology, it needs to expose that in monitoring. They are actively working on this reporting, but can't promise it by GEC12.

Sarah Edwards of GPO spoke about monitoring requirements reported by campuses.

Sarah spoke with a number of GENI participants and asked what monitoring tools they have, and what they want. A variety of requests from operators were raised at GEC10, including a "slice top" utility to identify the slices using the most resources. Operators were also keen to get data which could prove that a problem was *not* originating at their site.

She interviewed operators Russ Clark at Georgia Tech, Chaos Golubitsky at GPO, and Chris Small and John Meylor at Indiana University. Russ Clark mentioned that, while live or recent statistics from remote switches would be helpful, remote SNMP access might be a non-starter at his site; motivated sites could publish and link to local data sources as one way around this. The Indiana admins said they would like visualizations to include per-campus or per-aggregate views of collected data.

Sarah also interviewed Mark Berman and Niky Riga at GPO to get an experimenter perspective.
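The "display level" option discussed above, an SNMP proxy that aggregates already-gathered data per slice, can be sketched in miniature. The data layout and the slice-to-port mapping here are invented for illustration; a real proxy would draw its ACLs from the aggregate's reservation records.

```python
# Minimal sketch of a per-slice filtering proxy over SNMP-gathered data.
# All names and structures are hypothetical, for illustration only.

# Raw interface counters as gathered from switches, keyed by (switch, port).
raw_counters = {
    ("switch-a", 1): {"in_octets": 1200, "out_octets": 800},
    ("switch-a", 2): {"in_octets": 50, "out_octets": 75},
    ("switch-b", 1): {"in_octets": 9000, "out_octets": 100},
}

# Which slice owns which (switch, port): the proxy's per-slice ACL.
slice_ports = {
    "slice-red": {("switch-a", 1), ("switch-b", 1)},
    "slice-blue": {("switch-a", 2)},
}

def view_for_slice(slice_name):
    """Return only the counters this slice is allowed to see."""
    allowed = slice_ports.get(slice_name, set())
    return {key: stats for key, stats in raw_counters.items() if key in allowed}

print(view_for_slice("slice-blue"))
```

The same filtering logic would work whether the ACL is enforced at the switch (SNMP views) or at the proxy; the proxy variant avoids the granularity limits of switch ACLs that Nick raised.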
They want to see information about what resources exist, and which are currently available, for GENI use. They also want standardized information across sites, and, as a troubleshooting aid, a per-slice view into data. In addition, more OpenFlow topology data would be helpful for debugging and analysis.

Rob Ricci of ProtoGENI and Tony Mack of PlanetLab gave some input on the monitoring tools their aggregates provide.

ProtoGENI does not provide a central monitoring solution, counting on slice owners to select their own monitoring options for their experiments. However, the graphical experimenter tool Flack has recently been integrated with Instools, a measurement and collection service, giving experimenters some easy measurement access.

PlanetLab provides three monitoring interfaces: CoMon reports per-node (and per-sliver-per-node) statistics; PlanetFlow reports on all traffic flows into and out of a node, and is useful for diagnosing abuse problems; and the MyOps tool reports on node availability as seen by PLC, and is targeted at site operators.

The session concluded with a broad discussion of GENI monitoring topics, with a goal of identifying next steps.

Srini Seetharaman and Masa Kobayashi of Stanford discussed the tool they have deployed for measuring OpenFlow performance internals. It is an active testing tool which uses ping data between nodes, together with OpenFlow topology information, to detect flow setup problems. They encourage interested campuses to help test and deploy it.

There was a lot of discussion about what information FlowVisor currently provides for monitoring. FlowVisor developer Rob Sherwood enumerated some features which exist today. You can get OpenFlow message counts, i.e. per-slice/per-dpid/per-type information about the OpenFlow control messages being sent and received. FlowVisor uses modified LLDP to contact its neighbors and look for nearby FlowVisors, so you can get information about that topology.
However, if there are non-OpenFlow-controlled devices between FlowVisors, they will not be visible in this topology.

Stanford is currently debugging a capability to provide per-slice and per-switch information about what is in the flowtable, so that the translation between what the slice requested and what the switch can provide is visible. If you want to check the health of a FlowVisor, you can register for traps (callbacks) when links go up and down. In addition, the API "ping" call reports general FlowVisor health, as well as the FlowVisor version. CLI tools currently wrap all of this functionality, but there is no known GUI.

Many people were interested in how privacy and monitoring will interact in GENI. During the GEC10 monitoring breakfast, we proposed a short-term agreement: for the next 18 months (until around November 2012/GEC15), anyone who uses the mesoscale infrastructure should assume their use is public. The goal of this agreement was to take the heat off, so we could implement and debug things without worrying about short-term privacy breaches. However, even if we don't have to get privacy right immediately, we still need to build it in and debug it now, while we can.

Nick Bastin commented that if you know what's available in OpenFlow, then you know what's not available (i.e. what traffic types someone has reserved); knowing that, and having direct switch access, you can break someone else's experiment. Scott Karlin of Princeton commented that it's important to think about security and privacy now: protecting information by default might be a good idea, since it's currently prohibitively difficult to figure out whether each piece of information is security-related. Karen Sollins of MIT commented that security needs to be designed in, because users and sites will be added over time.
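The per-slice/per-dpid/per-type message counts that Rob Sherwood described amount to a three-key tally over control traffic. FlowVisor itself is written in Java; this small Python structure only illustrates the reported breakdown, with slice names, dpids, and message types invented for the example.

```python
# Illustrative bookkeeping behind FlowVisor-style message counts:
# OpenFlow control messages tallied per slice, per datapath ID (dpid),
# and per message type. Not FlowVisor's actual implementation.
from collections import Counter

counts = Counter()

def record(slice_name, dpid, msg_type):
    """Tally one OpenFlow control message."""
    counts[(slice_name, dpid, msg_type)] += 1

# Simulated control traffic.
record("slice-red", "00:00:00:00:00:01", "PACKET_IN")
record("slice-red", "00:00:00:00:00:01", "PACKET_IN")
record("slice-red", "00:00:00:00:00:01", "FLOW_MOD")
record("slice-blue", "00:00:00:00:00:01", "PACKET_IN")

print(counts[("slice-red", "00:00:00:00:00:01", "PACKET_IN")])  # -> 2
```

A "slice top" utility of the kind operators requested at GEC10 could be built directly on such counters, by sorting slices by their total message volume.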
Many campuses have strong restrictions on what data can be collected, and may not allow deployments which aren't compatible with those restrictions.

Matt Zekauskas of Internet2 said that he'd find it useful to have some publicly available data characterizing OpenFlow use. Even if the specifics have to be elided, it's helpful to know usage details such as which ethertypes experimenters tend to reserve.

The question of direct (non-OpenFlow) access to switch data obtained via SNMP and reported by campuses was raised, but attendees were unsure whether it would be useful; a robust SNMP proxy might serve the same function.

K.C. Wang of Clemson University noted that many operators find OpenFlow experimenter opt-in decisions very difficult, so monitoring tools which make it easier to determine whether an opt-in is safe would be useful. Ivan Seskar of Rutgers requested notifications: he uses local monitoring for performance data relevant to his testbed, but wants centralized monitoring to tell him when GENI things are not working.
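As a closing illustration, the active-testing approach Srini and Masa described, combining pairwise ping results with OpenFlow topology information to detect flow setup problems, might reduce to a check like the following. The data structures and logic are guesses for illustration; the Stanford tool's actual inputs were not described in detail.

```python
# Hypothetical sketch of active flow-setup testing: flag topology links
# whose endpoints cannot ping each other. Structures are invented.

# Links known from OpenFlow topology discovery, as (node, node) pairs.
topology_links = [("h1", "h2"), ("h2", "h3"), ("h1", "h3")]

# Result of an active ping sweep between endpoint pairs.
ping_ok = {("h1", "h2"): True, ("h2", "h3"): False, ("h1", "h3"): True}

def suspect_links(links, results):
    """Flag links the topology says should work but pings say don't.

    A failed ping across such a link suggests a flow setup problem,
    e.g. the controller never installed the needed flow entries.
    """
    return [link for link in links if not results.get(link, False)]

print(suspect_links(topology_links, ping_ok))  # -> [('h2', 'h3')]
```

Pairs missing from the ping results are conservatively flagged too, so an incomplete sweep surfaces as suspect links rather than silent gaps.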