INSTOOLS Project Final Report

June 2016

I. Major accomplishments

We finished all the milestones stated in the contract and its modifications. The following highlights our accomplishments for the project.

A. Milestones achieved

During the project period we achieved the following milestones:

  • (Tasks 1 and 2) We upgraded our Edulab facility including installation and incorporation of new hardware into the Edulab system. We also upgraded the University’s Internet 2 network connection. We installed the latest copy of the Utah ProtoGENI code (i.e., the aggregate manager) on the Kentucky Edulab system resulting in a local aggregate implementation. We integrated the Kentucky aggregate with the ProtoGENI Clearinghouse to become a ProtoGENI edge cluster.
  • (Task 3) We implemented the instrumentation and measurement system using the ProtoGENI API. We integrated our existing monitoring and measurement tools into ProtoGENI by designing a measurement system architecture that provides a separate measurement controller for each experiment.
  • (Task 5) We continuously maintained and updated the instrumentation and measurement system. We developed a new web interface to the measurement system based on the Drupal content management system and implemented a new monitoring tool to capture NetFlow data. We developed the feature to support virtual nodes and provide a VNC interface to experiment nodes. We developed a portal that provides a single point of entry/control across various aggregates, and an archival system with features to archive data for future use and analysis.
  • (Tasks 4 and 6) We worked with the Utah group and the GENI security team while designing our instrumentation and measurement system, discussing methods for authentication and secure access to slice and measurement resources. In the process we provided Utah with feedback and changes needed to make the code work on our aggregate. We worked with the Utah group to integrate the INSTOOLS software into the FLACK client/GUI.
  • (Milestones s3.b, s3.d, s3.f) We provided continual support for the operation of the Kentucky aggregate. We made the stable version of the project's software available for release. We identified GENI experiments that will use the instrumentation tools.
  • (Milestones s3.a, s3.c, s3.e) We gave tutorials and demos of the instrumentation tools at GECs to experimenters and educators. We used the ProtoGENI and INSTOOLS in the networking and operating systems classes.

B. Deliverables made

  • We made the stable version of our software available via the ProtoGENI GIT repository.

II. Description of work performed during the project period

The following provides a description of our activities and results for the project.

A. Activities and findings

Our activities and findings can be grouped into the following six areas.

1. Building the Kentucky Aggregate.

We upgraded our existing Edulab facility and then transformed it into a ProtoGENI edge cluster (aggregate). We began by moving our Edulab cluster into a larger room that would better accommodate the additional machines that were to be incorporated into the system. We then fixed and upgraded the existing Edulab hardware in preparation for its conversion to the ProtoGENI system. We also purchased and installed 24 new PCs and integrated them with the existing 47 machines to bring the total number of PCs to just over 70. We installed extra network interfaces in each PC so that each PC has one interface on the control network and an additional 4 interfaces on the experimental network. We installed 5 existing and newly purchased switches to provide the ports that we needed for the new machines on the control and experimental networks. We also installed two power controllers to enable remote power cycling of the machines, and we upgraded the University’s pathway to Internet2 to a 10 Gbps connection.

In preparation for the conversion to ProtoGENI, we began by upgrading to the latest release of the Emulab software. With that upgrade in place, we were ready to convert to the ProtoGENI code and its aggregate managers (i.e., the component and slice managers). The conversion went relatively smoothly, but was not without problems: the process was an iterative one of installing and trying the code, finding problems, working with Utah to address the issues, and then re-testing.

We originally planned to set up the Kentucky aggregate as an independent ProtoGENI cluster and connect it to the ProtoGENI clearinghouse at a later date. However, after discussing this with the Utah group, we decided against that approach and instead connected to the ProtoGENI clearinghouse from the start. As a result, the Kentucky aggregate has been connected to, and registered with, the ProtoGENI clearinghouse from the beginning, so Kentucky certificates and resources are known to the clearinghouse and can be verified and looked up by clients. We operate a local slice authority and component manager that are known to the clearinghouse and can be contacted to obtain details of the resources currently allocated. Because our system is integrated with the Utah clearinghouse, slices can be allocated across Utah and Kentucky resources.

2. Designing and Implementing the instrumentation and measurement system using ProtoGENI API.

Having developed a better understanding of the ProtoGENI architecture and aggregate API, we developed an instrumentation and measurement architecture for ProtoGENI that incorporates measurement capabilities from our earlier Edulab system. However, because of the differences between Emulab and ProtoGENI, we had to take a different approach to the design of the instrumentation and measurement system than we did with Edulab. In particular, we decided to use experiment-specific (i.e., slice-specific) measurement nodes, creating a local measurement system within each experiment (slice). Slice-specific monitoring matches the standard usage model in which users only work with their own experiments and thus are interested in collecting measurement information for their own experiment, not the network as a whole. It also allows users to keep measurement data private and local, yet allows data to be made public if desired.

Another design decision that we made was to build our instrumentation and measurement system using the ProtoGENI API instead of tapping directly into the Emulab database system as we had done in our previous Edulab implementation. Mapping our system onto the ProtoGENI APIs and infrastructure was rather challenging, in part because ProtoGENI was itself still under development and did not yet have a set of guidelines or best practices describing the “approved” or “correct” way to hook into, and interoperate with, the ProtoGENI infrastructure. In order to flesh out the details of our design, we had to build components, connect them into ProtoGENI via the API, and try them out, often finding problems or limitations of the API that in turn required redesigning our architecture (and software). In that sense, fleshing out a detailed design for our measurement system was an iterative process. As part of this work, we realized that some of the information we needed in order to set up the instrumentation and measurement infrastructure was not available via the ProtoGENI APIs. After discussions with Utah, the ProtoGENI team decided to add a new abstraction to the APIs called a manifest that would contain detailed information about resources that should not be in an RSPEC. We then designed, and used, the manifests that Utah added to their APIs to find the information we needed.
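
As an illustration, the following is a minimal sketch of how a client might pull per-node information out of a manifest returned by the ProtoGENI API. It is written in Python; the element and attribute names (node, client_id, hostname) are assumptions about the manifest schema rather than a definitive description of it.

    import xml.etree.ElementTree as ET

    def nodes_from_manifest(manifest_xml):
        """Return (client_id, hostname) pairs for the nodes in a manifest."""
        root = ET.fromstring(manifest_xml)
        nodes = []
        for elem in root.iter():
            if elem.tag.endswith('node'):                 # tolerate XML namespaces
                client_id = elem.get('client_id') or elem.get('virtual_id')
                hostname = elem.get('hostname') or elem.get('component_id')
                nodes.append((client_id, hostname))
        return nodes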

Because we were learning about the (ever-changing) control framework at the same time that we were designing an instrumentation and measurement system to be built on top of it, our design has evolved as we discovered the various features and limitations of the ProtoGENI control framework. We published a report describing our instrumentation and measurement system (the INSTOOLS system), which can be found on the INSTOOLS wiki page at http://groups.geni.net/geni/wiki/InstrumentationTools. Our architectural design divides the system into the following components: measurement setup, data capture, data collection, data storage, data processing, measurement control, access, and presentation. The design attempts to separate the instrumentation and measurement functionality from the control framework functionality as much as possible so that the code for both can evolve independently. As a result, our INSTOOLS system primarily interacts with the control framework via the ProtoGENI API. We limited the modifications we had to make to the control framework to a handful of places in the setup code where we insert calls to invoke our INSTOOLS measurement setup code. Instead of deploying our own INSTOOLS servers and services, our setup code automatically creates the necessary services by transparently adding additional GENI resources to a user’s slice and then configuring those resources to carry out the INSTOOLS measurement tasks. In other words, the measurement infrastructure is itself part of the slice/experiment.

In regards to the design of the user interface and display of data, we decided to take a lazy-evaluation approach. The raw data collected by the system is saved as RRD database files. Instead of generating graphs and image files immediately for potential use in the future, our system generates graphs from the RRD data on demand. Users interact with the measurement system through a web interface that allows them to select the information they want to see. That information is then dynamically created and updated every five seconds to give users a “live” look at what is occurring in their slice.
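
As an illustration of this lazy-evaluation approach, the sketch below renders a graph from an RRD file only when a page is requested. It assumes the rrdtool command-line utility is installed; the RRD path, the data-source name ds_in, and the five-minute window are hypothetical.

    import subprocess

    def render_graph(rrd_path, png_path, seconds=300):
        """Render the last few minutes of one data source as a PNG, on demand."""
        subprocess.check_call([
            'rrdtool', 'graph', png_path,
            '--start', '-%d' % seconds,                   # last N seconds of data
            '--title', 'Interface traffic',
            'DEF:inoct=%s:ds_in:AVERAGE' % rrd_path,      # ds_in is a hypothetical name
            'LINE1:inoct#0000FF:bytes in',
        ])
        return png_path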

Our initial implementation uses hooks into the ProtoGENI control framework to invoke our code that sets up a measurement control (MC) node for each slice. Measurement control nodes form the heart of our measurement system, controlling software on the Measurement Points (MPs), collecting data from MPs, and making the data available to users via a web server. As part of the setup, each sliver (MP) launches software to capture measurement data that is then collected by the MC. The raw data collected is typically stored in RRD database files and then converted using rrdtool into graphs that show traffic levels or utilization levels. The MC also houses a web server that provides users with (visual) access to the graphs and charts of measurement data.
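
The collection step itself can be approximated as follows: the MC polls an SNMP counter on an MP and appends the sample to an RRD file. The community string, interface index, and RRD layout shown are assumptions made for the sake of the example, not the actual INSTOOLS configuration.

    import subprocess

    IF_IN_OCTETS = '1.3.6.1.2.1.2.2.1.10.2'    # IF-MIB::ifInOctets for ifIndex 2

    def poll_and_store(mp_host, rrd_path, community='public'):
        """Poll one SNMP counter on a measurement point and record it in an RRD."""
        out = subprocess.check_output(
            ['snmpget', '-v2c', '-c', community, '-Oqv', mp_host, IF_IN_OCTETS])
        value = out.strip().decode()
        # "N" asks rrdtool to timestamp the sample with the current time.
        subprocess.check_call(['rrdtool', 'update', rrd_path, 'N:%s' % value])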

This earlier implementation required placing hooks in the ProtoGENI code (i.e., rewriting and enhancing parts of the ProtoGENI source code) and made it difficult to keep pace with the frequent updates and changes being rolled out by the ProtoGENI team. Each time a new version of the ProtoGENI code was released, we had to re-implement our changes in the new version. To achieve a level of separation from the ProtoGENI code, we redesigned our code so that it only interacts with ProtoGENI via the ProtoGENI API calls. In other words, we abandoned the modifications that we had made to the ProtoGENI code itself, and instead developed ways to achieve the same results simply by making calls to the ProtoGENI API. In particular, we wrote scripts that capture the RSPEC before it goes to the ProtoGENI API, add a measurement controller (MC) node along with the code needed on the MC, and then make the API calls needed to initialize the experiment, the MC, and all the measurement points (nodes). As a result, our code depends only on the ProtoGENI API, allowing the ProtoGENI implementation to change without affecting our code.
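
A minimal sketch of this RSPEC-rewriting step is shown below: it inserts an extra node, intended to become the MC, into a request RSPEC before the document is handed to the ProtoGENI API. The RSpec element names, attributes, and the INSTOOLS-MC image name are illustrative assumptions, not the exact format used by our scripts.

    import xml.etree.ElementTree as ET

    def add_mc_node(rspec_xml, mc_id='measurement-controller'):
        """Append a node for the MC to a request RSPEC and return the new XML."""
        root = ET.fromstring(rspec_xml)
        mc = ET.SubElement(root, 'node')
        mc.set('client_id', mc_id)
        mc.set('exclusive', 'true')
        # Request a disk image preloaded with the MC software (name is hypothetical).
        ET.SubElement(mc, 'disk_image').set('name', 'INSTOOLS-MC')
        return ET.tostring(root)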

Another enhancement we made was to make the instrumentation and measurement deployment and teardown independent of the slice setup and teardown. In our original version of the code, we modified the Utah slice creation scripts to set up the instrumentation system at the same time. In our most recent version of the code, users can create a slice using any of the ProtoGENI-approved ways (e.g., via scripts or via the Utah Flash-based FLACK GUI). After the slice has been established, our new script discovers the topology of the slice and instruments it appropriately. We also created scripts to remove instrumentation from a slice when it is no longer needed.

Another important feature is the ability to instrument slices that span aggregates. Our code identifies all the aggregates that comprise a slice, locates the component managers for those aggregates, discovers the resources used in each aggregate, and then proceeds to set up an MC in each of the aggregates, where the slice's results will be collected and made available via the web interface. Finally, our code installs and initiates the measurement software on the resources in each of the aggregates, with measurement data from each resource being directed to the appropriate MC. Our scripts also allow a user to instrument only portions of a slice by selecting which aggregates should be instrumented and which should not.
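
The overall control flow for instrumenting a multi-aggregate slice can be sketched as follows. The helper functions are stubs standing in for the real INSTOOLS scripts; only the per-aggregate loop and the optional aggregate selection are meant to mirror the behavior described above.

    def fetch_manifest_nodes(cm_url, slice_urn):
        return []                                   # stub: nodes in this aggregate

    def create_measurement_controller(cm_url, slice_urn):
        return 'mc.%s' % cm_url                     # stub: hostname of the new MC

    def start_measurement_software(node, report_to):
        print('instrumenting %s -> %s' % (node, report_to))    # stub

    def instrumentize(slice_urn, aggregates, selected=None):
        """Set up one MC per (selected) aggregate of a multi-aggregate slice."""
        for cm_url in aggregates:                   # component manager URLs
            if selected and cm_url not in selected:
                continue                            # user chose not to instrument this aggregate
            mc = create_measurement_controller(cm_url, slice_urn)
            for node in fetch_manifest_nodes(cm_url, slice_urn):
                start_measurement_software(node, report_to=mc)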

3. Maintaining and updating the instrumentation and measurement system.

After the initial instrumentation and measurement system was built, we continuously maintained and updated it. We describe several major updates to the system here.

We integrated the Drupal content management system for purposes of displaying the collected measurement information. We load Drupal nodes that render the graphs into the Drupal database. Because each graph is a distinct Drupal node, users can define their own views of the measurement data, allowing them to display precisely the information they are interested in. It also allows users to define the theme/look-and-feel of the web interface to meet their needs.

We completed our implementation of support for NetFlow data collected within a user’s slice. In addition to setting up the instrumentation and measurement infrastructure to collect packet data rates, we now set up the NetFlow services needed to capture (and categorize) data on a per-flow basis. Similar to our SNMP services, the NetFlow capture services send their data to the measurement controller (MC) for processing and viewing. The flows to be monitored are preconfigured for the user, so they can simply go to the web interface to see the most common types of flows (e.g., to well-known ports).

The experience of adding in a new type of information source (NetFlow) to our measurement system caused us to think about the problem of simplifying the process of adding new information sources in the future. In particular, we wanted it to be easy for users to modify the web interface to display the data they collect on each node or link. To make this possible, we created a web page on the MC where a user can enter a parameterized command that is then used to (automatically) generate all the web pages needed to view that type of data. As a result, it is relatively easy to incorporate new types of (collected) information into the web interface.
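
As a rough illustration, such a generator might expand a user-supplied command template into one display page per node. The %node% placeholder and the page layout below are assumptions about how the generator could work, not the actual INSTOOLS implementation.

    def generate_pages(command_template, nodes):
        """Expand a parameterized command into one simple display page per node."""
        pages = {}
        for node in nodes:
            command = command_template.replace('%node%', node)
            pages[node] = (
                '<html><body>\n'
                '<h2>%s</h2>\n'
                '<!-- data produced by: %s -->\n'
                '<img src="/graphs/%s.png">\n'
                '</body></html>\n' % (node, command, node))
        return pages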

As more experiments start to use the ProtoGENI facilities, making efficient use of the resources has become one of our priorities. To that end, we worked with the Utah group to add support for virtual nodes based on OpenVZ so that each physical PC can be used as multiple experimental nodes. Adding support for virtual nodes turned out to be fraught with problems, and we had to go through several rounds of bug fixes and upgrades to get the latest development version of the code working. In addition, we had to manually fix the OpenVZ image for the shared nodes because the change had not been pushed out to the image running at Utah. Unlike BSD jails, the OpenVZ image supports running independent SNMP daemons on each OpenVZ node (i.e., one SNMP daemon per VM). This feature allows us to perform per-sliver monitoring without the need to write new monitoring software that understands the OpenVZ virtual interfaces.

Another addition to our software was the ability to access X window software running on the experimental nodes (slivers) via the MC web interface. Our goal was to leverage existing network monitoring tools, such as Wireshark and EtherApe, in order to observe the behavior of experimental nodes. These tools are helpful for collecting node statistics and visualizing link traffic, but they need X window support. To support such access, we added the ability to dynamically load X window software onto the experimental nodes and then provide indirect access through the MC web browser and the VNC protocol. This capability has been added to the MC's Drupal content management system as one of the menu options. The only way for a user to access the programs running on the experimental nodes is to first log in to the Drupal CMS running on the MC. This way, we can assure the slice owner that they are the only ones allowed to access/run these programs. Our Drupal interface has two VNC templates that are preconfigured to run xterm and Wireshark respectively on the nodes in the slice via VNC. The MC runs a Java-based VNC client (in the CMS) that mirrors the VNC connections from the nodes. The VNC communication between the nodes and the MC is protected by a system-generated random password that is unique for every slice and invisible to the user.
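
The per-slice VNC password mentioned above could be produced along the following lines. The state directory and password length are assumptions; the real system may store and distribute the password differently.

    import os
    import secrets

    def make_vnc_password(slice_name, state_dir='/var/run/instools'):
        """Create and store a random VNC password for one slice."""
        password = secrets.token_urlsafe(12)        # random; never shown to the user
        os.makedirs(state_dir, exist_ok=True)
        path = os.path.join(state_dir, '%s.vncpass' % slice_name)
        with open(path, 'w') as f:
            f.write(password)
        os.chmod(path, 0o600)                       # readable only by the MC service
        return password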

To simplify access to a user's measurement data we developed a "portal" system. The portal is a one-stop shop that allows users to access all their measurement data via a single interface. Instead of requiring the user to visit multiple URLs, where each URL refers to one MC, the user visits a single URL representing the portal. The portal presents a visualization of the topology and allows users to select the measurement data they would like to see. Given the longitude and latitude information for each resource in the slice, the portal can show each resource's actual location on a map with links connecting the resources. If nodes within an aggregate are too close to be distinguished, the portal provides an expanded view to show the logical links among them. By clicking on a node or a link, users can get a list of the measurement information available. They can then choose what data to view, and the portal will present the corresponding tables and/or graphs on the screen. The links can also be color-coded to show the current levels of traffic over the links.

We also connected our INSTOOLS system with two archival systems: (1) the University of Kentucky Archive Service (UKAS) and (2) the CNRI Archive Service. Our UKAS archive service not only provides a repository for data, but it also provides a computational environment where we can recreate the same look-and-feel the user had when viewing the live data. In particular, we use OpenVZ containers to provide a virtualized environment in which to run the Drupal content management system that was running on the MC at the time the archive was taken. As a result, the user can visit the same web pages that were offered by the MC at the time the archive was taken. We have also incorporated support for the CNRI archive service and its concept of workspaces. In particular, our system is able to store the data files containing measurement information into a CNRI workspace associated with a slice. After the data has been stored in the workspace, the system adds the necessary metadata needed by the CNRI archive to move the measurement files from the workspace to the CNRI archive for permanent storage. The files can then be accessed via the CNRI web interface.

4. Collaborating with other teams and integrating the INSTOOLS software into the FLACK client/GUI.

We worked with the Utah group and the GENI security team while designing our instrumentation and measurement system, discussing methods for authentication and secure access to slice and measurement resources. In particular, we collaborated with the Utah team to integrate the INSTOOLS software with the FLACK client/GUI.

Instead of requiring users to run a set of scripts to instrumentize their experiment, we modified the FLACK GUI so that users can instrumentize their slice simply by clicking on a button in the GUI. In particular, we added two new buttons to the FLACK GUI, one to add instrumentation to a slice and the other to access the portal site. To instrumentize an existing slice, a user simply needs to click on the instrumentize button in the FLACK GUI. The GUI will then talk with the backend instrumentation servers to instrumentize the slice. While this is occurring, information about the progress/status of the instrumentize process is sent back to the GUI and can be viewed by the user. After the slice has been instrumentized, the Go-to-portal button is enabled, allowing the user to view the topology and the traffic moving across various nodes and links. The user can pick any node or link in the experiment to observe the measurement data he/she is interested in.

The integration of INSTOOLS with the FLACK interface greatly simplifies the instrumentation process for users, making it much easier to use the instrumentation tools. From a user's perspective, the integration also enables a sort of single sign-on service in which the user authenticates to the FLACK client, and the FLACK client and the backend instrumentation manager then handle all the authentication on behalf of the user to the long list of services that comprise the instrumentation system. The portal button on the FLACK client takes the user directly to our GENI Monitoring Portal (GMP), so the user does not need to remember URLs or log in to the GMP on their own.

To integrate the instrumentation tools with the FLACK client, we redesigned our software so that the installation and deployment of the instrumentation software is done by a backend instrumentation manager (IM) that, like a component manager, is responsible for allocating and setting up instrumentation infrastructure. As a result, the user interface components of our system could be separated out and integrated into the FLACK client. This involved rewriting our Python scripts as Flash code that makes ProtoGENI API calls as well as XML-RPC calls to our instrumentation manager. We run an IM at each aggregate in concert with that aggregate's component manager. The IM is responsible for sshing to experimental nodes, running initialization scripts, setting up the MC nodes, etc., and, like a component manager, has the authority needed to carry out these tasks.
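
To illustrate this division of labor, the sketch below shows a minimal XML-RPC surface that an instrumentation manager could expose to the FLACK client. The method names, port, and use of Python's xmlrpc.server module are assumptions; the real IM also authenticates callers and uses ssh to configure the experimental nodes.

    from xmlrpc.server import SimpleXMLRPCServer

    class InstrumentationManager:
        def instrumentize(self, slice_urn):
            # ...allocate the MC, ssh to the slivers, run the init scripts...
            return {'status': 'in-progress', 'slice': slice_urn}

        def status(self, slice_urn):
            # ...report progress so the GUI can display it...
            return {'status': 'ready', 'portal_url': 'https://portal.example.net/'}

    if __name__ == '__main__':
        server = SimpleXMLRPCServer(('0.0.0.0', 8001), allow_none=True)
        server.register_instance(InstrumentationManager())
        server.serve_forever()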

5. Supporting the operation of the Kentucky aggregate and the use of INSTOOLS.

We provided continual support to the operation of the Kentucky aggregate and enabled experimentation through the ProtoGENI clearinghouse. We identified multiple GENI experiments that will use our Instrumentation Tools. We made contact with these experimenters and have been providing support as they begin to use the tools. We established a web site with documentation/tutorials/examples to help experimenters get started.

6. Giving tutorials and demos at GECs and using the tools in classes.

We demonstrated the INSTOOLS system at GEC6, GEC7, and GEC10, and gave tutorials on using the system at GEC8 and GEC11. We had several good discussions with other measurement groups regarding ways to incorporate their measurement data into our measurement interface. We used GENI and INSTOOLS in several of our networking and operating systems classes.

B. Project participants

The following individuals have helped with the project in one way or another:

  • Jim Griffioen - Project PI
  • Zongming Fei - Project Co-PI
  • Hussamuddin Nasir - Technician/programmer
  • Xiongqi Wu - Research Assistant
  • Jeremy Reed - Research Assistant
  • Lowell Pike - Network administrator
  • Woody Marvel - Network administrator

C. Publications (individual and organizational)

  • James Griffioen and Zongming Fei, "Automatic Creation of Experiment-specific Measurement Infrastructure," In the proceedings of the First Workshop on Performance Evaluation of Next-Generation Networks (Neteval09), Boston MA, April 2009.
  • GENI Report: J. Griffioen, Z. Fei, H. Nasir, Architectural Design and Specification of the INSTOOLS Measurement System, December 2009.
  • Jonathon Duerig, Robert Ricci, Leigh Stoller, Matt Strum, Gary Wong, Charles Carpenter, Zongming Fei, James Griffioen, Hussamuddin Nasir, Jeremy Reed, Xiongqi Wu, "Getting started with GENI: A user tutorial," ACM SIGCOMM Computer Communication Review (CCR), vol.42, no.1, pp.72-77, January 2012.
  • James Griffioen, Zongming Fei, Hussamuddin Nasir, Xiongqi Wu, Jeremy Reed, Charles Carpenter, "The Design of an Instrumentation System for Federated and Virtualized Network Testbeds", Proc. of the First IEEE Workshop on Algorithms and Operating Procedures of Federated Virtualized Networks (FEDNET), Maui, Hawaii, April 2012.
  • James Griffioen, Zongming Fei, Hussamuddin Nasir, Xiongqi Wu, Jeremy Reed, Charles Carpenter, "Measuring experiments in GENI," Computer Networks, vol.63, pp.17-32, 2014.

D. Outreach activities

We participated in the GENI Measurement conference and were involved in the activities of the GENI measurement working group. We gave talks about our work at Neteval 2009 in Boston and at the Internet2 Joint Techs conference held at Clemson University. We demonstrated the INSTOOLS system at GEC6, GEC7, and GEC10, and gave tutorials on using the system at GEC8 and GEC11. We have been providing support for the users of our tools.

E. Collaborations

Most of our collaborations were with the Utah ProtoGENI team. We were actively involved in the bi-weekly meetings of the ProtoGENI cluster. We also had discussions with other measurement groups including the OnTimeMeasure group at Ohio State and the S3 Monitor group at Purdue. We have also had conversations with the GENI security teams and members of other clusters regarding the design of security aspects of the measurement system.

F. Other Contributions