Changes between Initial Version and Version 1 of InstrumentationTools/INSTOOLS_final_report

06/09/16 14:41:07
= INSTOOLS Project Final Report =

June 2016

== I. Major accomplishments ==

We finished all the milestones stated in the contract and its modifications.
The following highlights our accomplishments for the project.

=== A. Milestones achieved ===

During the project period we achieved the following milestones:
 * We upgraded our Edulab facility, including installation and incorporation of
   new hardware into the Edulab system. We also upgraded the University’s
   Internet2 network connection. We installed the latest copy of the Utah
   ProtoGENI code (i.e., the aggregate manager) on the Kentucky Edulab system,
   resulting in a local aggregate implementation. We integrated the Kentucky
   aggregate with the ProtoGENI Clearinghouse to become a ProtoGENI edge cluster.
 * We implemented the measurement and instrumentation system using the ProtoGENI API.
   We integrated our existing monitoring and measurement tools into ProtoGENI by
   designing a measurement system architecture which provides a separate measurement
   controller for each experiment.
 * We continued maintaining and updating the instrumentation and measurement systems.
   We developed a new web interface to the measurement system based on the
   Drupal content management system and implemented a new monitoring tool
   to capture NetFlow data. We added support for virtual nodes and provided
   a VNC interface to experiment nodes. We developed a portal that provides
   a single point of entry/control across various aggregates, and an archival
   system with features to archive data for future use and analysis.
 * We worked with the Utah group and the GENI security team while designing
   our instrumentation and measurement system, discussing methods for
   authentication and secure access to slice and measurement resources.
   In the process we provided Utah with feedback and changes needed to
   make the code work on our aggregate. We also worked with the Utah group
   to integrate the INSTOOLS software into the FLACK client/GUI.
 * We provided continual support for the operation of the Kentucky aggregate.
 * We gave tutorials and demos of the instrumentation tools at GECs to
   experimenters and educators. We used ProtoGENI and INSTOOLS in our
   networking and operating systems classes.
 * We made the stable version of our software
   available through the ProtoGENI GIT repository.
=== B. Deliverables made ===

 * We made the stable version of our software available via the
   ProtoGENI GIT repository.

 * We posted the documentation about INSTOOLS at
   []

 * We posted a tutorial providing step-by-step instructions
   on how to use INSTOOLS at
   []
== II. Description of work performed during the project period ==

The following provides a description of our activities and results for the project.

=== A. Activities and findings ===

Our activities and findings fall into the following six areas.
==== 1. Building the Kentucky Aggregate. ====

We upgraded our existing Edulab facility and then transformed it into a ProtoGENI
edge cluster (aggregate). We began by moving our Edulab cluster into a larger room that would better
accommodate the additional machines that were to be incorporated into the system. We then fixed and
upgraded the existing Edulab hardware in preparation for its conversion to the ProtoGENI system. We also
purchased and installed 24 new PCs and integrated them with the existing 47 machines to bring the total
number of PCs to just over 70. We installed extra network interfaces in each PC so that each PC has one
interface on the control network and an additional 4 interfaces on the experimental network. We installed
5 existing and newly purchased switches to provide the ports that we needed for the new machines on the
control and experimental networks. We also installed two power controllers to enable remote power cycling
of the machines, and we upgraded the University’s pathway to Internet2 to a 10 Gbps connection.
In preparation for the conversion to the ProtoGENI code, we began by upgrading our software to the
latest release of the Emulab software. Having upgraded to the latest Emulab code, we were ready to convert
to the ProtoGENI code. The conversion to the ProtoGENI code and its aggregate managers (i.e., the component
and slice managers) went relatively smoothly, but was not without problems. The process was an
iterative one: installing and trying the code, finding problems, working with Utah to address the issues, and
then re-testing the code.
We originally planned to set up the Kentucky aggregate as an independent ProtoGENI cluster, and then
connect to the ProtoGENI clearinghouse at a later date. However, after discussing this with the Utah group,
we decided against this approach and instead connected to the ProtoGENI clearinghouse from the start. As
a result, the Kentucky aggregate has been connected to, and registered with, the ProtoGENI clearinghouse
from the start, so that Kentucky certificates and resources are known to the ProtoGENI clearinghouse and
can be verified/looked up by clients. We operate a local slice authority and component manager that are
known to the clearinghouse and can be contacted to obtain details of the current resources that have been
allocated. Because our system is integrated with the Utah clearinghouse, slices can be allocated across Utah
and Kentucky resources.
==== 2. Designing and implementing the instrumentation and measurement system using the GENI API. ====

Having developed a better understanding of the ProtoGENI architecture and aggregate API, we began developing
an instrumentation and measurement architecture for ProtoGENI that incorporates measurement
capabilities from our earlier Edulab system. However, because of the differences between Emulab and ProtoGENI,
we had to take a different approach to the design of the instrumentation and measurement system
than we did with Edulab. In particular, we decided to use experiment-specific (i.e., slice-specific) measurement
nodes, creating a local measurement system within each experiment (slice). Slice-specific monitoring
matches the standard usage model in which users only work with their own experiments and thus are interested
in collecting measurement information for their own experiment, not the network as a whole. It also
allows users to keep measurement data private and local, while still allowing data to be made public if desired.
Another design decision that we made was to build our instrumentation and measurement system using
the ProtoGENI API instead of tapping directly into the Emulab database system as we had done in our
previous Edulab implementation. Mapping our system onto the ProtoGENI APIs and infrastructure was
rather challenging, in part because ProtoGENI was itself under development and did not yet have a
set of guidelines or best practices describing the “approved” or “correct” way to hook into, and interoperate
with, the ProtoGENI infrastructure. In order to flesh out the details of our design, we had to build components,
connect them into ProtoGENI via the API, and try them out, often finding problems/limitations of the
API that in turn required redesigning our architecture (and software). In that sense, fleshing out a detailed
design for our measurement system was an iterative process. As part of this work, we realized that some
of the information we needed in order to set up the instrumentation and measurement infrastructure was
not available via the ProtoGENI APIs. After discussions with Utah, the ProtoGENI team decided to add a
new abstraction to the APIs called a manifest that would contain detailed information about resources that
should not be in an RSPEC. We then designed, and began using, the manifests that Utah added to their APIs
to find the information we needed. While the manifest abstraction was a welcome addition, the manifest
implementation and API calls still have not solved all of our issues and have caused us to rethink how and
when topology information should be obtained by our measurement nodes. We have added code to work
around these problems, but the issue is worth returning to in the future and may require additional changes
to the control framework and its API.
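A manifest is returned as XML, and our setup code walks it to recover the topology information the measurement nodes need. The following sketch shows the idea; the element and attribute names here are simplified illustrations, not the exact (namespaced) ProtoGENI manifest schema:

```python
import xml.etree.ElementTree as ET

# Illustrative manifest snippet; the real ProtoGENI manifest schema differs.
MANIFEST = """
<rspec>
  <node virtual_id="node0" hostname="pc155.uky.emulab.net">
    <interface virtual_id="if0" ip="10.1.1.2"/>
  </node>
  <node virtual_id="node1" hostname="pc161.uky.emulab.net">
    <interface virtual_id="if0" ip="10.1.1.3"/>
  </node>
</rspec>
"""

def topology_from_manifest(xml_text):
    """Map each node's virtual id to its control hostname and interface IPs."""
    topo = {}
    for node in ET.fromstring(xml_text).iter("node"):
        topo[node.get("virtual_id")] = {
            "hostname": node.get("hostname"),
            "ips": [i.get("ip") for i in node.iter("interface")],
        }
    return topo
```

A measurement controller can use such a map to decide which hosts to contact and which interfaces to monitor.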
Because we were learning about the (ever-changing) control framework at the same time that we were
designing an instrumentation and measurement system to be built on top of it, our design has been evolving
as we discover the various features and limitations of the ProtoGENI control framework. We recently
published a report describing our instrumentation and measurement system (the INSTOOLS system), which
can be found on the INSTOOLS wiki page at []. Our
architectural design divides the system into the following components: measurement setup, data capture,
data collection, data storage, data processing, measurement control, access, and presentation. The design
attempts to separate the instrumentation and measurement functionality from the control framework functionality
as much as possible so that the code for both can evolve independently. As a result, our INSTOOLS
system primarily interacts with the control framework via the ProtoGENI API. We limited the number of
modifications we had to make to the control framework to a handful of places in the setup code where we
insert calls to invoke our INSTOOLS measurement setup code. Instead of deploying our own INSTOOLS
servers and services, our setup code automatically creates the necessary services by transparently adding additional
GENI resources to a user’s slice and then configuring those resources to carry out the INSTOOLS
measurement tasks. In other words, the measurement infrastructure is itself part of the slice/experiment.
Regarding the design of the user interface and display of data, we decided to take a lazy-evaluation
approach. The raw data collected by the system is saved as RRD database files. Instead of generating graphs
and image files immediately for potential use in the future, our system generates graphs from the RRD
data on demand. Users interact with the measurement system through a web interface that allows them to
select the information they want to see. That information is then dynamically created and updated every five
seconds to give users a “live” look at what is occurring in their slice. Ultimately, we plan to use a content
management system to give the user even greater control over the information displayed by the measurement
system.
Our initial implementation uses
hooks into the ProtoGENI control framework to invoke our code that sets up a measurement controller (MC)
node for each slice. Measurement controller nodes form the heart of our measurement system, controlling
software on the Measurement Points (MPs), collecting data from MPs, and even making the data available
to users via a web server. As part of the setup, each sliver (MP) launches software to capture measurement
data that is then collected by the MC. The raw data collected is typically stored in RRD database files and
then converted using rrdtools into graphs that show traffic levels or utilization levels. The MC also houses
a web server that provides users with (visual) access to the graphs and charts of measurement data. Our
current user interface is a simple PHP-based interface that allows users to select the sliver or link for which
they want to see measurement data. The data is then displayed and automatically refreshed every five seconds
to give the impression of a “live” view of the running system.
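The on-demand graph step above amounts to building an rrdtool invocation each time a page is requested. A minimal sketch of that command construction follows; the RRD path, data-source name, and output path are illustrative assumptions, not the actual INSTOOLS file layout:

```python
# Sketch of on-demand graph generation from an RRD file. The command list is
# built here and would be handed to rrdtool only when a user requests a graph.
def build_graph_cmd(rrd_path, out_png, ds="traffic_in", minutes=10):
    """Build an rrdtool invocation rendering the last `minutes` of one data source."""
    return [
        "rrdtool", "graph", out_png,
        "--start", "-%dm" % minutes,              # graph only the recent window
        "DEF:v=%s:%s:AVERAGE" % (rrd_path, ds),   # read the averaged data source
        "LINE1:v#0000ff:%s" % ds,                 # draw it as a single line
    ]

cmd = build_graph_cmd("/var/rrd/node0_eth0.rrd", "/tmp/node0_eth0.png")
# subprocess.run(cmd, check=True)  # executed per request, never ahead of time
```

Generating images lazily like this avoids wasting cycles on graphs nobody views, at the cost of a small per-request delay.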
This earlier implementation required placing hooks in the ProtoGENI
code (i.e., rewriting and enhancing parts of the ProtoGENI source code),
which made it difficult to keep pace with the frequent updates and changes being
rolled out by the ProtoGENI team. Each time a new version of the ProtoGENI code was released, we had to
reimplement our changes in the new version. To achieve a level of separation from the ProtoGENI code, we
redesigned our code so that it only interacts with ProtoGENI via the ProtoGENI API calls. In other words,
we abandoned the modifications that we had made to the ProtoGENI code itself, and instead developed ways to
achieve the same results simply by making calls to the ProtoGENI API. In particular, we wrote scripts that
capture the RSPEC before it goes to the ProtoGENI API, add a measurement controller (MC) node along
with the code needed on the MC, and then make the API calls needed to initialize the experiment, the MC,
and all the measurement points (nodes). As a result, our code is now only dependent on the ProtoGENI API,
allowing the ProtoGENI implementation to change without affecting our code.
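The RSPEC-rewriting step described above can be sketched as a small XML transformation. The element names and the install-tarball URL below are simplified illustrations (real ProtoGENI RSpecs are namespaced and richer), but the shape of the operation is the same: intercept the request, append an MC node, and pass the result on to the API:

```python
import xml.etree.ElementTree as ET

def add_measurement_controller(rspec_xml, mc_id="MC"):
    """Insert an extra node (the measurement controller) into a request RSpec.
    Element names are simplified; real ProtoGENI RSpecs use XML namespaces."""
    root = ET.fromstring(rspec_xml)
    mc = ET.SubElement(root, "node", {"virtual_id": mc_id})
    # Hypothetical tarball of MC software to be installed on the new node.
    ET.SubElement(mc, "install", {"url": "http://example.org/instools-mc.tar.gz"})
    return ET.tostring(root, encoding="unicode")

request = '<rspec><node virtual_id="node0"/></rspec>'
augmented = add_measurement_controller(request)
```

Because only the request document is modified, the control framework itself never needs to know the extra node is a measurement controller.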
Another enhancement we made was to make the instrumentation and measurement deployment and
teardown independent of the slice setup and teardown. In our original version of the code, we modified the
Utah slice creation scripts to set up the instrumentation system at the same time. In our most recent version
of the code, users can create a slice using any of the ProtoGENI approved ways (e.g., via scripts or via the
Utah Flash GUI). After the slice has been established, our new script will discover the topology of the slice
and instrument it appropriately. We also created scripts to remove instrumentation from a slice when it is no
longer needed.
Perhaps the most important advance we made is the ability to instrument slices that span aggregates.
Our new code identifies all the aggregates that comprise a slice, locates the component managers for those
aggregates, discovers the resources used in each aggregate, and then proceeds to set up an MC in each of
the aggregates, where the slice’s results will be collected and made available via the web interface. Finally,
our code installs and initiates the measurement software on the resources in each of the aggregates, with
measurement data from each resource being directed to the appropriate MC. Our scripts also allow a user
to instrument only portions of a slice by selecting which aggregates should be instrumented and which ones
should not.
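The key bookkeeping step in the multi-aggregate case is grouping slivers by their home aggregate so that each group reports to the MC created in that aggregate. A minimal sketch of that grouping (the URNs and node ids are made up for illustration):

```python
# Sketch of the per-aggregate grouping step: every sliver reports to the
# measurement controller created in its own aggregate.
def assign_mcs(slivers):
    """slivers: list of (aggregate_urn, node_id) pairs.
    Returns {aggregate_urn: [node_ids]}; each group gets its own MC."""
    groups = {}
    for aggregate, node in slivers:
        groups.setdefault(aggregate, []).append(node)
    return groups

slivers = [("urn:uky", "n0"), ("urn:utah", "n1"), ("urn:uky", "n2")]
groups = assign_mcs(slivers)   # {'urn:uky': ['n0', 'n2'], 'urn:utah': ['n1']}
```

Instrumenting only part of a slice then reduces to dropping the unwanted aggregates from this map before any MCs are created.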
==== 3. Maintaining and updating the instrumentation and measurement system. ====

After building the initial instrumentation and measurement system, we
continuously maintained and updated it. We describe
several major updates to the system here.
We integrated the Drupal content management system for purposes of displaying the collected measurement
information. We now load Drupal nodes into the database that render the graphs. Because each graph is a
distinct Drupal node, users can define their own views of the measurement data, allowing them to display
precisely the information they are interested in. It also allows users to define the theme/look-and-feel of the
web interface to meet their needs.
We completed our implementation of support for NetFlow data collected within a user’s slice. In addition
to setting up the instrumentation and measurement infrastructure to collect packet data rates, we now set up
the NetFlow services needed to capture (and categorize) data on a per-flow basis. Similar to our SNMP
services, the NetFlow capture services send their data to the measurement controller (MC) for processing
and viewing. Currently, the flows to be monitored are preconfigured for the user so that they can simply
go to the web interface to see the most common types of flows (e.g., to well-known ports). Changing what is
monitored is still a manual task, but we plan to improve that in a future release.
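The export records that a NetFlow capture service forwards to the MC follow a fixed binary layout. As a flavor of what processing on the MC side involves, here is a sketch of decoding the 24-byte NetFlow v5 export header (field order per the v5 format; this is not INSTOOLS code, just an illustration of the record structure):

```python
import struct

# NetFlow v5 export header: version, record count, sysUptime, unix secs/nsecs,
# flow sequence, engine type/id, sampling interval -- 24 bytes, big-endian.
V5_HEADER = struct.Struct("!HHIIIIBBH")

def parse_v5_header(datagram):
    (version, count, uptime_ms, unix_secs, unix_nsecs,
     seq, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(datagram)
    return {"version": version, "count": count, "sequence": seq}

# Fabricated test datagram: version 5, 2 flow records, sequence number 99,
# padded with zero bytes where the two 48-byte flow records would sit.
pkt = V5_HEADER.pack(5, 2, 1000, 1600000000, 0, 99, 0, 0, 0) + b"\x00" * 96
```

Each flow record following the header carries the per-flow counters (src/dst address, ports, byte and packet counts) that the web interface categorizes for display.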
The experience of adding a new type of information source (NetFlow) to our measurement system
caused us to think about the problem of simplifying the process of adding new information sources in the
future. In particular, we wanted it to be easy for users to modify the web interface to display the data they
collect on each node or link. To make this possible, we created a web page on the MC where a user can enter
a parameterized command that is then used to (automatically) generate all the web pages needed to view
that type of data. As a result, it is now relatively easy to incorporate new types of (collected) information
into the web interface.
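The page-generation idea boils down to expanding one user-entered template once per target. The sketch below uses an invented `{node}` placeholder syntax and a made-up command name purely for illustration; the actual MC template syntax may differ:

```python
# Sketch of expanding a user-entered parameterized command into per-node
# entries, one generated page per target node.
def expand_command(template, targets):
    """Substitute each target name into the template."""
    return {t: template.replace("{node}", t) for t in targets}

pages = expand_command("show-netflow --node {node} --port 80", ["node0", "node1"])
```

Each expanded command then backs one automatically generated page in the MC web interface.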
As more experiments start to use the ProtoGENI facilities, making efficient use of the resources has become one of our priorities. To that end, we worked with the Utah group to add support for virtual nodes based on OpenVZ so that each physical PC can be used as multiple experimental nodes. Adding support for virtual nodes turned out to be fraught with problems, and so we had to go through several rounds of bug fixes and upgrades to get the latest development version of the code working. In addition, we had to manually fix the OpenVZ image for the shared nodes because the change had not been pushed out to the image running at Utah.
We found that, unlike BSD jails, the OpenVZ image supports the ability to run independent SNMP daemons on each OpenVZ node (i.e., one SNMP daemon per VM). This feature allows us to perform per-sliver monitoring without the need to write new monitoring software to understand the OpenVZ virtual interfaces.
Another addition to our software was the ability to access X window software running on the experimental nodes (slivers) via the MC web interface. Our goal was to leverage existing network monitoring tools, such as Wireshark and EtherApe, in order to observe the behavior of experimental nodes. These tools are helpful for collecting node statistics and visualizing link traffic. However, they need X window support. To support such access, we added the ability to dynamically load X window software onto the experimental nodes and then provide indirect access through the MC web browser and the VNC protocol. It has been added to the MC's Drupal content management system as one of the menu options. The only way for a user to access the programs running on the experimental nodes is to first log in to the Drupal CMS running on the MC. This way, we can assure the slice owner that they are the only ones allowed to access/run these programs. Our Drupal interface currently has two VNC templates that are preconfigured to run xterm and Wireshark, respectively, on the nodes in the slice via VNC. The MC runs a Java-based VNC client (in the CMS) that mirrors the VNC connections from the nodes. The VNC communication between the nodes and the MC is protected by a system-generated random password that is unique for every slice and invisible to the user.
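The per-slice VNC password scheme amounts to generating one random secret at instrumentation time and distributing it only to the MC and the nodes, never to the user. A sketch of such generation (the function name and length choice are our illustration, not the shipped code):

```python
import secrets

# Sketch of per-slice VNC password generation: a fresh random secret per
# slice, created at instrumentation time and never exposed to the user.
def make_vnc_password(nbytes=6):
    """Classic VNC authentication only honors the first 8 password characters,
    so 6 random bytes (8 urlsafe-base64 characters) use the full space."""
    return secrets.token_urlsafe(nbytes)[:8]
```

Because the password is random per slice and held only by the system, compromising one slice's VNC channel reveals nothing about any other slice.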
To simplify access to a user's measurement data, we developed a "portal" system. The portal is a one-stop shop that allows users to access all their measurement data via a single interface. Instead of requiring the user to visit multiple URLs, where each URL refers to one MC, the user visits a single URL representing the portal. The portal presents a visualization of the topology and allows users to select the measurement data they would like to see. Given the longitude and latitude information for each resource in the slice, the portal can show each resource's actual location on a map with links connecting the resources. If nodes within an aggregate are too close to be distinguished, the portal provides an expanded view to show the logical links among them. By clicking on a node or a link, users can get a list of the measurement information available. They can then choose what data to view. The portal will then present the corresponding tables and/or graphs on the screen. The links can also be color-coded to show the current levels of traffic over the links.
We also connected our INSTOOLS system with two
archival systems: (1) the University of Kentucky Archive Service (UKAS)
and (2) the CNRI Archive Service.
Our UKAS archive service not only
provides a repository for data, but it also provides a computational
environment where we can recreate the same look-and-feel the user
had when viewing the live data.  In particular, we
use OpenVZ containers to provide a virtualized environment in which
to run the Drupal content management system that was running on the MC
at the time the archive was taken.  As a result, the user can visit
the same web pages that were offered by the MC at the time the archive
was taken.
We have also incorporated support for the CNRI archive service and its
concept of workspaces.  In particular, our system is able to store
the data files containing measurement information in a CNRI workspace
associated with a slice.  After the data has been stored in the workspace,
the system adds the necessary metadata needed by the CNRI archive to move
the measurement files from the workspace to the CNRI archive for permanent
storage.  The files can then be accessed via the CNRI web interface.
==== 4. Collaborating with other teams and integrating the INSTOOLS software into the FLACK client/GUI. ====

We worked with the Utah group and the GENI security team while designing
our instrumentation and measurement system, discussing methods for
authentication and secure access to slice and measurement resources.
In particular, we collaborated with the Utah team to integrate
the INSTOOLS software with the FLACK client/GUI.
Instead of requiring users to run a set of scripts to instrumentize their
experiment, we modified the FLACK GUI so that users can instrumentize their
slice simply by clicking on a button in the GUI.
In particular, we added two new buttons to the FLACK GUI:
one to add instrumentation to a slice and the other to
access the portal site.
To instrumentize an existing slice, a user simply needs to click on the
instrumentize button in the FLACK GUI.  The GUI will then talk with the
backend instrumentation servers to instrumentize the slice.  While this is
occurring, information about the progress/status of the instrumentize process
is sent back to the GUI and can be viewed by the user.
After the slice has been instrumentized, the Go-to-portal button
is enabled, allowing the user to view the topology
and the traffic moving across various nodes and links.
The user can pick any node or link in the experiment to observe the
measurement data he/she is interested in.
The integration of INSTOOLS with the FLACK interface greatly
simplifies the instrumentation process for users, making it much easier
to use the instrumentation tools.
From a user's perspective, the integration also enables a sort of single
sign-on service in which the user authenticates to the FLACK client, and
then the FLACK client and the backend instrumentation manager handle
all the authentication on behalf of the user to the long list of
services that comprise the instrumentation system.
The portal button on the FLACK client takes the user
directly to our GENI Monitoring Portal (GMP) so that the user
does not need to remember URLs or log in to the GMP
on their own.
To integrate the instrumentation tools with the FLACK client,
we redesigned our software so that the installation and deployment
of the instrumentation software is done by a backend instrumentation
manager that, like a component manager, is responsible for allocating
and setting up instrumentation infrastructure.  As a result, the
user interface components of our system could be separated out
and integrated into the FLACK client.  This involved rewriting
our Python scripts as Flash code making ProtoGENI API calls as
well as XML-RPC calls to our instrumentation manager (IM).
We currently run an instrumentation manager (IM) at each aggregate,
in concert with the component manager at each aggregate.
The IM is responsible for sshing to experimental nodes, running
initialization scripts, setting up the MC nodes, etc., and, like a component
manager, has the authority needed to carry out these tasks.
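A client talks to the IM over XML-RPC, so each request is just a marshaled method call. As a flavor of that interface, the sketch below marshals a call body with Python's standard xmlrpc module; the method name `Instrumentize` and the URN argument are assumptions for illustration, not the IM's documented API:

```python
import xmlrpc.client

# Sketch of the XML-RPC request a client might send to the backend
# instrumentation manager (method name and argument are assumptions).
def im_request(slice_urn):
    """Marshal an Instrumentize call body for the IM endpoint."""
    return xmlrpc.client.dumps((slice_urn,), methodname="Instrumentize")

body = im_request("urn:publicid:IDN+uky.emulab.net+slice+demo")
```

In practice the FLACK client would POST such a body to the IM's HTTPS endpoint and relay the returned progress information to the user.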
==== 5. Supporting the operation of the Kentucky aggregate and the use of INSTOOLS. ====

We provided continual support for the operation of the Kentucky aggregate
and enabled experimentation through the ProtoGENI clearinghouse.
We identified multiple GENI experiments that will use our instrumentation tools.
We made contact with these experimenters and have been providing support as they
begin to use the tools.  We established a web site
with documentation/tutorials/examples to help experimenters get started.
==== 6. Giving tutorials and demos at GECs and using the tools in classes. ====

We demonstrated the INSTOOLS system at the GEC6, GEC7 and GEC10 conferences
and gave tutorials on using the system at the GEC8 and GEC11 conferences.
We had several good discussions with other measurement
groups regarding ways to incorporate their measurement data into
our measurement interface. We used GENI and INSTOOLS in several
of our networking and operating systems classes.
=== B. Project participants ===

The following individuals have helped with the project in one way or another
during the last quarter:

 * Jim Griffioen - Project PI
 * Zongming Fei - Project Co-PI
 * Hussamuddin Nasir - Technician/programmer
 * Xiongqi Wu - Research Assistant
 * Jeremy Reed - Research Assistant
 * Lowell Pike - Network administrator
 * Woody Marvel - Network administrator
=== C. Publications (individual and organizational) ===

 * James Griffioen and Zongming Fei, "Automatic Creation of Experiment-specific Measurement Infrastructure," in Proceedings of the First Workshop on Performance Evaluation of Next-Generation Networks (Neteval09), Boston, MA, April 2009.

 * GENI Report: J. Griffioen, Z. Fei, H. Nasir, "Architectural Design and Specification of the INSTOOLS Measurement System," December 2009.

 * Jonathon Duerig, Robert Ricci, Leigh Stoller, Matt Strum, Gary Wong, Charles Carpenter, Zongming Fei, James Griffioen, Hussamuddin Nasir, Jeremy Reed, Xiongqi Wu, "Getting started with GENI: A user tutorial," ACM SIGCOMM Computer Communication Review (CCR), vol. 42, no. 1, pp. 72-77, January 2012.

 * James Griffioen, Zongming Fei, Hussamuddin Nasir, Xiongqi Wu, Jeremy Reed, Charles Carpenter, "The Design of an Instrumentation System for Federated and Virtualized Network Testbeds," Proc. of the First IEEE Workshop on Algorithms and Operating Procedures of Federated Virtualized Networks (FEDNET), Maui, Hawaii, April 2012.

 * James Griffioen, Zongming Fei, Hussamuddin Nasir, Xiongqi Wu, Jeremy Reed, Charles Carpenter, "Measuring experiments in GENI," Computer Networks, vol. 63, pp. 17-32, 2014.
=== D. Outreach activities ===

We participated in the GENI Measurement conference and
were involved in the activities of the GENI measurement working group.
We gave talks about our work at Neteval 2009, held in Boston, and at
the Internet2 Joint Techs conference held at Clemson University.
We demonstrated the INSTOOLS system at the GEC6, GEC7 and GEC10 conferences
and gave tutorials on using the system at the GEC8 and GEC11 conferences.
We have been providing support for the early adopters of our tools.
=== E. Collaborations ===

Most of our collaborations were with the Utah ProtoGENI team. We were actively
involved in the bi-weekly meetings of the ProtoGENI cluster. We also had discussions with other measurement
groups, including the OnTimeMeasure group at Ohio State and the S3 Monitor group at Purdue.
We have also had conversations with the GENI security teams and
members of other clusters regarding the design of security aspects of the measurement system.
=== F. Other Contributions ===