wiki:PlasticSlices/ProjectPlan


1. Introduction

This is the first project in the Meso-Scale Campus Experiments plan, for evolving meso-scale experiments between now and GEC 12. The overarching purpose is to set shared campus goals, and to agree on a schedule and resources to support experiments. This first project, dubbed "Plastic Slices", is an effort to run ten (or more) GENI slices continuously for multiple months -- not merely for the sake of having ten slices running, but to gain experience managing and operating production-quality GENI resources.

Experiments are currently coming from three sources:

  • Outreach tutorials and workshops sponsored by NSF and BBN
  • IT and research staff at the participating universities
  • GENI early operations trials

Rather than wait for all experiment requirements to be final, this document focuses on detailed requirements for running and managing multiple simultaneous slices for long-term experiments. These requirements are common for current suggested experiments from all three sources. We expect that much of what we do for this project will be reusable by other experiments, shortening the startup time for new experimenters. We also expect that campuses will actively manage their project resources, giving them more hands-on experience and a relatively low-risk way for campuses to try out production procedures for GENI experiments.

Results from this project will be presented to the GENI community at GEC 11 (Jul 26 - 29, 2011). We expect to complete the next meso-scale experiment project plan in May, covering follow-on experimenter requirements for GEC 12 (Nov 15 - 17, 2011) in more detail.

This document is a draft, and proposes schedules and activities not yet approved by participants. Please provide feedback and changes by April 20th, 2011.

2. Purpose and scope

During Spiral 3, campuses have been expanding and updating their OpenFlow deployments. All campuses have agreed to run a GENI AM API compliant aggregate manager, and to support at least four multi-site GENI experiments, by GEC 12. This lays the foundation for GENI to continuously support and manage multiple simultaneous slices that contain resources from GENI AM API compliant aggregates with multiple cross-country layer 2 data paths. (The GENI API page has more information about what it means to be a "GENI AM API compliant aggregate.") This campus infrastructure can support the transition from building GENI to using GENI continuously in the meso-scale infrastructure. Longer-term, it can also support the transition to at-scale production use in 2012, as originally proposed by each campus.
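
To make "GENI AM API compliant" concrete, a compliant aggregate manager exposes an XML-RPC interface over SSL, authenticated with the experimenter's GENI certificate, and answers calls such as GetVersion. Here is a minimal sketch (in Python); the aggregate URL and credential file names are placeholders:

    # Minimal sketch: probe a GENI AM API compliant aggregate with GetVersion.
    # The AM URL and credential file names are placeholders.
    import ssl
    import xmlrpc.client

    AM_URL = "https://am.example.edu:12346"    # hypothetical aggregate manager
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile="experimenter-cert.pem",  # GENI user certificate
                        keyfile="experimenter-key.pem")
    # A self-signed AM certificate may also need ctx.load_verify_locations(...).

    am = xmlrpc.client.ServerProxy(AM_URL, context=ctx)
    version = am.GetVersion()                  # call defined by the GENI AM API
    print("AM speaks GENI AM API version:", version.get("geni_api"))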

This project will investigate technical issues associated with managing multiple slices for long-term experiments, and will also try out early operations procedures for supporting those experiments. We expect to have early results from both efforts to share at GEC 11, and to use the procedures, software, and tools from this project to support more live experiments from new GENI researchers at GEC 12.

2.1. Experimenter requirements

Requirements from GENI experimenters are a major input to the general technical requirements that drive the overall scope for meso-scale experiments. In this project, we intend to cover these requirements:

  • Include resources from multiple interoperating aggregates in a slice.
  • Allow experimenters to control end-to-end data planes supporting a slice.
  • Allow multiple experiment team members to share resources in a slice.
  • Operate experiments continuously for three or more months.
  • Operate many experiments simultaneously.
  • Support non-IP protocol experiments. (A sketch of what this can look like on the wire follows this list.)
  • Support comparison of IP and non-IP path results within an experiment.
  • Support video and audio applications.
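
To illustrate the non-IP requirement: because slices ride on layer 2 OpenFlow data paths, an experiment can exchange Ethernet frames that carry no IP header at all. A minimal sketch, assuming a Linux host, root privileges, and an interface named eth0 (all placeholder assumptions):

    # Sketch: send a raw non-IP Ethernet frame on a slice's layer 2 data path.
    # Requires root (CAP_NET_RAW) on Linux; the interface name is a placeholder.
    import socket
    import struct

    ETH_P_EXPERIMENTAL = 0x88B5    # IEEE 802 "local experimental" ethertype
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                      socket.htons(ETH_P_EXPERIMENTAL))
    s.bind(("eth0", 0))                        # hypothetical slice interface
    src = s.getsockname()[4]                   # this interface's MAC address
    dst = bytes.fromhex("ffffffffffff")        # broadcast, for simplicity
    frame = dst + src + struct.pack("!H", ETH_P_EXPERIMENTAL) + b"non-IP payload"
    s.send(frame)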

We intend to address these requirements in the next meso-scale experiment plan:

  • Support control and data connections from mobile endpoints.
  • Support use of cloud services with GENI slices.
  • Support up to 1000 hosts in an experiment.
  • Support BGPMux in an experiment.
  • Support experimenter-controlled load sharing.
  • Support bandwidth enforcement (QoS-like functions).

Some of these requirements may not be feasible to meet until after 2012.

In addition to the campuses and OpenFlow software support teams, this project will include the GMOC and NCSA, who are implementing national GENI operations procedures and tools to support campuses where needed. For example, campuses can use the GMOC Emergency Stop procedure to request an experiment shut-down on infrastructure the campus does not directly control. This approach will improve both technical and operations support for GENI in parallel.

General operations requirements for the meso-scale experiments include the following requirements from campuses, operators, and experimenters:

  • Ensure that GENI experiments are compatible with campus security policies.
  • Allow campuses to manage access to the resources they contribute to GENI.
  • Ensure that the level of effort required from campus IT staff to manage GENI experiments is reasonable, compared to similar research programs.
  • Make experiments easier to set up and take down than they were at GEC 9.
  • Make experiments easier to change (add/delete resources, change data paths) than they were at GEC 9.
  • Try out operations procedures and policies that can transition to production at-scale use.

This project covers all of these operations requirements, but we expect that each will also need further work in the next meso-scale experiment plan, and some may not be feasible until after 2012.

The rest of this document covers only plans for this project; other projects in the overall Meso-Scale Campus Experiments plan will be documented separately, beginning with the GEC 12 research experiments plan in May.

2.2. Goals

This project aims to accomplish the following technical and production goals:

  • Verify that multiple simultaneous slices and end-to-end data paths can operate within GENI over a long period of time (compared to previous GENI activities).
  • Verify that these slices can support technical requirements gathered from potential GENI experimenters. (To accommodate the proposed timeline, these slices will initially run artificial experiments that are meant to be representative of expected actual experiments. The artificial experiments will be replaced by real experiments as real experiments become ready.)
  • Determine whether the aggregates, tools, and procedures used in this project will be able to support the number of real slices and experiments expected in GENI through the end of 2011.
  • Identify and correct ways in which GENI is still difficult for real experimenters to use.
  • Gain experience operating GENI resources while GENI is still at a relatively small size.
  • Verify that campuses can support experiments with little (if any) assistance from Stanford and BBN.
  • Verify implementations of GMOC procedures, such as Emergency Stop.
  • Verify that campuses can support the aggregates, tools, and procedures needed to support experiments in keeping with their campus security policies and the GENI Aggregate Provider Agreement.
  • Gather feedback from campuses to GMOC, NCSA, and BBN on the state of operations support for campuses.

2.3. Resources

This project will use the following GENI resources:

  • The GENI network core, in Internet2 and NLR.
  • An OpenFlow network at each of eight campuses, consistent with the recommended campus topology, connected to the GENI network core.
  • A GENI AM API compliant MyPLC aggregate at each of eight campuses, each with two or more hosts.
  • A GENI AM API compliant Expedient aggregate at each of eight campuses, managing the OpenFlow network connecting the MyPLC hosts to the GENI network core.
  • Additional incidental resources to support monitoring and security.

2.4. Participants

Participants will include:

  • BBN, who will design and manage the project, gather and publish results, and provide GENI resources that slices will allocate and use.
  • Stanford, who will help design and manage the project, maintain the OpenFlow-related software used by the slices, and provide GENI resources that slices will allocate and use.
  • Six additional campuses, who will provide GENI compute and network resources that slices will allocate and use. (Clemson, Georgia Tech, Indiana, Rutgers, Washington, and Wisconsin)
  • Two backbone providers, who will provide GENI network resources that slices will allocate and use. (Internet2 and NLR)
  • The GENI Meta-Operations Center (GMOC) team at the Global Research Network Operations Center (GRNOC), who will implement and test nationwide operations procedures and tools to support campuses.
  • The National Center for Supercomputing Applications (NCSA), who will support GMOC and the campuses for security-related analyses and procedures.

Additional participants may be added as interest and availability permit.

3. Information sharing

Information about the project will be shared via the GENI wiki, various software repositories, and various mailing lists.

3.1. The GENI wiki

http://groups.geni.net/geni/wiki/PlasticSlices/ProjectPlan is the canonical location of this project plan. The page http://groups.geni.net/geni/wiki/PlasticSlices contains links to additional pages relevant to this project.

3.2. Software repositories

The project will use the following software:

We won't necessarily know the version numbers of all of this software until we complete each baseline -- this is especially true for experimental tools and controllers -- but the versions of all software used to accomplish each baseline will be captured before the baseline is considered complete.

If new versions of these packages are released, this document may be updated to use them.

3.3. Mailing lists

The following mailing lists will be used to share information about the project:

3.4. Real-time chat

At times when real-time online communications would be valuable (such as when testing Emergency Stop), we'll use a channel on the BBN Jabber server (plastic-slices@conference.j.ir.bbn.com) to chat.

4. Description of work

This section contains a timeline of the work to be performed for the project, and more details about the various parts.

4.1. Timelines

These timelines describe when various project-related tasks will be completed. All dates are target dates for the completion of the work.

4.1.1. Campus resources

The campus participants will prepare the resources that the slices and experiments will use. (The first two items have already been done, but are included here for completeness.)

2011-04-05 BBN and Stanford agree on a campus OpenFlow topology to recommend to campus participants.
2011-04-06 BBN and Stanford communicate to campuses our desires for this topology and some resources.
2011-04-25 Three campuses are ready with the recommended topology and resources. (BBN, Stanford, and Clemson)
2011-05-02 Six campuses are ready with the recommended topology and resources. (add Rutgers, Wisconsin, and Washington)
2011-05-09 All campuses are ready with the recommended topology and resources. (add Georgia Tech and Indiana)
2011-05-09 BBN and Stanford test all campus topologies and resources.

4.1.2. Backbone resources

The backbone participants will also prepare inter-campus resources that the slices and experiments will use.

2011-04-25 BBN and GMOC engineer and implement all appropriate VLANs in NLR and I2, including peering between the two providers.
2011-04-25 I2 and NLR are ready to provide and support OpenFlow resources.

4.1.3. Slices

Over the course of the project, we'll create slices in which to run the various experiments.

2011-04-25 BBN creates ten initial slices, including resources at three campuses, running the simple Hello GENI experiment.
2011-05-02 BBN modifies the initial ten slices to include subsets of six campuses and to run different artificial experiments.
2011-05-09 BBN finalizes the initial slice configuration (all eight campuses, running ten artificial experiments).
2011-06-06 BBN replaces some artificial experiments with real experiments (pending experimenter availability), and revises the slice/campus/experiment configuration as needed.
2011-06-27 BBN replaces more artificial experiments with real experiments (pending experimenter availability), and revises the slice/campus/experiment configuration as needed.

4.1.4. Baselines

We'll use those slices to run some experiments, and establish a series of baselines to demonstrate progress throughout the project.

2011-05-16 Baseline 1 - Ten slices, each moving at least 1 GB of data per day, for 24 hours.
2011-05-23 Baseline 2 - Ten slices, each moving at least 1 GB of data per day, for 72 hours.
2011-05-31 Baseline 3 - Ten slices, each moving at least 1 GB of data per day, for 144 hours.
2011-06-03 Baseline 4 - Ten slices, each moving at least 1 Mb/s continuously, for 24 hours.
2011-06-07 Baseline 5 - Ten slices, each moving at least 10 Mb/s continuously, for 24 hours.
2011-06-13 Baseline 6 - Ten slices, each moving at least 10 Mb/s continuously, for 144 hours.
2011-06-20 Baseline 7 - Perform an Emergency Stop test while running ten slices, each moving at least 10 Mb/s continuously, for 144 hours.
2011-06-20 Baseline 8 - Create one slice per second for 1000 seconds; then create and delete one slice per second for 24 hours.
2011-06-27 BBN defines additional July baselines.
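
For reference, the totals implied by the continuous-rate baselines work out as follows (decimal units; a rough calculation, since actual totals depend on how steadily each experiment sends):

    # Rough totals implied by the continuous-rate baselines (decimal units).
    def total_gb(rate_mbps, hours):
        """Gigabytes moved at a sustained rate_mbps megabits per second."""
        bits = rate_mbps * 1e6 * hours * 3600
        return bits / 8 / 1e9

    for name, rate, hours in [("Baseline 4", 1, 24),
                              ("Baseline 5", 10, 24),
                              ("Baseline 6", 10, 144)]:
        print(name, round(total_gb(rate, hours), 1), "GB")
    # Baseline 4: 10.8 GB; Baseline 5: 108.0 GB; Baseline 6: 648.0 GB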

4.1.5. Monitoring and reporting

Throughout the project, we'll monitor the slices and resources, and report on the results.

2011-04-18 GMOC and BBN complete and test the monitoring data API.
2011-04-25 GMOC is ready to receive monitoring data from campuses.
2011-05-02 BBN is automatically submitting monitoring data from MyPLC and OpenFlow aggregates to GMOC.
2011-05-02 BBN is able to obtain monitoring data from GMOC.
2011-05-02 GMOC completes emergency contact database.
2011-05-09 All campuses are automatically submitting monitoring data from MyPLC and OpenFlow aggregates to GMOC.
2011-05-09 NLR and I2 are automatically submitting monitoring data from OpenFlow aggregates to GMOC.
2011-05-09 BBN publishes interim results to egeni-trials@lists.stanford.edu and the GENI wiki.
2011-06-06 BBN publishes interim results to egeni-trials@lists.stanford.edu and the GENI wiki.
2011-06-27 BBN publishes interim results to egeni-trials@lists.stanford.edu and the GENI wiki.
2011-07-26 BBN presents final results at GEC 11, and publishes them to the GENI wiki.

4.2. Participant roles

These sections describe the work that each of the participants will perform, broken out by participant. Some of the work is ongoing throughout the project, and thus isn't tied to a specific point in the timeline.

4.2.1. Campuses

All eight campuses will provide and support some GENI resources:

  • Configure a local OpenFlow network consistent with the recommended campus topology, connected to the GENI network core.
  • Operate a GENI AM API compliant MyPLC aggregate with two or more hosts.
  • Provide a GENI AM API compliant Expedient aggregate managing the OpenFlow network connecting the MyPLC hosts to the GENI network core.
  • Create a GENI aggregate information page for each of those aggregates.
  • Submit monitoring data for each of those aggregates using the API the GMOC develops, as it becomes available.
  • For each of those aggregates, agree to the GENI Aggregate Provider Agreement. (Or discuss with BBN if this is problematic.)
  • Provide best-effort support to keep these resources up and running and available.
  • Test operations and security procedures for supporting these resources and responding to outages on a best-effort basis.

4.2.2. Internet2 and NLR

Internet2 and NLR will provide and support some GENI resources:

  • Provide an Expedient aggregate managing the OpenFlow resources in their respective portions of the GENI network core.
  • Create a GENI aggregate information page for each aggregate.
  • Submit monitoring data for each aggregate using the API the GMOC develops, as it becomes available.
  • Agree to the GENI Aggregate Provider Agreement. (Or discuss with BBN if this is problematic.)
  • Provide best-effort support to keep these resources up and running and available.
  • Test operations and security procedures for supporting these resources and responding to outages on a best-effort basis.

4.2.3. BBN

In addition to the items in section 4.2.1, BBN will:

  • Write and maintain this project plan, including communication with other participants about their requirements, tracking whether requirements are being met, etc.
  • Lead the design of the experiments to run in each of the slices, and ensure that they run continuously, with at least some detectable activity at least daily.
  • Obtain and review the monitoring data submitted by campuses to GMOC.
  • Support campuses in setting up and maintaining their OpenFlow resources.
  • Support campuses in setting up and maintaining their MyPLC resources.
  • Support campuses in setting up a feed of monitoring data to the GMOC.
  • Support experimenters in setting up and running their experiments, once we begin to replace the initial artificial experiments with real ones.
  • Provide and support a GENI AM API compliant ProtoGENI aggregate with four hosts.

4.2.4. Stanford

In addition to the items in section 4.2.1, Stanford will:

  • Support campuses in setting up and maintaining their OpenFlow resources.
  • Maintain the FlowVisor and Expedient software used by the slices.
  • Respond to software bug reports in a timely fashion (first response within three business days, resolution within ten business days if possible).

4.2.5. GMOC

GMOC will:

  • Design, document, and implement an API which campuses can use to submit monitoring data.
  • Provide and maintain a host which all participating campuses can use to submit monitoring data using that API.
  • Provide and maintain a host and API which BBN (and other participants) can use to obtain the monitoring data that campuses have submitted.
  • Build and populate a GENI operational contact database that includes all project participants.
  • Implement and test the Emergency Stop procedure with all project participants.
  • Respond to monitoring outages within one business day.
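
Because the submission API is still being designed (see section 4.1.5), the following is purely a hypothetical sketch of the kind of automated submission a campus might script once the real API is published; the URL, payload fields, and aggregate name are all invented:

    # Hypothetical sketch only: the real GMOC submission API is still being
    # designed, so the URL and payload format here are invented placeholders.
    import json
    import time
    import urllib.request

    GMOC_URL = "https://gmoc.example.net/api/measurements"   # placeholder URL
    payload = {
        "aggregate": "campus-myplc",             # hypothetical aggregate name
        "timestamp": int(time.time()),
        "metrics": {"cpu_pct": 12.5, "mem_pct": 40.2, "if_rx_bytes": 123456789},
    }
    req = urllib.request.Request(GMOC_URL,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.status)                       # expect a 2xx on success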

4.2.6. NCSA

NCSA will:

  • Review operations security plans for the project and recommend appropriate procedures to participants as necessary.
  • Carry out the GENI Legal, Law Enforcement and Regulatory Representative functions for the project as necessary.
  • Investigate and try to resolve any issues campuses raise with the GENI Aggregate Provider Agreement.

4.3. Baselines

These technical baselines will demonstrate progress throughout the project. As each baseline is completed, BBN will capture system configuration and software version information, and the results of the baseline, for use in the various interim and final reports. (See the GEC 9 Plenary Demo Snapshots page for examples.)

If necessary, BBN will propose changes to the baselines to all participants, who should approve or reject the changes within three business days.

Note that the 24-hour and 72-hour baselines may need to be repeated if early attempts have too many errors or failures. There isn't enough time in the schedule to repeat a 144-hour baseline unless it fails very early in the process.

The goal is to complete each baseline with few (if any) errors or failures. If a baseline has too many errors or failures, BBN will either propose changes to the schedule to allow it to be repeated, or propose that the baseline be abandoned and document why it could not be completed.

While completing Baselines 1 - 8, BBN will review the status of upcoming research experiments, and propose July baselines that help prepare the participants to support those experiments at GEC 12. We expect the July baselines will use a combination of real experiments and artificial ones.

4.3.1. Baseline 1

Cause the experiment running in each slice to move at least 1 GB of data over the course of a 24-hour period. Multiple slices should be moving data simultaneously, but the traffic can be slow or bursty, as long as each slice reaches 1 GB total over the course of the day.

The purpose of this baseline is to confirm basic functionality of the experiments, and stability of the aggregates.
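
One way an artificial experiment could meet this target is sketched below (in Python); the receiver host, port, and burst schedule are placeholder choices, and any traffic source that totals 1 GB per day would do:

    # Sketch: a bursty sender that totals roughly 1 GB per day, assuming a
    # receiver such as `nc -lk 9000 > /dev/null` on a slice host (placeholder).
    import socket
    import time

    RECEIVER = ("slice-host.example.edu", 9000)  # hypothetical slice endpoint
    CHUNK = b"\0" * (1024 * 1024)                # send 1 MiB at a time
    BURST_MB = 100                               # ~100 MiB per burst
    BURSTS_PER_DAY = 10                          # ten bursts -> ~1 GB total

    for _ in range(BURSTS_PER_DAY):
        with socket.create_connection(RECEIVER) as s:
            for _ in range(BURST_MB):
                s.sendall(CHUNK)
        time.sleep(24 * 3600 / BURSTS_PER_DAY)   # spread bursts over the day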

4.3.2. Baseline 2

Similar to the previous baseline, cause the experiment running in each slice to move at least 1 GB of data per day, but do so repeatedly for 72 hours.

The purpose of this baseline is to confirm longer-term stability of the aggregates.

4.3.3. Baseline 3

Similar to the previous baseline, cause the experiment running in each slice to move at least 1 GB of data per day, but do so repeatedly for 144 hours.

The purpose of this baseline is to confirm even longer-term stability of the aggregates.

4.3.4. Baseline 4

Cause the experiment running in each slice to move at least 1 Mb/s continuously over the course of a 24-hour period (approximately 10 GB total).

The purpose of this baseline is to confirm that an experiment can send data continuously without interruption.
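
A continuous rate is a different traffic shape from Baseline 1's bursts; here is a minimal sketch of pacing a sender at roughly 1 Mb/s, with the same placeholder endpoint as in the Baseline 1 sketch:

    # Sketch: pace a sender at roughly 1 Mb/s for 24 hours. Each 12,500-byte
    # chunk is 100,000 bits, i.e. a 0.1-second budget at 1 Mb/s.
    import socket
    import time

    RECEIVER = ("slice-host.example.edu", 9000)  # hypothetical slice endpoint
    CHUNK = b"\0" * 12500                        # 100,000 bits per chunk

    end = time.monotonic() + 24 * 3600
    with socket.create_connection(RECEIVER) as s:
        while time.monotonic() < end:
            start = time.monotonic()
            s.sendall(CHUNK)
            # Sleep off whatever is left of this chunk's 0.1 s budget.
            time.sleep(max(0.0, 0.1 - (time.monotonic() - start)))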

4.3.5. Baseline 5

Similar to the previous baseline, cause the experiment running in each slice to move at least 10 Mb/s continuously over the course of a 24-hour period (approximately 100 GB total).

The purpose of this baseline is to confirm that an experiment can send a higher volume of data continuously without interruption.

4.3.6. Baseline 6

Similar to the previous baseline, cause the experiment running in each slice to move at least 10 Mb/s continuously over the course of a 144-hour period (approximately 650 GB total).

The purpose of this baseline is to confirm that an experiment can send a higher volume of data continuously without interruption, for several days running.

4.3.7. Baseline 7

Repeat the previous baseline, but call an Emergency Stop while it's running, once per slice for each of the ten slices. Campuses will not be informed in advance about when each Emergency Stop will be called. There will be at least one instance of two simultaneous Emergency Stops, and at least one instance of a single campus being asked to respond to two simultaneous Emergency Stops. After each Emergency Stop, verify that all resources have been successfully restored to service.

GMOC will define precisely how each Emergency Stop test will be conducted, and what resources will be stopped, presumably selecting a combination of campus resources (e.g. disconnecting an on-campus network connection) and backbone resources (e.g. disabling a campus's connection to an inter-campus VLAN).

The purpose of this baseline is to test Emergency Stop procedures.

4.3.8. Baseline 8

This baseline uses new temporary slices, rather than the existing ten: create a thousand slices at the rate of one slice per second, and then continue to delete and create one slice per second for 24 hours. Each slice will include resources at three campuses, selected randomly for each slice. Automated tools will confirm that the resources are available, e.g. by logging in to a host and running 'uname -a'.

The purpose of this baseline is to confirm that many users can create slices and allocate resources at the same time.
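
A hypothetical sketch of that churn loop follows; create_slice and delete_slice are stand-ins for whatever GENI AM API client tooling the project settles on, and the host-naming scheme in the check is invented:

    # Hypothetical sketch of the Baseline 8 churn loop. The slice-management
    # calls are placeholders to be wrapped around a real AM API client tool.
    import subprocess
    import time

    def create_slice(name):
        print("create", name)     # placeholder: call your AM API client here

    def delete_slice(name):
        print("delete", name)     # placeholder: call your AM API client here

    def verify(host):
        # Log in and run 'uname -a', as described above.
        r = subprocess.run(["ssh", host, "uname", "-a"], capture_output=True)
        return r.returncode == 0

    names = ["churn%04d" % i for i in range(1000)]
    for name in names:                   # phase 1: one new slice per second
        create_slice(name)
        verify(name + ".example.net")    # hypothetical host-naming scheme
        time.sleep(1)

    end = time.time() + 24 * 3600
    i = 0
    while time.time() < end:             # phase 2: delete and recreate,
        delete_slice(names[i % 1000])    # one slice per second, for 24 hours
        create_slice(names[i % 1000])
        i += 1
        time.sleep(1)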

5. Reporting

BBN will produce three interim reports on the status of the project, and a final presentation at GEC 11 (a poster and/or a talk). Each report will include details of what all the resources and experiments were, and results including:

  • CPU usage on hosts and switches
  • memory usage on hosts
  • network usage on hosts and slivers
  • anything that caused anything to break or crash
  • any actual research results
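
As a sketch of how host-side numbers like these might be sampled on the Linux MyPLC hosts (not the project's actual monitoring code), reading the documented /proc interfaces:

    # Sketch: sample CPU idle time, memory use, and interface byte counters
    # from /proc on a Linux host. The interface name is a placeholder.
    def cpu_jiffies():
        with open("/proc/stat") as f:
            vals = [int(v) for v in f.readline().split()[1:]]
        return vals[3], sum(vals)              # (idle jiffies, total jiffies)

    def mem_used_pct():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.split()[0])    # values are in kB
        return 100.0 * (1 - info["MemFree"] / info["MemTotal"])

    def iface_bytes(dev="eth0"):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(dev + ":"):
                    cols = line.split(":")[1].split()
                    return int(cols[0]), int(cols[8])   # (rx_bytes, tx_bytes)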

All reports will also be published to the GENI wiki.

The timeline in section 4.1.5 has target dates for these reports.

6. Quality assurance

To validate that these results are repeatable and well documented, Georgia Tech will reproduce baselines 1, 4, and 7 after GEC 11, using only documentation from the project, and provide a brief report to the participants on their experience.

After BBN and other participants have reviewed Georgia Tech's report, and improved the documentation as needed, all other campuses will reproduce the same baselines and provide a similar report.

Separately, BBN and GMOC will work with campuses to confirm that they can provision and release inter-campus VLANs.

7. Risks

All participants must be available on the proposed timeline in order for the project to complete before GEC 11; this has not been verified. Availability around graduation and in the summer may be an issue for some participants.

BBN and GMOC staff must complete needed tools and automation in time to monitor baselines; this has not been formally verified, although both organizations have said in the past that they expect to do this at some point.

BBN and GMOC monitoring tools may be technically more difficult than anticipated; if tools are late, it may reduce the amount or nature of the information we can gather for each baseline.

Stanford software developers must fix any bugs that prevent baselines from being completed promptly; availability and priority for this timeline has not been formally verified.

Resource outages that are not addressed promptly (e.g. hardware failures, site construction, etc.) could cause baselines to slip.

Georgia Tech plans to use VMs for their MyPLC hosts, and we have not verified that the recommended campus OpenFlow topology will work with VMs; if it does not, Georgia Tech's ability to provide resources could be delayed.

8. Revision history

Last updated on 2011-04-27 (Wed) at 10:05 EDT

Click "Last Change" above for a full revision history