Title: A Control Plane for Experimenter Support in GENI

PIs: Tom Anderson and Arvind Krishnamurthy

Why is experimenter support important to GENI?

For GENI to be effective at enabling experimental network and distributed systems research, it must be accessible to the broadest set of researchers. We are particularly concerned about the lone PI at a lower-ranked research school with perhaps a single graduate student; will they have the resources to make productive use of GENI? Or will GENI be the sole playground of schools with large teams of researchers? If so, there will be a tremendous backlash against GENI and MREFCs in general.

Unfortunately, PlanetLab does not provide much helpful guidance here. While PlanetLab has a number of strengths, it is hard to use for the uninitiated. Users often find they need a local "expert guide" who has already tripped over the potholes of getting started. Writing and running a distributed "hello world" program should be a few lines of code, but instead it requires, among other things, clicking several hundred times on a web page to request sliver creation on each node, waiting for the system to install sliver access permissions on each node, figuring out how to install your own software execution environment on each of those nodes, building scripts to monitor node and process failures, and building a data distribution system to pipe logs back to the developer's workstation. This lack of effective tools places a high barrier to occasional use of PlanetLab, and addressing it in GENI is a high priority for the research community.

We seek to build a control plane toolkit that will reduce this startup cost to less than a day. Key steps will be to develop abstractions for job control and failure recovery, and to make them available through a shell, a programmable API, and a scripting language interface to the API. These interfaces target different types of users: new users, the existing PlanetLab user community, and experienced developers maintaining long-running services and desiring greater control. The control plane will smooth the development process by which new ideas go from initial experimentation to eventual deployment, by providing an abstraction layer that allows experimenters to migrate from locally available clusters to planetary-scale testbeds.

We have the following schedule for the deliverables:

6 Months (Feb '07): Develop and implement an API for experiment instantiation and system-wide job control using parallel exec. Develop a shell program and an interactive GUI that allow the user to interactively invoke the tasks provided by the API. At the end of the six-month period, demonstrate that a new user can quickly and easily deploy experiments on Emulab and PlanetLab using the shell.

12 Months (Aug '07): Develop support for using the API from scripting languages such as Python and Perl. Develop more advanced support for parallel process management, including suspend/resume/kill and automated I/O management. Also develop support for issuing system-wide commands in asynchronous mode, querying the status of previously issued asynchronous operations, and synchronizing subsequent operations with those initiated earlier. Provide support for simple forms of node selection, e.g., based on processor load or memory availability. Make all of the developed components available on both the Emulab and PlanetLab platforms. The goal is to provide the abstraction of "run this experiment on some set of nodes."
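To illustrate how this "run this experiment on some set of nodes" abstraction might look from the scripting interface, the following is a minimal Python sketch, assuming password-less ssh access to the nodes; the node names and the run_on_node()/run_on_all() helpers are hypothetical illustrations, not part of any existing toolkit API.

    # Hypothetical sketch of parallel exec across an experiment's nodes.
    # Assumes password-less ssh to each node; names below are illustrative.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["node1.emulab.net", "node2.emulab.net", "node3.planet-lab.org"]

    def run_on_node(node, command):
        """Run one command on one node over ssh and capture its output."""
        result = subprocess.run(
            ["ssh", node, command],
            capture_output=True, text=True, timeout=60,
        )
        return node, result.returncode, result.stdout

    def run_on_all(nodes, command):
        """Parallel exec: issue the same command to every node concurrently."""
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            return list(pool.map(lambda n: run_on_node(n, command), nodes))

    if __name__ == "__main__":
        # Distributed "hello world": report hostname and load from each node.
        for node, status, output in run_on_all(NODES, "hostname && uptime"):
            print(node, "exit=%d" % status, output.strip())

In the envisioned toolkit, the same operation would also be available interactively from the shell and GUI, so that a new user need not write even this much code.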
18 Months (Feb '08): Develop support for permanent services. Develop mechanisms for continuously monitoring program state and for reporting exceptions and faults to the user. Develop support for deploying experiments on heterogeneous platforms such as VINI and GENI; this would include support for specifying network configuration, controlling (or injecting) network events, exporting information regarding network conditions, and providing finer control over the allocation of a diverse set of resources. The goal is to provide the abstractions of "keep this service running" for service developers (sketched below, after the schedule) and "run this experiment on a matching set of nodes/topologies" for network experimenters who require a specific workload.

24 Months (Aug '08): Interface the control plane to existing services and sensors. Integrate with CPU performance monitoring sensors such as Slicestat, CoTop, and Ganglia, and with iPlane, a network performance monitoring system that we are concurrently building. Also provide interfaces to different resource discovery and allocation mechanisms (such as SWORD, Bellagio, and SHARP) and to different content distribution systems (such as Bullet, BitTorrent, Coral, and CoDeeN), so that the user can switch among these services by changing an environment variable or a parameter in the API.

30 Months (Feb '09): Provide support for common design patterns that application developers use to recognize and overcome faults, such as transactional operations and process isolation. Develop mechanisms and tools for end hosts to subscribe to overlay services (VINI/GENI services), with support for interfacing at different levels of the protocol stack.

36 Months (Aug '09): Develop intrusive and non-intrusive techniques for monitoring program state and detecting abnormal behavior, along with debugging support such as single-stepping. Address scalability so that the control infrastructure can scale to hundreds or thousands of nodes without developing hotspots. Address network reliability by having the control plane use a resilient communication layer that routes control messages around network faults and hides transient connectivity problems.
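As a concrete illustration of the 18-month "keep this service running" abstraction, the following is a minimal Python sketch of one common supervision pattern: restart a long-running service whenever it fails and report each fault to the user. The keep_running() helper, its restart limit, and its fixed backoff are illustrative assumptions, not the toolkit's actual mechanism.

    # Hypothetical sketch of a "keep this service running" supervisor loop.
    # Restart policy and service command are illustrative assumptions.
    import logging
    import subprocess
    import time

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

    def keep_running(command, max_restarts=10, backoff=5):
        """Keep a service process alive, restarting it on abnormal exit."""
        for attempt in range(max_restarts + 1):
            logging.info("starting service (attempt %d): %s", attempt + 1, command)
            status = subprocess.Popen(command, shell=True).wait()  # run until exit
            if status == 0:
                logging.info("service exited cleanly; not restarting")
                return
            logging.warning("service failed with exit code %d", status)
            time.sleep(backoff)  # simple fixed backoff before the next restart
        logging.error("giving up after %d restarts", max_restarts)

    if __name__ == "__main__":
        keep_running("python3 my_service.py")  # hypothetical service command

In the full control plane, this loop would run per node, feed its fault reports into the continuous monitoring mechanisms described above, and expose the restart policy through the API rather than hard-coded parameters.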