Title: A Control Plane for Experimenter Support in GENI
PIs: Tom Anderson and Arvind Krishnamurthy

Why is experimenter support important to GENI? For GENI to be
effective at enabling experimental network and distributed systems
research, it must be accessible to the broadest set of researchers.
We are particularly concerned for the lone PI at a lower-ranked
research school with perhaps a single graduate student: will they have
the resources to make productive use of GENI? Or will GENI be the sole
playground of schools with large teams of researchers? If so, there
will be a tremendous backlash against GENI and MREFCs in general.

Unfortunately, PlanetLab does not provide much helpful guidance.
While PlanetLab has a number of strengths, it is hard for the
uninitiated to use. Users often find they need a local "expert guide"
who has already tripped over the potholes in getting started. Writing
and running a distributed "hello world" program should take a few
lines of code, but instead requires, among other things, clicking
several hundred times on a web page to request sliver creation on each
node, waiting for the system to install sliver access permissions on
each node, figuring out how to install your own software execution
environment on each of those nodes, building scripts to monitor node
and process failures, and building a data distribution system to pipe
logs back to the developer's workstation. This lack of effective
tools sets a high bar for occasional use of PlanetLab, and addressing
it in GENI is a high priority for the research community.

We seek to build a control plane toolkit that will reduce this startup
cost to less than a day. Key steps will be to develop abstractions
for job control and failure recovery and to make them available
through a shell, a programmable API, and a scripting language
interface to the API. These interfaces will target different types of
users: new users, existing PlanetLab users, and experienced developers
who maintain long-running services and want greater control. The
control plane will smooth the development process by which new ideas
go from initial experimentation to eventual deployment by providing an
abstraction layer that allows experimenters to migrate from locally
available clusters to planetary-scale testbeds.

We have the following schedule for the deliverables:

6 Months (Feb '07): Develop and implement an API for experiment
instantiation and system-wide job control using parallel exec.
Develop a shell program and an interactive GUI that allow the user to
interactively invoke the tasks provided by the API. At the end of the
six-month period, demonstrate that a new user can quickly and easily
deploy experiments on Emulab and PlanetLab using the shell.
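
As a rough illustration of what "system-wide job control using
parallel exec" could look like, the sketch below runs one command on
many nodes concurrently and collects per-node results. The names
parallel_exec and ssh_exec are hypothetical and are not the proposed
API; the default runner assumes passwordless ssh access to each node.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def ssh_exec(node, command):
    """Run `command` on `node` over ssh; return (exit_code, stdout).

    Assumes passwordless ssh access to the node (an assumption of this
    sketch, not something the proposal specifies).
    """
    proc = subprocess.run(["ssh", node, command],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout


def parallel_exec(nodes, command, runner=ssh_exec, max_workers=32):
    """Run `command` on every node concurrently.

    Returns a dict mapping node -> (exit_code, output). A node whose
    runner raises is reported with exit code None and the error text,
    so one unreachable node does not abort the whole job.
    """
    nodes = list(nodes)

    def run_one(node):
        try:
            return runner(node, command)
        except Exception as exc:
            return None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, nodes))
    return dict(zip(nodes, results))
```

Passing a different runner (for example, one that talks to a testbed's
own node agent) leaves the calling code unchanged, which is the kind of
portability between Emulab and PlanetLab that this milestone aims for.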

12 Months (Aug '07): Develop support for using the API with scripting
languages such as Python and Perl. Develop more advanced support for
parallel process management, including suspend/resume/kill and
automated I/O management. Also develop support for issuing
system-wide commands in asynchronous mode, querying the status of
previously issued asynchronous operations, and synchronizing
subsequent operations with those initiated earlier. Provide support
for simple forms of node selection, e.g., based on processor load or
memory availability. Make all of the developed components available
on both the Emulab and PlanetLab platforms. The goal is to provide
the abstraction of "run this experiment on some set of nodes."
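
The asynchronous mode and simple node selection described above might
be exposed to a scripting language roughly as follows. This is a
minimal sketch under stated assumptions: select_nodes and
AsyncController are hypothetical names, the thresholds are arbitrary,
and how load and memory figures are collected is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def select_nodes(node_stats, max_load=1.0, min_free_mem_mb=256):
    """Return names of nodes whose load and free memory meet the limits.

    `node_stats` maps node name -> {"load": float, "free_mem_mb": int}.
    """
    return sorted(
        name for name, s in node_stats.items()
        if s["load"] <= max_load and s["free_mem_mb"] >= min_free_mem_mb
    )


class AsyncController:
    """Issue per-node operations without blocking, then query or sync."""

    def __init__(self, max_workers=16):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._ops = {}        # operation id -> list of futures
        self._next_id = 0

    def issue(self, nodes, func):
        """Start func(node) on every node; return an operation id at once."""
        op_id = self._next_id
        self._next_id += 1
        self._ops[op_id] = [self._pool.submit(func, n) for n in nodes]
        return op_id

    def status(self, op_id):
        """Return (finished, total) for a previously issued operation."""
        futures = self._ops[op_id]
        return sum(f.done() for f in futures), len(futures)

    def sync(self, op_id, timeout=None):
        """Block until the operation completes; results are in node order."""
        return [f.result(timeout=timeout) for f in self._ops[op_id]]

    def close(self):
        """Wait for outstanding work and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A script would call issue() for a long-running step, poll status() while
doing other work, and call sync() before any step that depends on it.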

18 Months (Feb '08): Develop support for permanent services. Develop
mechanisms for continuous monitoring of program state and for
reporting exceptions and faults to the user. Develop support for
deploying experiments on heterogeneous platforms such as VINI and
GENI. This would include support for specifying network
configuration, controlling (or injecting) network events, exporting
information regarding network conditions, and providing more control
over resource allocation for a diversity of resources. The goal is to
provide the abstractions of "keep this service running" for service
developers and "run this experiment on a matching set of
nodes/topologies" for network experimenters desiring a specific
workload.
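
One simple reading of "keep this service running" is a supervisor loop
that restarts a failed service with increasing delays. The sketch
below illustrates that idea only; keep_running, its parameters, and the
exponential backoff policy are assumptions, not part of the proposal.

```python
import time


def keep_running(start, max_restarts=5, backoff_s=1.0, sleep=time.sleep):
    """Run `start()` until it exits cleanly or the restart budget is spent.

    `start` launches the service, blocks until it stops, and returns its
    exit code (0 means a clean shutdown). Returns the number of restarts
    performed. Raises RuntimeError if the service keeps failing.
    """
    restarts = 0
    while True:
        try:
            code = start()
        except Exception:
            code = -1     # treat a crash in the launcher as a failure
        if code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"service failed after {restarts} restarts")
        sleep(backoff_s * (2 ** restarts))   # exponential backoff
        restarts += 1
```

In a real deployment, start would wrap a remote process launch, and the
failure branch is where exceptions and faults would be reported to the
user.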

24 Months (Aug '08): Interface the control plane to existing services
and sensors. Integrate with CPU performance monitoring sensors such
as Slicestat, CoTop, and Ganglia, and also integrate with iPlane, a
network performance monitoring system that we are concurrently
building. Also provide interfaces to different resource discovery and
allocation mechanisms (such as SWORD, Bellagio, and SHARP) and to
different content distribution systems (such as Bullet, BitTorrent,
Coral, and CoDeeN), so that the user can just change an environment
variable or a parameter in the API to use these services.
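
To make the "change an environment variable or a parameter" idea
concrete, here is a small sketch of backend selection. Everything in
it is hypothetical: the variable name CP_CONTENT_BACKEND, the register
decorator, and the two placeholder backends, which only describe what a
real adapter would do rather than calling any actual service.

```python
import os

_BACKENDS = {}   # backend name -> function(url, node)


def register(name):
    """Decorator that adds a content distribution backend to the table."""
    def wrap(func):
        _BACKENDS[name] = func
        return func
    return wrap


@register("direct")
def fetch_direct(url, node):
    # Placeholder: a real adapter would copy the file from the origin.
    return f"{node}: fetch {url} from origin"


@register("coral")
def fetch_coral(url, node):
    # Placeholder: a real adapter would go through the Coral service.
    return f"{node}: fetch {url} through a Coral proxy"


def distribute(url, nodes, backend=None):
    """Fetch `url` on each node with the selected backend.

    An explicit `backend` argument wins; otherwise the environment
    variable CP_CONTENT_BACKEND is used, defaulting to "direct".
    """
    name = backend or os.environ.get("CP_CONTENT_BACKEND", "direct")
    if name not in _BACKENDS:
        raise ValueError(f"unknown content backend: {name}")
    fetch = _BACKENDS[name]
    return [fetch(url, node) for node in nodes]
```

With this shape, experiment scripts stay the same when the user
switches distribution systems; only the variable or the argument
changes.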

30 Months (Feb '09): Provide support for common design patterns that
application developers use for recognizing and overcoming faults, such
as using transactional operations and process isolation. Develop
mechanisms and tools for end-hosts to subscribe to overlay services
(VINI/GENI services), with support for interfacing at different levels
of the protocol stack.

36 Months (Aug '09): Develop intrusive and non-intrusive techniques
for monitoring program state and detecting abnormal behavior, along
with debugging support such as single-stepping. Address scalability
issues so that the control infrastructure can scale to hundreds or
thousands of nodes without developing hotspots. Address network
reliability issues by having the control plane use a resilient
communication layer that routes control messages around network faults
and hides transient connectivity problems.
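
The resilient communication layer could, in its simplest form, try the
direct path first and then detour through intermediate nodes. The
sketch below shows only that one-hop detour idea; resilient_send and
the send(path, message) transport are hypothetical, and a real layer
would also need path probing and retry policies.

```python
def resilient_send(dest, message, send, relays=()):
    """Deliver `message` to `dest`, detouring through a relay if needed.

    `send(path, message)` is a caller-supplied transport that forwards
    the message along `path` (a list of hops ending at dest) and raises
    ConnectionError on failure. Returns the path that worked.
    """
    # Direct path first, then one-hop detours in the order given.
    candidates = [[dest]] + [[r, dest] for r in relays if r != dest]
    last_error = None
    for path in candidates:
        try:
            send(path, message)
            return path
        except ConnectionError as exc:
            last_error = exc     # remember the failure, try the next path
    raise ConnectionError(f"no working path to {dest}") from last_error
```

Because the caller only sees the returned path or a final error,
transient failures on the direct route stay hidden from the experiment
code.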