Version 1 (modified by Christopher Small, 10 years ago)

--

Cluster C Review

July 8, 2009

Attendees

  • Chip Elliot
  • Aaron Falk
  • Yan Luo
  • Larry Lannom
  • Paul Barford
  • Evan Zhang
  • Heidi Picher Dempsey
  • Christopher Small (GPO)
  • Jim Griffioen
  • Justin Cappos
  • Rob Ricci
  • Vic Thomas
  • Nick Feamster

Intro (Aaron and Chip)

Goals for today's meeting.

Paul Barford This is a woefully underfunded project and has amazing promise. What about some of the $3B stimulus money?

ProtoGENI (Rob Ricci)

Atlanta is the next most important PoP for Rob.

Implementations for CH, CM, Slice Authority (all are fairly functional); RSpec for creating slices with topology (VLANs within a site, tunnels between sites). Functional federation; can get slices that span seven sites, with complex topology. Authentication and authorization system. Backbone deployment progressing.

Have worked with about half of all Spiral 1 projects.

  • Some important pieces not in the control framework, e.g. cross-CM coordination, relinquish tickets.
  • Lots of projects to coordinate with, not all integrations are as tight as they would like.
  • Lots of bureaucratic hassles; University of Utah not used to this kind of contract.

The University has gotten very interested in the fact that they are installing a separate (not controlled by university IT) backbone link; concerned about security, privacy, and liability issues. Rob has pointed them at Larry Peterson's document on doing this. GPO is interested in their help in getting recommendations on how to do this.

Simple clearinghouse up and running. Several registries, trust anchor for federation. Seven federates: Utah x 2, Kentucky, Wisc, UML, BBN, CMU.

Backbone will be all layer 2 or below, VLAN based. Have been working with Internet2, HP, IU, GPO to make this happen. Will provide transit for SPP. Campus/regional out connections soon at Utah, Wisc, MAX, GpENI. Early exploration with UKY, GATech, ISI. Talking with Internet2 about DCN.

At Univ Utah talking with UEN (Utah Educational Network); UEN is building a metro fiber ring around SLC, hits campus and the Level 3 PoP. Plan is to get 10Gbps wave on this fiber.

Paul Barford Our CIO (Ron Kramer) represents WiscNet (regional entity in Wisconsin). Wisc runs BOREAS-Net (Broadband Optical Research, Education and Sciences Network) that spans three states, etc. He is very interested in GENI; we could relatively easily put together a consortium of CIOs to put together regional deployment -- not just Internet2 and NLR -- get a GENI wave across all of the regional deployments. It will be work, but there are resources available and the will to do it with this group, and GENI could dramatically expand the footprint with little resources.

Chip Elliot Maybe early fall we should put together a meeting.

Heidi Picher Dempsey There is a working group in Quilt for GENI that is already thinking about some of these issues.

Aaron Falk Good insight that campus CIOs are influential in the regions; we should pursue this.

I'm the primary PoC for all of the inter-project connections.

Talking with Internet2 to get tail circuits (using DCN) to other campuses. Internet2 views DCN connections as ephemeral (policy, not technical), which may not be a good match for GENI. We might need to convince them to let us have long-lived DCN connections.

Chip Elliot How has working with Internet2 been?

Justin Cappos UW has been connected since 1999. We aren't thinking about layer 2 VLANs.

Chip Elliot We'd love to have your campus GENI-enabled. Same with UML. We'll gladly help.

Aaron Falk GENI's utility goes up as the number of connected users goes up; you can be advisors and advocates to campuses.

Larry Lannom What is the simplest test for a campus being "GENI enabled"?

Chip Elliot If it can be controlled via ProtoGENI, it's GENI enabled. It'd be great if you stood up a small ProtoGENI or Emulab.

Rob Ricci We have shipped nodes to some of the other campuses in this cluster, and they are controlled from Utah, with little or no local administration.

Larry Lannom We're not on MAX or Internet2, we could be, given funds.

Paul Barford One of the success stories of PlanetLab was their very simple method of joining PlanetLab, simple setup, a couple of machines on the internet -- if you give us two machines at your site, you can be part of this great infrastructure.

Justin Cappos This is part of our story -- you download this software, you're part of Million-node GENI.

Rob Ricci I have some money to buy some machines -- there are some at Wisc -- and the reason we haven't gone further is that coordination on the sending side takes a lot of effort. But we could do more of it.

Chip Elliot Let's work together on this.

If DCN works out, all you have to do is turn it on.

Heidi Picher Dempsey From what I've heard the software isn't really there yet.

Jim Griffioen The reason that it might be attractive to us is that Internet2 comes to the regional, but not all the way to us. DCN may be the easiest way for us to get a VLAN.

Paul Barford We're planning on doing a lot of things with VLANs, but VLANs are not a panacea. All kinds of weird things happen when you start multiplexing over VLANs, stuff you can't imagine. The most attractive option for me for GENI would be to have separate lambdas for everyone. It's never been clear in my mind where this is all going to go.

Chip Elliot The GPO view is to go to VLANs on Spiral 1, but it's not obvious that that's what we want to be using in Spiral 3.

Heidi Picher Dempsey It's amazing how much push-back we're getting trying to get people to use anything but IP!

Paul Barford I'd really like to have vendors here at the table telling us what is available, what their stuff does, so we can explore more options.

Chip Elliot That's a good idea, we're heading there I think.

ProtoGENI is (theoretically) available to a few thousand experimenters, but they don't know it. I've been holding off on recruiting users waiting to hear on our round 2 proposal, largely for interface issues; GENI is less than what Emulab users are used to.

The control framework API and tools, as we've defined them, are not enough for what users need. We need more services and basic setup stuff for researchers to be successful.

Paul Barford Our experience with WAIL is that without good tools, the learning curve is too long for people to pick it up. Need a front-end that makes the experience for users simpler. Just got Cybertrust funding for a tool for Emulab that helps you repeat a configuration, publish an experiment, provides pre-configured environments for different types of experiments (e.g. honeynet, botnet, ...). If there is any way to do this in GENI so a student or a researcher could sit down at a dashboard and push some buttons and get an experiment running, it'll really help get GENI off the ground.

Aaron Falk We're completely in sync with this. It wasn't in spiral 1, we're thinking about it for spiral 2.

Jim Griffioen The interface is difficult to use and there is little documentation, so it's pretty difficult to talk about instrumentation.

Jim Griffioen I see the backbone evolving as the PlanetLab backbone did. A lot of campuses are on Internet2, but a lot are on the regular internet. The backbone is connectivity no matter how you get it. A lot of us will try to get on the wave links, if we can.

Once our switches get deployed in Internet2 PoPs, we'll manage them the same way we do in Emulab, we won't overprovision them.

Chip Elliot Right now we have a lambda from NLR and a lambda from Internet2, and in the next few weeks we'll start attaching equipment on it. Some people want to run repeatable experiments, and they'll migrate off the internet onto this.

The only real difference between Emulab and DETER is that DETER doesn't provide publicly routable addresses; nodes have to be behind a NAT.

Paul Barford Containment is a serious issue. The community needs to think about what it means to run larger scale security experiments. Arguably, the community doesn't really know how to do repeatable experiments under laboratory conditions of this kind. We need to think carefully about what is required to bootstrap experiments, and how to run them.

Heidi Picher Dempsey There is this tension between "you can connect anywhere via the internet, but you don't know how you're connected", and "here is a fixed, repeatable network topology, but it's hard to get onto it."

Milestones

Integrate, install 3 nodes in Internet2 -- waiting for switches, not arrived, this date will slip.

Next two milestones are blocked on this one, but should fall out as soon as the switches are up.

Support experiments on ProtoGENI cluster -- we're already there, except for not running on the Internet2 backbone. And last milestone is also well underway.

Instrumentation Tools (Jim Griffioen)

About 70 machines available, will be more useful after we get OpenVZ (a virtualization environment) on them. They are connected to ProtoGENI.

Some of the software we're running doesn't understand virtual interfaces (e.g. tcpdump). We need more experience with that.

We've had some conversations with the GMOC about what we're measuring and how. We're trying to grab experiment-specific data; they are interested in global information.

One problem with the API is that it's designed around users, not groups. If I'm teaching a class, I have students, I have a TA, want some sharing, some restrictions.

Chip Elliot CF working group should talk about this.

I have a paper -- Heidi has a pointer to it.

UK now has a 10Gbps Internet2 link, will be putting in a second "research" link to Internet2 (which will be shared with other departments). No VLAN support planned at this time (although it might be possible over the new link). Right now using GRE tunnels. Have discussed implementing DCN with Internet2 and KRON.

Aaron Falk Dynamic VLANs are not a requirement; with static VLANs you can run non-IP layer 2 experiments, which would be great.

Lots of researchers have access via ProtoGENI -- although most of them don't know it. Not ready for class use yet, still gets rebooted and reinstalled every couple of weeks. (Would like to use it for operating systems/distributed systems course in spiral 2.) Want to use it with our FIND research.

Chip Elliot It would be great to give tutorials to researchers (how to use it), professors (how to teach classes on it).

Paul Barford Been teaching a network systems course in WAIL for years, it could be generalized to work on GENI. It's been very popular. Anyone who wants it is welcome to it. Of the 50 people who have taken this class, 20 have been hired by Cisco.

Milestones

2nd milestone: Jim went the extra mile to update his software and be more interactive with ProtoGENI, integrate more often. So although the milestone was a little bit late, the effort was above and beyond what was agreed to, so the minor slippage is not an issue.

DTunnels (Nick Feamster)

Design Requirements for BGPMux

Session transparency (BGP looks like a direct connection), session stability (no transient behavior visible to upstream nets), isolation (individual networks able to set own policies, forward independently), scalability (support many networks).

The last two are still in progress. The first two are being used in the prototype now.

Reviewed BGP mux functions in slightly more detail---we've seen this before, so I'm not making notes on it.

Year 1 progress:

  • DTunnels:
    • Kernel patches to create Ethernet GRE tunnels
    • Interface for specifying topology in XML and instantiating topology in OpenVZ (which ProtoGENI nodes will run)
    • Still need to discuss RSPECs more to make the BGPMux RSPEC fit in with ProtoGENI RSPECs
  • BGP Mux:
    • Design and implementation of control plane
    • Deployment in three locations: GT, Wisc, PSG
    • Will demo data plane at GEC5

Paul Barford Do you use different AS numbers for each site?

No, we are using one AS for all sites. Challenges were a combination of logistics (things like the AS number) and hacking (GRE tunnels and iptables mapping).

Integration Progress in Spiral 1:

  • DTunnels: Instantiate topology from an XML spec that resembles an RSPEC. Initial discussions about how to make the RSPEC and XML spec have more in common.
  • BGP Mux: Install Mux nodes on same subnets as ProtoGENI nodes. Will get two PG nodes assigned to BGPMux and then try running the BGPMux on them. Would like to demo this at GEC5.

No RSpec integration yet; need an AM

  • I2 progress: I2 is accepting BGP advertisements from Georgia Tech, PSGNet (not Wisconsin). Some filtering on commodity links due to rwhois/IRR problems.

For DTunnels, I2 connections don't really apply, but we do have the option of provisioning between deployed sites. Just using GRE tunnels now.

First experiment is NameCast, with Jen Rexford at Princeton. DNS resolvers at multiple sites are advertised on a common IP prefix. Service advertises and withdraws BGP routes to control how traffic reaches the service. Deployment in progress at Georgia Tech. This experiment was designed for VINI, but there's no VINI at Wisconsin, so it needs to work on ProtoGENI.

Next experiment will be something that requires virtual networks/tunneling behind the mux.

Have to be careful about things like advertising BGPMux as a transit net and causing trouble in the rest of the network when this is deployed and used by many people, not just friends and family.

  • Spiral 2 plans:
    • More experiments
    • Integrate AM with ProtoGENI
      • Automate BGP Mux setup from RSpec
    • Integrate with other control frameworks

Some discussion about status of BGPMux milestones. Some are being changed, so wiki status will be revised when Aaron and Nick close that out.

Measurement System (Paul Barford)

GIMS aggregate manager controls the instrumentation plane. Basic model is that there are multiple sensors, managed by the GIMS aggregate manager. Researchers select which sensors they want to use in their experiment, which kinds of packets to gather in the experiment.
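
As a rough illustration of the selection model described above, the sketch below represents a capture request as a set of chosen sensors plus a packet filter. Every name here (`CaptureRequest`, the sensor IDs, the packet records) is hypothetical and made up for illustration; none of it is the GIMS API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical packet record, e.g. {"sensor": "sensor-a", "proto": "tcp", "dport": 80}
Packet = Dict[str, object]


@dataclass(frozen=True)
class CaptureRequest:
    """What a researcher asks for: which sensors, and which packets."""
    sensors: FrozenSet[str]             # sensors chosen for the experiment
    wanted: Callable[[Packet], bool]    # predicate naming the packets to keep

    def select(self, packets: List[Packet]) -> List[Packet]:
        # Keep a packet only if it came from a chosen sensor and passes the filter.
        return [p for p in packets
                if p["sensor"] in self.sensors and self.wanted(p)]


request = CaptureRequest(
    sensors=frozenset({"sensor-a", "sensor-c"}),
    wanted=lambda p: p["proto"] == "tcp" and p["dport"] == 80,
)

observed = [
    {"sensor": "sensor-a", "proto": "tcp", "dport": 80},
    {"sensor": "sensor-b", "proto": "tcp", "dport": 80},   # sensor not selected
    {"sensor": "sensor-c", "proto": "udp", "dport": 53},   # filtered out
    {"sensor": "sensor-c", "proto": "tcp", "dport": 80},
]
selected = request.select(observed)
```

In a real deployment the aggregate manager would presumably push the filter out to the sensors rather than filter after the fact; the sketch only shows the shape of the request.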

Sensor can use a high performance packet capture card (costs about $4K for the card, provides a GPS timestamp). There is a new Intel NIC that comes on the motherboard of new machines; it splits streams across multiple cores, may be able to do the work we need for substantially less. Not generally programmable, but does what we need. Total system cost under $2K per sensor node. Old NICs start dropping packets at 300-400 pps.

Will demo at GEC5.

High management overhead + low publication opportunity makes the program a challenge.

There is a fluidity to the GENI architecture which makes designing for it difficult.

We did a redesign of the GIMS architecture and the GIMS sensor architecture.

Upgraded WAIL to GENI. WAIL users can access GIMS and GENI, and GENI users can access WAIL. GIMS UI extension for ProtoGENI underway, will demo at GEC5.

No WAN connections to date. NLR will support GIMS deployment, have not officially solicited Internet2.

Spiral 2, want to expand suite, move from prototype to v1.0, and deploy GIMS on GENI (with minimal scope).

Daryl Veitch is working with Paul on timing. High-perf packet capture cards come with GPS capability.

Chip says GENI could engineer a stratum 1 time source. Million Node GENI would like to use it if it existed. Someone would have to propose it and get it funded, of course. Paul noted that he and Daryl are skeptical about how well current stratum 1s are actually synchronized.

Chip said we will have to address the lack of Ethernet frame info in this project, since GENI is being built based on Ethernet VLANs for prototypes.

Paul said his work is a complement to Jim's. Jim might be able to use pub/sub interface to GIMS. Difference has to do with Paul owning the sensor and Jim not. May need some kind of shim to make Jim's stuff able to use it.

Other measurement infrastructures that have been deployed: NEMEY, Surveyor. Lately the move has been towards end-host application measurements, but Paul is trying to cover the in-network measurements.

Chip and Paul discussed the need for better funding for building testbeds. Need to convince NSF that it needs better funding. Should we have a meeting of a dozen blue-ribbon people who are participating in testbeds and make recommendations to NSF? If a dozen leading people in the US are saying it is needed, maybe they will listen.

Larry points out that the digital library people had the same problem 10 years ago and they failed. The GRID people have "money running out their ears" for this kind of thing.

Jason said what about MRI funding? Can only submit 3 proposals per campus.

GENI architecture is complicated and fluid. Coming up with measurement AM was a new approach and harder than the original idea. Redesigns will still be required in the future.

WAIL has been upgraded to ProtoGENI API.

NLR deployment was original plan, haven't talked to I2 about deployment yet.

Can build interfaces to other projects outside the cluster. Sensors and AM are designed to be general and extensible.

Working closely with BGP mux project. Paul will try to come up with experiments to measure what BGP mux is doing to integrate more closely.

SEER wrapper from DETER (Schwab) might be extended to be useful with GIMS for experimenters. SEER is built on Emulab APIs. Could port to ProtoGENI API instead.

Y2 deployments are likely to be off Gigabit links nearer to campus edges than cores.

Milestones

  • Develop set of specifications. It's kind of a general description or narrative.
  • Develop test suite. Harpoon-based suite, complete.
  • Develop prototype that can be deployed and demonstrated in WAIL. Done, will demo at GEC5.

Looks like basically all done. The specifications need some refinement.

Programmable Edge Node (PEN) (Yan Luo)

In year 1, acquired and assembled a PEN multi-core server with network-processor-based acceleration card (Netronome NFE-i8000). Acquired switch, PCs. Integrated with Emulab/ProtoGENI. Use Netronome Flow Manager to establish up to 256 virtual NICs.

PEN modified the DB to give it the illusion that it has several physical nodes, even though they are actually virtual nodes. The rest of the work could be done on the client side.

Integration was a challenge. Emulab source base is huge, and not all that completely documented.

Rob Ricci A big (250 page) document is coming.

Spiral 2 goals

Want deployable PEN node ready (regression test, bill of materials, user guide, wiki) by 10/01/09. Work with University of Utah to deploy two PENs in Internet2 backbone. Enhance the measurement and diagnosis capabilities of NP card. Support external research on PEN.

Heidi Picher Dempsey Does it have to be in the Internet2 PoPs? It's expensive.

Chip Elliot This seems like an interesting match for BGPMux; you should talk to Nick Feamster.

Chip Elliot Do you expect researchers to program the network cards?

Yan Luo We'll program the network cards, not researchers.

Milestones

All of the milestones were done on time, one was done far in advance.

Million Node GENI (Justin Cappos)

2 programmers and 15 undergrads working on this project over the summer

Built Repy (VM), Node Manager, Seash (shell) experiment manager (similar to PLUSH)

Prototyped SeattleGENI (component manager/clearinghouse). This tracks your "donations" of systems and gives you credit for them. When you donate more, you get more allocated to you. (Basically VMs).
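
The donate-more-get-more idea can be sketched as a small ledger. This is an illustration only: the `DonationLedger` class, the user names, and the exchange rate of 10 VMs per donated machine are all assumptions, not the SeattleGENI implementation or its actual policy.

```python
from collections import defaultdict

# Hypothetical exchange rate: VMs credited per donated machine (an assumption).
VMS_PER_DONATED_MACHINE = 10


class DonationLedger:
    """Tracks machines donated per user and derives a VM credit from them."""

    def __init__(self):
        self._donated = defaultdict(int)

    def record_donation(self, user, machines=1):
        if machines < 1:
            raise ValueError("a donation must include at least one machine")
        self._donated[user] += machines

    def vm_credit(self, user):
        # Users who have donated nothing get no credit in this sketch.
        return self._donated.get(user, 0) * VMS_PER_DONATED_MACHINE


ledger = DonationLedger()
ledger.record_donation("alice")
ledger.record_donation("alice", machines=2)
ledger.record_donation("bob")

credits = {user: ledger.vm_credit(user) for user in ("alice", "bob", "carol")}
```

A linear rule is the simplest choice; a real clearinghouse might cap credit or weight donations by the resources each machine actually offers.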

Collaborated: PLab, GpENI, ProtoGENI, DOR

Publications at SIGCSE, PN-ASEE/WCERT, NW-DCSD workshops, 12 talks

This project is heavily geared towards education. What teachers want to do is either run on a WAN or on a LAN. LAN is easier to debug. Obviously researchers will have different requirements that we might be able to add.

Paul B. mentioned CONDOR GRID computing environment. Wondered whether million node GENI could get access to Condor nodes (there are over 100,000). Justin says yes, they've thought about it, and also similarly with the @Home project. The issue is the trust model because Million Node GENI will allow ANYONE to have an account, but GRID and other projects have more restrictions.

VM is based on Python, and is custom to Million Node GENI.

Chip asks what kind of resources: Disk, processing, memory, network bandwidth, hardware random number generator...15-20 resources are restricted by Million Node GENI VM now. Million Node GENI measures based on Justin's Mac laptop to decide how much to limit resources.

Expected security model for Million Node GENI is to restrict traffic to within the "testbed." VMs in a testbed talk to each other, but nothing is an exit node elsewhere. There may be other models too, but this seems to be the most likely.

Testbed this year went up over 1000 nodes. This counts anything that's ever been up.

Had to adapt to firewalls. Not looking for more end users.

Time goes backwards on modern machines frequently
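
One common way to cope with a wall clock that steps backwards is to wrap it so callers never see a smaller value than one already returned. The sketch below is an assumption about how that could be done, not a description of what Repy does; `NonDecreasingClock` and the simulated readings are made up for illustration. For measuring intervals, Python's standard `time.monotonic()` avoids the problem entirely.

```python
import time


class NonDecreasingClock:
    """Wraps a clock source and never reports a value lower than a previous one."""

    def __init__(self, source=time.time):
        self._source = source
        self._last = float("-inf")

    def now(self):
        reading = self._source()
        # If the underlying clock stepped backwards, hold the last value instead.
        if reading > self._last:
            self._last = reading
        return self._last


# Simulated wall-clock readings in which the third reading jumps backwards.
readings = iter([100.0, 101.0, 99.5, 102.0])
clock = NonDecreasingClock(source=lambda: next(readings))
observed = [clock.now() for _ in range(4)]
```

Holding the last value keeps timestamps ordered but hides the jump; code that needs to know the clock was adjusted would have to detect and report it separately.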

ProtoGENI clearinghouse integration. Demo version will run at GEC5. Seattle running on ProtoGENI. Includes AutoGrader, which gets Emulab resources, deploys Seattle, deploys student code and then assigns a grade. This is supposed to be repeatable.

Have been discussing whether it makes sense for Million Node GENI to use ProtoGENI control framework. Million Node GENI believes there is a problem with that. Million Node GENI doesn't have a notion of a lot of things that are calls in ProtoGENI (for example booting a node).

Might want to implement the interface but not the semantics. Might want to build absolute bare minimum interface to ProtoGENI for what Million Node GENI wants to do.

Chip asked about where integration is going with Million Node GENI. Million Node GENI says they want people to be able to use Million Node GENI through a GENI control framework, but not the other way around. To combine Million Node GENI with something else, you would use a tap on Million Node GENI that lets a random internet node act like a Million Node GENI node even though it doesn't show up as a general purpose Million Node GENI node.

Rob said he'd like to be able to have the notion of a slice in Million Node GENI. Justin said you can (but it sounds like there are complications with this).

Chip asked whether it runs on handsets. Justin said they've done proof of concept on the Nokia N800, jailbroken iPhones, and the One Laptop per Child notebook. Couldn't do Windows Mobile despite significant effort.

Chip asked about privacy concerns, especially with handsets and location information. Can route traffic through Tor and have this information be hidden. If the information isn't hidden, experiments can take advantage of the location information. This is a different model than the cloud projects because the node will be ON your own network.

Million Node GENI will work on adding some UW nodes to GENI via I2.

Aaron said it would be good to know whether nodes were on GENI-enabled infrastructure.

Currently used in 6 classes. Support from NW-DCSD. Want more adoption this fall. Easier for students to do experiments in this environment than in others.

  • Spiral 2 plans
  • Increasing support for researchers/developers
    • Installer/end user interface
    • Repy v0.2 node manager (performance improvements, resource reassignment, measurement)
    • Services
    • Spec for end-host Clearinghouse API
      • Prototype end-host Clearinghouse
        • mash ups
        • identity management
  • Collaborate with O&M and security
  • GENI outreach

Vic asked about how ProtoGENI interface works. There's a slice defined for Seattle. Emulab resources can be "owned" by Seattle on behalf of Million Node GENI users.

Vic asked about the status of the tutorial (could be called done in Justin's opinion). Also discussed RSPECs--still working on definitions of that.

Aaron pointed out that GENI really needs to be able to glue slices into parts of Million Node GENI to make it useful---can't just

Million Node GENI would like to have people deploy on Windows machines to help them with debugging. HPD to follow up on the possibility of deploying some Windows boxes on our BBN Emulab site.

HPD asked if the GpENI install of Million Node GENI used anything beyond access to the GpENI PLab node. No. But it was a good node to add because it has ssh access, which most nodes don't.

Milestones

Digital Object Registry (Larry Lannom)

Explained the background of the Handle System and the Digital Object Registry.

Has grown up in a DoD environment in the past 3 years or so.

Larry put up a chart of terms for GFC (GENI Federated Clearinghouse of CNRI), ProtoGENI, and Million Node GENI

Rob commented that all CMs in ProtoGENI are also AMs. There are no CMs that only do single components.

  • Spiral 2 goals
    • Clearinghouse:
      • define a normalized and interoperable GENI clearinghouse specification
      • provide our sw to new and existing projects
      • federate individual clearinghouses into the GFC, allowing researchers to discover resources across GENI.
    • Security:
      • Integrate and make available the proposed PKI solution (from CNRI). Has been running for at least a decade.
    • Identifiers

Scholarly publishers are currently the most frequent users of the Handle System.

Can do interfaces into repositories.

GPO Spiral 2 Overview and Discussion (Chip Elliot)

Chip Elliot We should have a workshop for people involved in building and running testbeds and infrastructure and send some recommendations to NSF explaining what is needed for research infrastructure.

Chip Elliot There is a clear disconnect. They like the idea of infrastructure, but don't like the cost.

GPO to put Shib/InCommon info from Docushare onto wiki for convenience of GENI projects investigating it. Harry could do this (also overlaps with Ketly).

Paul said the last NSF nets call included a one-page supplemental about how the research would be done in GENI.