
Version 7 (modified by chase@cs.duke.edu, 13 years ago)

--

How I learned to stop worrying and love slivers

Jeff Chase, speaking only for myself.

The GENI debate over slivers has always seemed surreal to me, because it cannot be a technical disagreement. It can only be a disagreement about what is meant by the term "sliver". And that is pointless because a sliver could be anything. So it quickly comes down to one's views about the goals and scope of the entire GENI undertaking. And yet at the same time, the sliver debate raises a crucial technical issue because it will drive the implementation balance between declarative specifications and code, which is a major architectural tussle in GENI. I am being provocative here but please bear with me.

In the Spring of 2008 several of us made a pilgrimage to CNRI to meet some Internet luminaries and talk about GENI. (That was the last time I saw Jay Lepreau, bless him.) As I recall, Vint Cerf was eloquent and animated on the theme that what is (or was) new in GENI is describing and reasoning about virtualized resources and their configurations. (Federation is another challenge, but that is mostly a question of figuring out how to use prior work.)

GENI is all about heterogeneous deeply programmable virtualized infrastructure resources, also known as programmable substrate. These words are taken from my favorite GENI Vision slide (pdf), which is now four years old and which I have been using for almost as long. (If anyone remembers the history of this slide, I would like to know it.)

So, what is a virtualized infrastructure resource? Well, that's a good question. In fact, it was the question Vint Cerf asked us in Reston, and so it must be a good question. He said that defining what we mean by "network virtual resource" and describing such resources are the key challenges for GENI. I don't want to put words in his mouth. But it's in my notes.

We seem to find lots to do to avoid that challenge, but we are forced to confront it every time we talk about slivers and rspec.

Virtual Resources

A virtual resource ("virtualized infrastructure resource") is something you can have or not have. While you have it, it might conduct some activity for you, like running a program, or storing data, or carrying your packets, or looking at someone else's packets as they fly by, or sensing the real world. Or anything, really.

A virtual resource is something you can allocate: if you get it, somebody else can't have it. At one end of the continuum we are allocated exclusive access to some physical substrate, e.g., a component. But most often when we interact with a virtual resource there is a layer of software between us and the hardware substrate itself (the metal and glass). So what we are really interacting with, at least in part, is the software. We might occupy time or space on some actual substrate resource during the activity (that's the mutual exclusion part), but the software may do some scheduling of that resource and otherwise control the activity in some way.

In other words, a virtual resource is really an instance of a service---an infrastructure service. GENI has always focused on what is now called infrastructure-as-a-service (IaaS) in the industry and in the NIST cloud taxonomy.

So when we describe a virtual resource, we might also have to say something about the details of the service instance. For example, we might have to describe properties of the scheduler, like the degree of assurance that the instance will be permitted to occupy space or time on any actual substrate.

Some virtual resources mix software with the hardware to varying degrees. Surely a Python environment like Google's AppEngine is a programmable virtual resource, for example. There is more to say about the distinction between an infrastructure service and any other kind of software service along the continuum from grid job services to application services. NIST has already laid out that space for us, so I will leave that question. People sometimes seem uncomfortable that we do not know how far down this continuum we might go with the GENI virtual resource abstractions. What we do know is that we can describe physical resources fairly easily, but it gets harder to describe our virtual resources as we move down the continuum toward application services. At some point we can no longer say precisely what we mean by "virtualization".

OpenFlow

Before I move on, let me say a little about OpenFlow. OpenFlow is an interesting case for several reasons. OpenFlow is pretty far down this continuum, and it is not an accident that some of the most pointed questions about slivers have come from OpenFlow voices.

Clearly OpenFlow switches are programmable substrate. Flow rules are programs and those programs consume switch resources. Those flow rules are a kind of service: what OpenFlow does is apply the rules to packets flowing through the network. All packets are labeled with names allocated from various name spaces. Those name spaces are also virtual resources---flowspace. If a rule matches the labels on a packet, then the rule may redirect or even transform the packet, which consumes more switch resources. Before OpenFlow we called such rulesets packet filters.

OpenFlow is interesting first because the packets matching the rules might belong to someone else. An OpenFlow rule is like a service that queries someone else's data. So we should first be sure that the owner of a rule has a right to operate on the data. Thus OpenFlow presents an interesting authorization challenge that is quite different from other kinds of virtual resources. Second, OpenFlow rules match against flowspace labels on the packets, and the label values are themselves virtual resources. Thus authorization of OpenFlow rules is often strangely conflated with resource allocation of flowspace. But it is not resource allocation because the flowspace is allocated to the owner of the packet, and not to the owner of the rule (unless these happen to be the same entity). In any case, flowspace is allocated to a network endpoint or path independently of any rules being activated for that flowspace, so OpenFlow has little to do with flowspace allocation.
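A minimal sketch of the distinction drawn above: authorizing a rule is a check against who holds the matched flowspace, not an allocation of that flowspace. The function name, the `label_owner` table, and the `delegations` set are hypothetical and not part of OpenFlow or any GENI API. The policy shown (owner or explicit delegate) is only one assumption about what "a right to operate on the data" could mean.

```python
def rule_is_authorized(rule_owner, matched_labels, label_owner, delegations=frozenset()):
    """Return True if rule_owner may act on every label the rule matches.

    label_owner maps a flowspace label (e.g. a VLAN tag) to the entity it is
    allocated to. delegations holds (packet_owner, rule_owner) pairs granting
    permission. Both structures are illustrative assumptions.
    """
    for label in matched_labels:
        owner = label_owner.get(label)  # None if the label is unallocated
        if owner == rule_owner:
            continue
        if owner is not None and (owner, rule_owner) in delegations:
            continue
        return False
    return True


# Flowspace is allocated to packet owners independently of any rules.
label_owner = {"vlan:101": "alice", "vlan:102": "bob"}

own_rule = rule_is_authorized("alice", ["vlan:101"], label_owner)
foreign_rule = rule_is_authorized("alice", ["vlan:102"], label_owner)
delegated_rule = rule_is_authorized(
    "alice", ["vlan:102"], label_owner, delegations={("bob", "alice")}
)
```

Note that the check never modifies `label_owner`: activating or rejecting a rule leaves the allocation of flowspace untouched.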

Third, the amount of actual substrate resource that an OpenFlow ruleset consumes depends in part on the amount of network traffic generated by someone else. But in this third respect OpenFlow is no different from any service that is driven by requests from external users. The resource demands of these services may be hard to predict in advance, but some resources must be assured for acceptable service quality. Which brings us back to the need to describe scheduling properties. But I digress.

The point is that active OpenFlow rulesets represent virtual resources, so OpenFlow is a good test case for a GENI sliver abstraction. But we should be very careful when we talk about OpenFlow.

Semantic Models

So: there are many kinds of virtual resources, and the goal of GENI is to operate on them. To operate on them we must describe them. But there are so many kinds to describe. And as Alice said, the question is whether you *can* make words mean so many different things. But what do we want to say about these virtual resources?

When we talk about virtual resources, we can often identify distinct elements within them to talk about. For example, there are virtual machines, and network pipes, and logical storage containers. These examples suggest that our virtual resources tend to closely match the shapes and behaviors of actual substrate elements (components). Indeed, a physical component might be allocated directly as a virtual resource (as in early Emulab). But virtual resources are above the component layer: a component might host more than one virtual resource, or a virtual resource might span components. And when we describe virtual resources we find other elements that do not map easily onto components. There are VLAN tags and other flowspace labels, and running programs, and standing queries such as active OpenFlow rulesets.

We can describe these various entities independently in terms of their properties and their relationships. That is good, because if our goal is to describe virtual resources precisely enough to process the descriptions automatically, then we are going to need a semantic model, and semantic models are almost by definition based on classifying entities and their relationships. There is a rich literature on entity-relationship models going back 35 years.

Indeed, a large part of the GENI challenge is in developing and processing declarative specifications of virtual resources using semantic models. We can use these models to create documents that describe virtual resources in terms of elements and relationships. When we request virtual resources or changes to virtual resources, we attach documents that describe the resources and changes we want. If a request is granted, we receive documents describing the virtual resources we got. These documents are called rspec.
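To make the idea of "elements and relationships" concrete, here is a small sketch of a resource description held as a graph. It is not the rspec format or any GENI schema; the class names, fields, and the example kinds ("vm", "pipe", "attached_to") are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    name: str       # hypothetical identifier, unique within the slice
    kind: str       # e.g. "vm", "pipe", "vlan_tag"
    aggregate: str  # the aggregate that hosts this element


@dataclass(frozen=True)
class Relationship:
    source: str     # name of one element
    target: str     # name of the other element
    kind: str       # e.g. "attached_to"


@dataclass
class ResourceGraph:
    elements: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_element(self, element):
        self.elements[element.name] = element

    def relate(self, source, target, kind):
        # A relationship may only refer to elements already described.
        if source not in self.elements or target not in self.elements:
            raise KeyError("both endpoints must be described first")
        self.relationships.append(Relationship(source, target, kind))


slice_graph = ResourceGraph()
slice_graph.add_element(Element("vm1", "vm", "site-a"))
slice_graph.add_element(Element("pipe1", "pipe", "net-b"))
slice_graph.relate("vm1", "pipe1", "attached_to")
```

A request document and the document returned when a request is granted could both be serializations of a structure like this one.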

What is important here is that virtual resources have an internal structure and properties, and we describe these using declarative specifications. Originally ORCA used simple resource types and property lists. But these descriptions have become significantly more advanced in the GENI project, and they are a key part of the GENI architecture challenge.

But what does such an rspec document describe? Must it describe a complete slice? Or can we describe different pieces of a slice in different documents? As we will see, this is an essential question for understanding the role of slivers in GENI.

Partitioning Virtual Resources Across Aggregates

Another aspect of the GENI challenge is that virtual resources are distributed. They span a "Federated International Infrastructure" in the words of my favorite GENI Vision slide. We recognize that substrate resources are grouped into aggregates owned by infrastructure providers. In general, we seem to be willing to presume that this grouping is a partitioning: each piece of infrastructure is controlled by exactly one aggregate. I sometimes hear people talk about hierarchical "aggregates of aggregates", but I think even they would agree that each piece of infrastructure is controlled by exactly one leaf aggregate in the hierarchy.

We also seem to accept that we can partition virtual resources across aggregates in a way that mirrors the partitioning of the substrate resources. That is, a virtual resource is provided by a single aggregate, and consumes substrate resources only on that aggregate. To make changes to a virtual resource, we send requests about it to the aggregate that controls it. Users and their tools can talk to aggregates independently of other aggregates.

We can think of these virtual resources as the "parts" of a slice. Users and their tools obtain these parts from different suppliers --- the aggregates --- and assemble them into a slice. We can make an analogy to building any kind of complex machine from off-the-shelf parts obtained from multiple suppliers. The analogy isn't perfect: if the part is a virtual resource, then the supplier must host and operate the part rather than merely shipping a physical widget out into the physical world. As we have said, the "part" is really a service. This is also why one can never really own virtual resources in the same way that one owns a physical part: virtual resources are services provided over a network, and the provider controls the actual substrate and can stop providing the service at any time. But let us go with the analogy as far as it takes us.

The next question is, when we get a part from a supplier, do we need to tell the supplier about our parts from other suppliers? When we talk to an aggregate about our virtual resources there, it is reasonable that we would want to limit the conversation to the specific infrastructure service that aggregate provides. Similarly, if a builder or manufacturer gets parts from a supplier, they do not have to show the supplier the blueprints for the entire project. It is understood that various materials and parts are available to the customer from different suppliers, and that these pieces fit together in various ways. The customer may select the parts and combinations and use them to build whatever the customer wants, without telling the parts suppliers about the overall assembly. The materials and parts and means of assembling them may change with time, and we can't say in advance what they all are. But it is understood that it improves efficiency to have interchangeable off-the-shelf parts with standard well-defined compatibilities. This familiar idea was called the American System in the 1830s.

Can we apply the American System to virtual resources in GENI? This partitioning of virtual resources is a crucial step. If we are going to describe complex slices and change them over time, then a divide and conquer strategy will simplify the task considerably.

Stitching

The key challenge to overcome for partitioning our resource descriptions is that there are relationships among the virtual resource elements. And to the extent that we have these relationships among virtual resources on different aggregates, those aggregates may need to interact, perhaps through some intermediary. In GENI we call these interactions stitching. Many of the driving use cases for stitching involve interconnecting virtual resources within a slice. For example, we use stitching to connect virtual network pipes into paths or networks terminating at virtual nodes.

Can our semantic models describe all the relationships that might require such interactions? If the answer is yes, then we can talk to each aggregate about the relationships that cross its borders, without it having to be aware of any virtual resources that are unrelated to what we want that aggregate to do. The graph is partitionable. If the answer is no, then we might need to tell every aggregate about every virtual resource, in case there is some important relationship that we missed. Perhaps there is some relationship that is not represented explicitly in the description, but that an aggregate can infer from the descriptions of other virtual resources at other aggregates. In that case, the aggregate must have all of those descriptions available to it. The graph is not partitionable.

It is logical that we would seek to describe all such relationships in our semantic resource descriptions. If we discover that we have missed an important relationship, that means our semantic model is insufficient, and we should go back and extend it or rethink it.

I believe that these interactions are relatively easy to describe for network virtual resources. What is difficult is to describe the service that a network virtual resource provides. But once we describe the service, a relationship is almost always a binding of one virtual resource to the service provided by another. In networked systems those service endpoints always have names or labels allocated from some network namespace: VLAN tags, IP addresses, ports, DNS names, URLs, LUNs, lambdas, pathnames, alone or in combinations with other identifiers. What is needed is to describe which virtual resources are providing a service and which are consuming that service. Then we can bind the consumer to the producer by passing the producer's label to the consumer. In essence, the graph becomes a dependency DAG, with edges directed from producers to consumers. A stitching agent traverses the DAG, instantiating resources and propagating labels to their successors as the labels become available. This is how ORCA does stitching.
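The traversal described above can be sketched with a topological sort. This is not ORCA's code: the `stitch` function, the `depends_on` mapping, and the stand-in `instantiate` callback are assumptions for illustration. The only library call used is `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def stitch(depends_on, instantiate):
    """Instantiate resources producers-first, passing labels to consumers.

    depends_on maps each consumer to the set of producers it binds to.
    instantiate(name, inputs) creates one resource, given a dict of its
    producers' labels, and returns the label of the new resource.
    """
    labels = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {p: labels[p] for p in depends_on.get(name, ())}
        labels[name] = instantiate(name, inputs)
    return labels


seen_inputs = {}


def fake_instantiate(name, inputs):
    # Stand-in for a call to the hosting aggregate (hypothetical).
    seen_inputs[name] = dict(inputs)
    return name.upper()


# Two VMs consume a pipe, which in turn consumes a VLAN tag.
deps = {"vm1": {"pipe1"}, "vm2": {"pipe1"}, "pipe1": {"vlan"}}
labels = stitch(deps, fake_instantiate)
```

Each consumer sees only the labels of its direct producers, which is the sense in which the consumer need not be given a description of the producer.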

We should be able to describe these relationships using our semantic models, and propagate labels by querying descriptions based on those models. We don't need to write any new code for stitching. We don't need to describe the producer to the consumer if the consumer already understands what kind of resource or service it wants to bind to. If it does not, then the configuration of virtual resources is malformed.

Partitioning Virtual Resources Within Aggregates

Suppose then that we can describe virtual resources of a slice by a graph of elements (entities) and relationships, using a semantic model. Suppose further that we can partition the graph across aggregates as I have described, so that we talk to each aggregate only about the virtual resource elements that it hosts, and any adjacent edges. Edges crossing partition boundaries require coordination among a pair of aggregates, i.e., stitching.
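A sketch of this partitioning, assuming the graph is given as a map from element to hosting aggregate plus a list of edges. The function name and the data layout are hypothetical. Each aggregate's view holds only its own elements and the edges adjacent to them; edges whose endpoints sit on different aggregates are collected as the ones that need stitching.

```python
def partition_by_aggregate(hosted_at, edges):
    """Split a slice graph into per-aggregate views.

    hosted_at maps element name -> aggregate name.
    edges is a list of (a, b) element pairs.
    Returns (views, stitched) where stitched lists cross-aggregate edges.
    """
    views = {}
    for element, aggregate in hosted_at.items():
        view = views.setdefault(aggregate, {"elements": set(), "edges": []})
        view["elements"].add(element)

    stitched = []
    for a, b in edges:
        agg_a, agg_b = hosted_at[a], hosted_at[b]
        views[agg_a]["edges"].append((a, b))
        if agg_b != agg_a:
            # The edge crosses a partition boundary: both sides see it.
            views[agg_b]["edges"].append((a, b))
            stitched.append((a, b))
    return views, stitched


hosted_at = {"vm1": "site-a", "vm2": "site-a", "pipe1": "net-b"}
edges = [("vm1", "vm2"), ("vm1", "pipe1")]
views, stitched = partition_by_aggregate(hosted_at, edges)
```

In this example `net-b` never learns that `vm2` exists, because no edge connects `vm2` to anything it hosts.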

Now the question is: how does the aggregate expose the graph through its API, so that a slice owner can operate on the graph? In the current (or near future) AM-API there are simple calls to operate on the graph: create, destroy, and (soon) update. The create and update operations take as an argument an rspec document describing at least the entire partition of the graph residing at that aggregate. The requester says what region(s) of the graph they want to operate on somewhere in the rspec document attached to the request, and not in the API.

Thus the AM-API itself offers no way to talk to an aggregate about some regions of the graph independently of other regions of the graph. If we want to add resources, we must pass an rspec for the entire graph, with the new parts added. If we want to remove resources, we must pass a description for the entire graph, with some parts removed.
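One way an aggregate could act on such a whole-graph request is to compare the new description with the one it already holds and derive the changes. The sketch below assumes each description is simply a dict from element name to properties; `plan_update` is a hypothetical helper, not an AM-API call.

```python
def plan_update(current, requested):
    """Compare two whole-graph descriptions (element name -> properties)."""
    to_create = {n: p for n, p in requested.items() if n not in current}
    to_destroy = [n for n in current if n not in requested]
    to_modify = {
        n: p for n, p in requested.items()
        if n in current and current[n] != p
    }
    return to_create, to_destroy, to_modify


current = {"vm1": {"size": "large"}, "vm2": {"size": "small"}}
# The requester resends everything, including the unchanged vm1.
requested = {"vm1": {"size": "large"}, "vm3": {"size": "small"}}
to_create, to_destroy, to_modify = plan_update(current, requested)
```

The unchanged `vm1` still has to travel in the request, which is the cost the paragraph above points to.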

But if a virtual resource graph can be partitioned across aggregates, then it must also be possible to partition the graph within aggregates. We can break the graph into named regions and use the aggregate API to talk about specific regions, passing the rspec for only the regions of interest. We can let the aggregate handle any edges that cross region boundaries within the aggregate.

If we can partition the graph into regions, how shall we decide where to set the boundaries? How big shall we make the regions? At one extreme there is a single region: this is the degenerate case represented by the v2.0 AM-API. We can make any changes we want using a single API call, but we must pass rspec for the entire graph, even for a minor change. If we introduce region boundaries, then we have a tradeoff. With smaller regions, we need more requests to instantiate or update a given graph, but each request passes a smaller rspec. With larger regions, we make fewer API calls to instantiate or update the graph, but the rspec documents are larger.
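The tradeoff can be put in rough numbers. The sketch below assumes, purely for illustration, that rspec size is proportional to the number of elements described and that regions are of equal size; it ignores edges that cross region boundaries. The function name is hypothetical.

```python
import math


def request_costs(total_elements, region_size):
    """Return (API calls to create the graph, elements per rspec)."""
    calls_to_create = math.ceil(total_elements / region_size)
    elements_per_rspec = min(region_size, total_elements)
    return calls_to_create, elements_per_rspec


single_region = request_costs(100, 100)  # the one-region extreme
small_regions = request_costs(100, 5)
```

Under these assumptions a one-element change costs one call carrying 100 element descriptions in the first case and one call carrying 5 in the second, while initial creation costs 1 call versus 20.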

Slivers

Finally we come to slivers. What is a sliver? A sliver is a region of a slice's virtual resource graph. Either the graph is partitionable, or it is not. If the graph is partitionable, and we choose to partition it, then we need a name for the partitions. Sliver is a fine name. But perhaps the community will insist on a different name. (Virtual Resource Assembly?)

If the graph is not partitionable, then it is not partitionable across aggregates. Then there are only slices, and each aggregate must receive the entire graph for the entire slice. If we change any part of the graph, we must pass the new slice rspec to at least all aggregates participating in the slice. We can make this approach work for demos, but it will not scale to large slices, and it will not succeed in accommodating dynamic slices. And then somebody in another project will figure out how to describe a virtual resource graph in a way that makes it partitionable, and we will move forward again from there.

I believe that we know how to describe the resources we care about as partitionable graphs. If the graph is partitionable across aggregates, then it is partitionable within aggregates. Then the only question is whether the aggregate API permits any partitioning within an aggregate. And why would the API prohibit an aggregate from grouping and organizing the virtual resources that it serves? Why would the API prohibit an aggregate from breaking its virtual resources into slivers that can be operated on through sliver APIs?

But what is a sliver *really*? What does a region of the graph represent? I have been vague in talking about "virtual resource elements" and "entities" and "virtual resources". It is an abstraction. The Vision Slide says that GENI will support heterogeneous deeply programmable virtualized infrastructure resources. What are those? We do not know. But they are heterogeneous, so there could be many different kinds. And if the architecture is to have impact over more than a few years, then it must accommodate resources that have not been invented yet. We do not know what these will look like.

What we do know is that it must be possible to describe these new virtual resources using a semantic model, and that the description will be a graph of elements and edges representing relationships among the elements. And if we have a stitching architecture that can propagate labels across edges, then the graph will be partitionable. And if the graph is partitionable, then it will be convenient to partition it into regions in order to allow the possibility that we might use the aggregate API to operate on different regions independently of other regions. For example, we can add virtual resources to a slice by attaching a new region, without changing anything about the graph as it exists. And we can remove virtual resources from a slice by detaching a region, without changing anything about the rest of the graph as it exists.
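A sketch of region-level operations, assuming an aggregate that keeps a slice's graph as named regions. The class and method names are hypothetical and are not drawn from the AM-API. Attaching or detaching one region leaves the stored description of every other region as it was.

```python
class Aggregate:
    """Hypothetical aggregate that stores a slice's graph as named regions."""

    def __init__(self):
        self.regions = {}  # region name -> set of element names

    def attach(self, name, elements):
        if name in self.regions:
            raise ValueError(f"region {name!r} already exists")
        self.regions[name] = set(elements)

    def detach(self, name):
        return self.regions.pop(name)

    def whole_graph(self):
        return set().union(*self.regions.values())


am = Aggregate()
am.attach("compute", {"vm1", "vm2"})
am.attach("storage", {"vol1"})
before = set(am.regions["compute"])

am.attach("extra", {"vm3"})   # grow the slice
am.detach("storage")          # shrink the slice
after = set(am.regions["compute"])
```

Each call carries only the elements of the region being operated on.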

We also know that this abstract notion of "regions" must cover the virtual resource cases we already understand. For example, a cloud site offers an infrastructure service that allows us to instantiate graphs of related virtual resource elements such as virtual CPU cores, memories, network interfaces, storage volumes, and virtual networks. We can take a pen and draw regions around parts of this richly connected graph of virtual resource elements. We can decide that the collection of elements adjacent to a memory constitutes a useful grouping. We can draw a region encompassing all of those virtual elements and call it a "virtual machine" or "instance". That is a reasonable choice of a region: it is the choice made by EC2-like cloud sites. EC2 also draws regions around VLANs and calls them "security groups". It considers storage volumes separately from virtual machine instances.

The Breakable Experimental Network is another interesting case. BEN is a network substrate with a multi-layer topology. We can allocate virtual network topologies from BEN. Given a virtual network topology that is planar (forget about layers for now), there are many reasonable ways to partition the network into connected regions. What is important about a region is that it offers some connectivity service among a set of locations. The aggregate might choose to expose more or less information about its internal structure. People who understand network description languages call this topology aggregation. But it is up to the network aggregate whether it allows its clients to create subnetworks separately and then stitch them together. Most advanced networks today permit only creation of paths from point A to point B. If the aggregate does support multi-point network topologies, then it is up to the client how it chooses to use those primitives. It may be useful for a client to build and evolve a virtual network one piece at a time, or it might be simpler to create a static network in one shot and then leave it alone.

Sliver Types

At one level of abstraction, we can speak of these groupings as "regions" of a graph describing any set of virtual resources. But if we know something about a specific aggregate, we can see that these groupings correspond to well-understood resource abstractions that are meaningful to users of the aggregate. For example, EC2 separates networks, storage volumes, and virtual machines. As a result, it can offer primitives to attach and detach storage volumes to/from virtual machines, and attach/detach virtual machines to/from networks. These are specific examples of generalized stitching, but these groupings can also support other useful verbs, like cloning storage volumes or suspending virtual machines.

These examples show that a given virtual resource service incorporates its own groupings of the virtual resource element graph into regions (slivers), and these groupings may allow useful operations on a sliver other than creating it and releasing it. Thus virtual resources have types that define what we can say about them and do to them. An aggregate could provide supplementary type-specific operations on slivers, in addition to common operations supported by the base sliver API. In the past, some have seemed to argue that the impracticality of a one-size-fits-all sliver API undermines the whole dream of GENI. But the notion of subtyping has been proven in many other contexts and should be comfortable here as well.
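The subtyping idea can be sketched with ordinary class inheritance. The class names and operations below (`release`, `suspend`, `clone`, `attach_to`) are hypothetical and are not taken from any GENI or EC2 API. Every sliver supports the base operations, and each type adds its own verbs.

```python
class Sliver:
    """Base operations assumed common to every sliver type."""

    def __init__(self, name):
        self.name = name
        self.state = "active"

    def release(self):
        self.state = "released"


class VirtualMachine(Sliver):
    def suspend(self):
        self.state = "suspended"


class StorageVolume(Sliver):
    def __init__(self, name):
        super().__init__(name)
        self.attached_to = None

    def clone(self, new_name):
        return StorageVolume(new_name)

    def attach_to(self, vm):
        self.attached_to = vm.name


vm = VirtualMachine("vm1")
volume = StorageVolume("vol1")
copy = volume.clone("vol1-copy")
volume.attach_to(vm)
vm.suspend()

# Generic tools can still treat all of them uniformly through the base API.
for sliver in (vm, volume, copy):
    sliver.release()
```

A tool that knows only the base type can create and release anything, while a tool that knows about storage volumes can also clone them.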

Of course, some virtual resources are programmable, and programs running on them may also expose interfaces and operations. But in general those interfaces are above the virtual resource management layer and are outside our scope of concern.

The Boundary Between Software and Semantic Specifications

So: slivers are named, typed partitions of a slice's virtual resource graph. They reside entirely within one aggregate, and their boundaries are chosen by that aggregate. The aggregate exports a type-specific API to operate on each sliver.

Another view of slivers might be "that which the API allows us to name and operate on". If we want to operate on a virtual resource element that isn't named through the API, then we must name it and operate on it in the rspec for its containing sliver or slice, or whatever the granularity of that rspec is. If we don't enable type-specific operations on a sliver through the sliver API, then these operations must be represented somehow as verbs in the rspec, or (worse) they won't be supported at all. Putting verbs in a semantic resource description is a bad idea: if we want to use a language for imperative programming, then we should use an imperative programming language.

These choices will drive the balance of focus on the API vs. declarative specifications. In one direction we have a system that uses a few simple API calls to pass around large resource descriptions that are diffed and acted upon in different ways at multiple aggregates. In the other direction we have a system that uses many calls to a diversity of APIs on a diversity of sliver objects, with each call carrying a small rspec document pertaining to the sliver object being operated on.

A Footnote on ORCA

We traveled this line of reasoning some time ago in developing the ORCA system. And yet ORCA has nothing that we call "slivers".

ORCA AM calls operate on objects called resource leases. Leases are time-bounded contracts for one or more units of typed virtual resources. The units in a lease must have the same type and parameters (e.g., sizes). These units are the closest analogue to slivers, so let us call them slivers. The canonical example of a resource lease request is something like "get me 20 large virtual machines for an hour". (But that is just an example.)

Leases have states and state machine transitions that are independent of the resource type. (E.g., initializing, active, closing, closed.) The resource-specific code (setup, teardown) is implemented in pluggable back-end handler scripts that interact with some underlying virtual sliver service, e.g., a cloud middleware system or a network provisioning system. An aggregate may have many such handlers for different sliver types: an ORCA aggregate is not limited to one type of virtual resource.
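The separation between a type-independent state machine and pluggable handlers can be sketched as follows. This is not ORCA's implementation or API: the `Lease` class, the handler table, and the in-process callables standing in for handler scripts are assumptions. Only the state names come from the description above.

```python
class Lease:
    """Hypothetical lease whose state machine ignores the resource type."""

    def __init__(self, resource_type, handlers):
        self.resource_type = resource_type
        self.handler = handlers[resource_type]  # type-specific plug-in
        self.state = "initializing"

    def activate(self):
        self._require("initializing")
        self.handler["setup"]()
        self.state = "active"

    def close(self):
        self._require("active")
        self.state = "closing"
        self.handler["teardown"]()
        self.state = "closed"

    def _require(self, expected):
        if self.state != expected:
            raise RuntimeError(f"lease is {self.state}, expected {expected}")


calls = []
handlers = {
    "vm": {"setup": lambda: calls.append("boot vm"),
           "teardown": lambda: calls.append("destroy vm")},
    "vlan": {"setup": lambda: calls.append("provision vlan"),
             "teardown": lambda: calls.append("release vlan")},
}

lease = Lease("vlan", handlers)
lease.activate()
lease.close()
```

Supporting a new sliver type in this sketch means adding an entry to `handlers`; the `Lease` class itself does not change.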

A key property of ORCA resource leases is that they expire if the client does not renew them. That property is important for GENI, but is out of scope for this discussion. The set of slivers in an ORCA lease may be changed in various ways when the lease is renewed (extended). This is one way to grow and shrink slices in ORCA. However, I now believe that the idea of multiple slivers per lease was a mistake. It complicated the code and caused a lot of unnecessary debugging effort (in 2005), is useless for networks, and makes it impossible to change some slivers independently of other slivers if they are in the same lease. In GENI we always use ORCA with one sliver per lease. Used in this way, an ORCA lease is a pretty close analogue of a sliver. One can grow slices by adding leases (slivers), and shrink slices by closing leases or allowing them to expire.
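Growing and shrinking a slice with one sliver per lease can be sketched as simple bookkeeping over expiry times. The class below is hypothetical and is not ORCA code. Times are plain numbers in arbitrary units so that the example is deterministic.

```python
class SliceLeases:
    """Hypothetical bookkeeping for a slice made of one-sliver leases."""

    def __init__(self):
        self.expires_at = {}  # lease name -> expiry time

    def add(self, name, now, term):
        self.expires_at[name] = now + term

    def renew(self, name, now, term):
        if self.expires_at[name] <= now:
            raise ValueError(f"lease {name!r} has already expired")
        self.expires_at[name] = now + term

    def close(self, name):
        del self.expires_at[name]

    def active(self, now):
        return sorted(n for n, t in self.expires_at.items() if t > now)


leases = SliceLeases()
leases.add("vm-lease", now=0, term=10)
leases.add("net-lease", now=0, term=10)
at_1 = leases.active(now=1)

leases.renew("vm-lease", now=5, term=10)  # now expires at time 15
at_12 = leases.active(now=12)             # net-lease lapsed at time 10
```

Because each lease holds a single sliver, renewing or dropping one has no effect on the others.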

Recently people have started saying that ORCA does not have UpdateSliver, but I am not sure if they are right because I still don't know what they mean by UpdateSliver. ORCA defines another operation on a lease (sliver), called Modify, that has never yet been fully implemented. Modify was intended as a hook for pluggable type-specific actions on the slivers in a lease. One might think of it as sort of a kitchen-sink ioctl. But this seems different from the UpdateSliver planned for the AM-API. An ORCA slice can have many slivers at the same AM, and can create and release them independently, so the stated motivation for the AM-API UpdateSliver does not seem to apply. (?)
