
Version 12 (modified by chase@cs.duke.edu, 8 years ago)

Rewrote OpenFlow text to reduce confusion, fixed a few other nits

How I learned to stop worrying and love slivers

Jeff Chase, speaking only for myself.

The GENI debate over slivers has always seemed surreal to me, because it cannot be a technical disagreement. It can only be a disagreement about what is meant by the term "sliver". And that is pointless because a sliver could be anything. So it quickly comes down to one's views about the goals and scope of the entire GENI undertaking. And yet at the same time, the sliver debate raises a crucial technical issue because it will drive the implementation balance between declarative specifications and code, which is a major architectural tussle in GENI. I am being provocative here but please bear with me.

In the Spring of 2008 several of us made a pilgrimage to CNRI to meet some Internet luminaries and talk about GENI. (That was the last time I saw Jay Lepreau, bless him.) As I recall, Vint Cerf was eloquent and animated on the theme that what is (or was) new in GENI is describing and reasoning about virtualized resources and their configurations. (Federation is another challenge, but that is mostly a question of figuring out how to use prior work.)

GENI is all about heterogeneous deeply programmable virtualized infrastructure resources, also known as programmable substrate. These words are taken from my favorite GENI Vision slide (pdf), which is now four years old and which I have been using for almost as long. (If anyone remembers the history of this slide, I would like to know it.)

So, what is a virtualized infrastructure resource? Well, that's a good question. In fact, it was the question Vint Cerf asked us in Reston, and so it must be a good question. He said that defining what we mean by "network virtual resource" and describing such resources are the key challenges for GENI. I don't want to put words in his mouth. But it's in my notes.

We seem to find lots to do to avoid that challenge, but we are forced to confront it every time we talk about slivers and rspec.

Virtual Resources

A virtual resource ("virtualized infrastructure resource") is something you can have or not have. While you have it, it might conduct some activity for you, like running a program, or storing data, or carrying your packets, or looking at someone else's packets as they fly by, or sensing the real world. Or anything, really.

A virtual resource is something you can allocate: if you get it, somebody else can't have it. At one end of the continuum we are allocated exclusive access to some physical substrate, e.g., a component. But most often when we interact with a virtual resource there is a layer of software between us and the hardware substrate itself (the metal and glass). So what we are really interacting with, at least in part, is the software. We might occupy time or space on some actual substrate resource during the activity (that's the mutual exclusion part), but the software may do some scheduling of that resource and otherwise control the activity in some way.

In other words, a virtual resource is really an instance of a service---an infrastructure service. GENI has always focused on what is now called infrastructure-as-a-service (IaaS) in the industry and in the NIST cloud taxonomy.

So when we describe a virtual resource, we might also have to say something about the details of the service instance. For example, we might have to describe properties of the scheduler, like the degree of assurance that the instance will be permitted to occupy space or time on any actual substrate.

Some virtual resources mix software with the hardware to varying degrees. Surely a Python environment like Google's AppEngine is a programmable virtual resource, for example. There is more to say about the distinction between an infrastructure service and any other kind of software service along the continuum from grid job services to application services. NIST has already laid out that space for us, so I will leave that question. People sometimes seem uncomfortable that we do not know how far down this continuum we might go with the GENI virtual resource abstractions. What we do know is that we can describe physical resources fairly easily, but it gets harder to describe our virtual resources as we move down the continuum toward application services. At some point we can no longer say precisely what we mean by "virtualization".

OpenFlow

Before I move on, let me say a little about OpenFlow, which is an interesting case of programmable substrate. OpenFlow enables external controllers to install "flow entries" in the network datapath to manage traffic. Flow entries define pattern-action rules: the patterns match packets flowing through the network, and the actions operate on matching packets, e.g., to redirect or transform them. The patterns may match packets with specific combinations of header values (the controller's "flowspace").
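To make the pattern-action idea concrete, here is a minimal sketch of a flow table lookup. It is an illustration only: the field names ("vlan", "ip_dst"), the action strings, and the priority tie-breaking are assumptions chosen for the example, not OpenFlow's actual match structure or wire format.

```python
from dataclasses import dataclass

@dataclass
class FlowEntry:
    """A pattern-action rule (illustrative; not the OpenFlow wire format)."""
    match: dict        # header field -> required value; absent fields are wildcards
    action: str        # hypothetical action string, e.g. "output:2" or "drop"
    priority: int = 0

def matches(entry, packet):
    """True if every field named in the pattern has the required value in the packet."""
    return all(packet.get(field) == value for field, value in entry.match.items())

def lookup(table, packet):
    """Return the action of the highest-priority matching entry, or None if nothing matches."""
    candidates = [entry for entry in table if matches(entry, packet)]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.priority).action

# A two-entry table: forward VLAN 100 traffic, but drop packets to one address.
table = [
    FlowEntry({"vlan": 100}, "output:2", priority=1),
    FlowEntry({"vlan": 100, "ip_dst": "10.0.0.5"}, "drop", priority=10),
]
```

In this sketch the set of packets that can match some entry in the table plays the role of the controller's flowspace.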

In GENI it is common to speak of flowspace as a "resource" that an experiment requests from an OpenFlow aggregate and receives as a sliver. Clearly this is a very different concept of a GENI resource than, say, a virtual machine (VM). A VM is an abstraction of a substrate element (a physical machine), and can be bound to a set of substrate resources (CPU, memory, etc.). The challenges for managing VMs are to decide who gets use of the substrate resource (resource allocation) and how to connect a VM to other virtual resources (stitching).

Flowspace represents something quite different: a right to control packets that may belong to someone else. When I say "virtual resource" I am not talking about flowspace granted to a controller, which I see as an issue of authorization rather than resource management. It is true that flowspace consists of combinations of network names (VLANs, MAC addresses, IP addresses), and that these names are "resources" allocated from some pool. But they are allocated to the owners of the packets, and not to the controller.

The point is that OpenFlow is a good test case for a GENI sliver abstraction, but we should be very careful when we talk about slivers in OpenFlow. I don't subscribe to the "slivers of flowspace" view. It makes more sense to me that an OpenFlow sliver would be a group of flow entries (which I will call a "ruleset") installed in the datapath at a particular location on behalf of a controller. That concept of a sliver is similar to a VM: we can think of a ruleset as a program that runs in the network and consumes hardware resources there. The amount of substrate resource that a ruleset consumes depends in part on the amount of network traffic generated by someone else, but in this respect OpenFlow is no different from many uses of VMs whose resource demands are driven by external requests and may be hard to predict in advance. Some resources must be assured for acceptable service quality, which brings us back to the need to describe scheduling properties. And it would be good if we could apply our sliver abstraction to the problem of managing the hardware resources consumed by OpenFlow rulesets. But I digress.

Semantic Models

So: there are many kinds of virtual resources, and the goal of GENI is to operate on them. To operate on them we must describe them. But there are so many kinds to describe. And as Alice said, the question is whether you *can* make words mean so many different things. But what do we want to say about these virtual resources?

When we talk about virtual resources, we can often identify distinct elements within them to talk about. For example, there are virtual machines, and network pipes, and logical storage containers. These examples suggest that our virtual resources tend to closely match the shapes and behaviors of actual substrate elements (components). Indeed, a physical component might be allocated directly as a virtual resource (as in early Emulab). But virtual resources are above the component layer: a component might host more than one virtual resource, or a virtual resource might span components.

We can describe these various entities independently in terms of their properties and their relationships. That is good, because if our goal is to describe virtual resources precisely enough to process the descriptions automatically, then we are going to need a semantic model, and semantic models are almost by definition based on classifying entities and their relationships. There is a rich literature on entity-relationship models going back 35 years.

Indeed, a large part of the GENI challenge is in developing and processing declarative specifications of virtual resources using semantic models. We can use these models to create documents that describe virtual resources in terms of elements and relationships. When we request virtual resources or changes to virtual resources, we attach documents that describe the resources and changes we want. If a request is granted, we receive documents describing the virtual resources we got. These documents are called rspec.

What is important here is that virtual resources have an internal structure and properties, and we describe these using declarative specifications. But what does such an rspec document describe? Must it describe a complete slice? Or can we describe different pieces of a slice in different documents? As we will see, this is an essential question for understanding the role of slivers in GENI.

Partitioning Virtual Resources Across Aggregates

Another aspect of the GENI challenge is that virtual resources are distributed. They span a "Federated International Infrastructure" in the words of my favorite GENI Vision slide. We recognize that substrate resources are grouped into aggregates owned by infrastructure providers. In general, we seem to be willing to presume that this grouping is a partitioning: each piece of infrastructure is controlled by exactly one aggregate. I sometimes hear people talk about hierarchical "aggregates of aggregates", but I think even they would agree that each piece of infrastructure is controlled by exactly one leaf aggregate in the hierarchy.

We also seem to accept that we can partition virtual resources across aggregates in a way that mirrors the partitioning of the substrate resources. That is, a virtual resource is provided by a single aggregate, and consumes substrate resources only on that aggregate. To make changes to a virtual resource, we send requests about it to the aggregate that controls it. Users and their tools can talk to aggregates independently of other aggregates.

We can think of these virtual resources as the "parts" of a slice. Users and their tools obtain these parts from different suppliers --- the aggregates --- and assemble them into a slice. We can make an analogy to building any kind of complex machine from off-the-shelf parts obtained from multiple suppliers. The analogy isn't perfect: if the part is a virtual resource, then the supplier must host and operate the part rather than merely shipping a physical widget out into the physical world. As we have said, the "part" is really a service. This is also why one can never really own virtual resources in the same way that one owns a physical part: virtual resources are services provided over a network, and the provider controls the actual substrate and can stop providing the service at any time. But let us go with the analogy as far as it takes us.

The next question is, when we get a part from a supplier, do we need to tell the supplier about our parts from other suppliers? When we talk to an aggregate about our virtual resources there, it is reasonable that we would want to limit the conversation to the specific infrastructure service that aggregate provides. Similarly, if a builder or manufacturer gets parts from a supplier, they do not have to show the supplier the blueprints for the entire project. It is understood that various materials and parts are available to the customer from different suppliers, and that these pieces fit together in various ways. The customer may select the parts and combinations and use them to build whatever the customer wants, without telling the parts suppliers about the overall assembly. The materials and parts and means of assembling them may change with time, and we can't say in advance what they all are. But it is understood that it improves efficiency to have interchangeable off-the-shelf parts with standard well-defined compatibilities. This familiar idea was called the American System in the 1830s.

Can we apply the American System to virtual resources in GENI? This partitioning of virtual resources is a crucial step. If we are going to describe complex slices and change them over time, then a divide and conquer strategy will simplify the task considerably.

Stitching

The key challenge to overcome for partitioning our resource descriptions is that there are relationships among the virtual resource elements. And to the extent that we have these relationships among virtual resources on different aggregates, those aggregates may need to interact, perhaps through some intermediary. In GENI we call these interactions stitching. Many of the driving use cases for stitching involve interconnecting virtual resources within a slice. For example, we use stitching to connect virtual network pipes into paths or networks terminating at virtual nodes.

Can our semantic models describe all the relationships that might require such interactions? If the answer is yes, then we can talk to each aggregate about the relationships that cross its borders, without it having to be aware of any virtual resources that are unrelated to what we want that aggregate to do. In other words, the graph is partitionable. If the answer is no, then we might need to tell every aggregate about every virtual resource, in case there is some important relationship that we missed. Perhaps there is some relationship that is not represented explicitly in the description, but that an aggregate can infer from the descriptions of other virtual resources at other aggregates. In that case, the aggregate must have all of those descriptions available to it. The graph is not partitionable.

It is logical that we would seek to describe all such relationships in our semantic resource descriptions. If we discover that we have missed an important relationship, that means our semantic model is insufficient, and we should go back and extend it or rethink it.

I believe that these interactions are relatively easy to describe for network virtual resources. What is difficult is to describe the service that a network virtual resource provides. But once we describe the service, a relationship is almost always a binding of one virtual resource to the service provided by another. In networked systems those service endpoints always have names or labels allocated from some network namespace: VLAN tags, IP addresses, ports, DNS names, URLs, LUNs, lambdas, pathnames, alone or in combination with other identifiers. What is needed is to describe which virtual resources are providing a service and which are consuming that service. Then we can bind the consumer to the producer by passing the producer's label to the consumer. In essence, the graph becomes a dependency DAG, with directed edges from producers to consumers. A stitching agent traverses the DAG, instantiating resources and propagating labels to their successors as the labels become available. This is how ORCA does stitching.
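The traversal described above can be sketched in a few lines. This is not ORCA's code: the function names, the shape of the dependency map, and the callback signature are all assumptions made for illustration. It uses graphlib from the Python standard library (Python 3.9 or later) to visit producers before their consumers.

```python
from graphlib import TopologicalSorter

def stitch(depends_on, instantiate):
    """Instantiate resources in dependency order, passing producer labels to consumers.

    depends_on  -- maps each resource name to the set of producers it binds to
    instantiate -- hypothetical callback: (name, {producer: label}) -> this resource's label
    """
    labels = {}
    # static_order() yields every producer before any resource that depends on it.
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {producer: labels[producer] for producer in depends_on.get(name, ())}
        labels[name] = instantiate(name, inputs)
    return labels

# Example: two virtual nodes that both attach to one virtual network pipe.
depends_on = {"vlan": set(), "vm_a": {"vlan"}, "vm_b": {"vlan"}}
log = []

def fake_instantiate(name, inputs):
    """Stand-in for a real aggregate call; records what each resource was told."""
    log.append((name, dict(inputs)))
    return "label-of-" + name

labels = stitch(depends_on, fake_instantiate)
```

Here the pipe is created first, and its label (standing in for something like a VLAN tag) is handed to each node when that node is created. A cycle in the dependency map makes TopologicalSorter raise an error, which corresponds to a malformed configuration.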

We should be able to describe these relationships using our semantic models, and propagate labels by querying descriptions based on those models. We don't need to write any new code for stitching. We don't need to describe the producer to the consumer if the consumer already understands what kind of resource or service it wants to bind to. If it does not, then the configuration of virtual resources is malformed.

Partitioning Virtual Resources Within Aggregates

Suppose then that we can describe virtual resources of a slice by a graph of elements (entities) and relationships, using a semantic model. Suppose further that we can partition the graph across aggregates as I have described, so that we talk to each aggregate only about the virtual resource elements that it hosts, and any adjacent edges. Edges crossing partition boundaries require coordination among a pair of aggregates, i.e., stitching.
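As a rough sketch of this partitioning, assume the slice graph is given as a map from each element to its hosting aggregate plus a list of edges. The function below computes what each aggregate would be told (its own elements and any adjacent edges) and which edges cross a boundary and therefore need stitching. The data layout and names are hypothetical.

```python
def partition(hosted_at, edges):
    """Split a slice graph by aggregate.

    hosted_at -- maps element name -> aggregate name
    edges     -- list of (element, element) relationships
    Returns (views, crossing): per-aggregate views, and the edges that span two aggregates.
    """
    views = {agg: {"elements": set(), "edges": []} for agg in set(hosted_at.values())}
    for element, agg in hosted_at.items():
        views[agg]["elements"].add(element)

    crossing = []
    for u, v in edges:
        # Each aggregate sees every edge adjacent to one of its own elements.
        for agg in {hosted_at[u], hosted_at[v]}:
            views[agg]["edges"].append((u, v))
        if hosted_at[u] != hosted_at[v]:
            crossing.append((u, v))
    return views, crossing

# Example: two VMs at aggregate A, a link at B, one VM at C.
hosted_at = {"vm1": "A", "vm2": "A", "link": "B", "vm3": "C"}
edges = [("vm1", "vm2"), ("vm1", "link"), ("link", "vm3")]
views, crossing = partition(hosted_at, edges)
```

In this example aggregate C learns only about vm3 and its edge to the link; it never sees vm1 or vm2.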

Now the question is: how does the aggregate expose the graph through its API, so that a slice owner can operate on the graph? In the current (or near future) AM-API there are simple calls to operate on the graph: create, destroy, and (soon) update. The create and update operations take as an argument an rspec document describing at least the entire partition of the graph residing at that aggregate. The requester says what region(s) of the graph they want to operate on somewhere in the rspec document attached to the request, and not in the API.

Thus the AM-API itself offers no way to talk to an aggregate about some regions of the graph independently of other regions of the graph. If we want to add resources, we must pass an rspec for the entire graph, with the new parts added. If we want to remove resources, we must pass a description for the entire graph, with some parts removed.
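One consequence is that the aggregate has to work out what the requester meant by comparing the new document with the old one. A minimal sketch of that comparison follows, assuming each rspec has already been parsed into a map from element name to a property dictionary. Real rspec documents are far richer; the function name and data layout are invented for illustration.

```python
def diff_rspec(current, requested):
    """Compare two whole-graph descriptions and report what the update implies.

    Both arguments map element name -> property dict (an assumed, simplified rspec form).
    """
    added = {name: props for name, props in requested.items() if name not in current}
    removed = sorted(name for name in current if name not in requested)
    changed = {
        name: props
        for name, props in requested.items()
        if name in current and current[name] != props
    }
    return added, removed, changed

# Example: the update drops vm2, resizes vm1, and adds vm3.
current = {"vm1": {"cores": 2}, "vm2": {"cores": 2}}
requested = {"vm1": {"cores": 4}, "vm3": {"cores": 1}}
added, removed, changed = diff_rspec(current, requested)
```

Note that the whole description travels with every request, even though only the difference matters to the aggregate.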

But if a virtual resource graph can be partitioned across aggregates, then it must also be possible to partition the graph within aggregates. We can break the graph into named regions and use the aggregate API to talk about specific regions, passing the rspec for only the regions of interest. We can let the aggregate handle any edges that cross region boundaries within the aggregate.

If we can partition the graph into regions, how shall we decide where to set the boundaries? How big shall we make the regions? At one extreme there is a single region: this is the degenerate case represented by the v2.0 AM-API. We can make the changes we want using a single API call, but we must pass rspec for the entire graph, even for a minor change. If we introduce region boundaries, then we have a tradeoff. With smaller regions, we need more requests to instantiate or update a given graph, but each request passes a smaller rspec. With larger regions, we make fewer API calls to instantiate or update the graph, but the rspec documents are larger.

Slivers

Finally we come to slivers. What is a sliver? A sliver is a region of a slice's virtual resource graph. A sliver API allows a client to operate on one region independently of other regions. For example, we can add virtual resources to a slice by attaching a new region, without changing anything about the graph as it exists. And we can remove virtual resources from a slice by detaching a region, without changing anything about the rest of the graph as it exists.
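A toy model may help fix the idea. The class below keeps a slice's graph at one aggregate as a set of named regions and lets a client attach or detach one region without touching the others. The class and method names are made up for this sketch and are not calls from the AM-API or any other GENI interface.

```python
class SliceAtAggregate:
    """The portion of a slice hosted at one aggregate, held as named regions (slivers)."""

    def __init__(self):
        self.slivers = {}  # sliver name -> {element name: properties}

    def create_sliver(self, name, elements):
        """Attach a new region; the request carries only that region's description."""
        if name in self.slivers:
            raise ValueError("sliver already exists: " + name)
        self.slivers[name] = dict(elements)

    def delete_sliver(self, name):
        """Detach one region and return its elements; other regions are untouched."""
        return self.slivers.pop(name)

    def elements(self):
        """The full set of elements currently in the slice at this aggregate."""
        merged = {}
        for region in self.slivers.values():
            merged.update(region)
        return merged

# Example: grow the slice by one region, then shrink it again.
s = SliceAtAggregate()
s.create_sliver("compute", {"vm1": {"cores": 2}, "vm2": {"cores": 2}})
s.create_sliver("storage", {"vol1": {"size_gb": 100}})
s.delete_sliver("storage")
```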

There seem to be three classes of arguments "against" slivers. First: the name sliver is confusing to people. Well, we can change the name, but it probably won't help. I think sliver is a good name, because it sounds like what it is---a piece of a slice---and it doesn't already mean something else. Other options proposed, like resource or component, might make the situation worse by promoting a confusion between virtual resources and physical resources. (I say this recognizing that they are sometimes the same thing, but that makes the confusion more confusing, not less.)

The second common argument makes various assumptions about what others assume slivers to be, and then argues that system X is different. Indeed, perhaps system X is different.

The third argument is (in essence) that virtual resources are tightly coupled, and we can't operate on them independently, or that trying to do so might add undue complexity to the API, or might get us into trouble later if we discover unexpected dependencies.

The third argument merits a substantive response. My response is: either the virtual resource graph is partitionable, or it is not. If the graph is not partitionable, then it is not partitionable across aggregates. Then there are only slices, and each aggregate must receive the entire graph for the entire slice. If we change any part of the graph, we must pass the new slice rspec to at least all aggregates participating in the slice. We can make this approach work for demos, but it will not scale to large slices, and it will not succeed in accommodating dynamic slices. And then somebody in another project will figure out how to describe a virtual resource graph in a way that makes it partitionable, and we will move forward again from there.

On the other hand, if the graph is partitionable across aggregates, then it is also partitionable within aggregates. Then the only question is: should the aggregate API permit any partitioning within an aggregate, at the aggregate's discretion? And why would the API prohibit that? Why would the API prohibit an aggregate from grouping and organizing the virtual resources that it serves? Why would the API prohibit an aggregate from breaking its virtual resources into slivers that can be operated on through sliver APIs? If the aggregate finds its resources to be not easily partitionable, then it can always choose not to partition them, and group all the virtual resources of a slice at that aggregate as a single sliver. It then has all the advantages of the current AM-API.

Needless to say, I hope and believe that the virtual resource graph is partitionable.

Types of Slivers

But what is a sliver *really*? I have been speaking at a high level of abstraction of these groupings as "regions" of a graph describing any set of virtual resources. But what does a region of the graph represent? If we know something about a specific aggregate, we can see that these groupings correspond to well-understood resource abstractions that are meaningful to users of the aggregate.

Let's consider some virtual resource cases we already understand. For example, a cloud site offers an infrastructure service that allows us to instantiate graphs of related virtual resource elements such as virtual CPU cores, memories, network interfaces, storage volumes, and virtual networks. We can take a pen and draw regions around parts of this richly connected graph of virtual resource elements. We can decide that the collection of elements adjacent to a memory constitutes a useful grouping. We can draw a region encompassing all of the cores and virtual devices adjacent to a memory and call it a "virtual machine" or "instance". That is a reasonable choice of a region: it is the choice made by EC2-like cloud sites. EC2 also draws regions around VLANs and calls them "security groups". It considers storage volumes separately from virtual machine instances.

The Breakable Experimental Network is another interesting case. BEN is a network substrate with a multi-layer topology. We can allocate virtual network topologies from BEN. Given a virtual network topology that is planar (forget about layers for now), there are many reasonable ways to partition the network into connected regions. What is important about a region is that it offers some connectivity service among a set of locations. The aggregate might choose to expose more or less information about its internal structure. People who understand network description languages call this topology aggregation. But it is up to the network aggregate whether it allows its clients to create subnetworks separately and then stitch them together, and it is up to the client how it chooses to use those primitives. It may be useful for a client to build and evolve a virtual network one piece at a time, or it might be simpler to create a static network in one shot and then leave it alone.

These examples show that a given virtual resource service incorporates its own groupings of the virtual resource element graph into regions (slivers), and these groupings may allow useful operations on a sliver other than creating it and releasing it. EC2 separates networks, storage volumes, and virtual machines, and as a result it can offer primitives to attach and detach storage volumes to/from virtual machines, and attach/detach virtual machines to/from networks. These are specific examples of generalized stitching, but these groupings can also support other useful verbs, like cloning storage volumes or suspending virtual machines.

Thus virtual resources have types that define what we can say about them and do to them. An aggregate could provide supplementary type-specific operations on slivers, in addition to common operations supported by the base sliver API. Of course, some virtual resources are programmable, and programs running on them may also expose interfaces and operations. But in general those interfaces are above the virtual resource management layer and are outside our scope of concern.
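In programming terms this is ordinary subtyping. The sketch below shows a base sliver with the common operations and two subtypes that add type-specific verbs (suspend for virtual machines, clone for storage volumes). All class names, method names, and state strings are hypothetical and do not come from any GENI or EC2 API.

```python
class Sliver:
    """Base sliver: the operations every sliver type supports."""

    def __init__(self, name):
        self.name = name
        self.state = "active"

    def release(self):
        self.state = "released"

class VirtualMachine(Sliver):
    """A sliver type that adds suspend and resume."""

    def suspend(self):
        self.state = "suspended"

    def resume(self):
        self.state = "active"

class StorageVolume(Sliver):
    """A sliver type that adds cloning."""

    def __init__(self, name, size_gb):
        super().__init__(name)
        self.size_gb = size_gb

    def clone(self, new_name):
        # The clone is a new, independent sliver with the same size.
        return StorageVolume(new_name, self.size_gb)

# Example: common and type-specific operations side by side.
vm = VirtualMachine("vm1")
vm.suspend()
vol = StorageVolume("vol1", size_gb=100)
copy = vol.clone("vol1-copy")
vol.release()
```

A client that knows only the base type can still release any sliver; a client that knows the subtype can use the extra verbs.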

But sliver is a very abstract abstraction. There will be other kinds of slivers that don't look like these examples. There will be aggregates whose mapping to slivers is unclear, including some (like OpenFlow) whose functions have little to do with resource allocation. The Vision Slide says that GENI will support heterogeneous deeply programmable virtualized infrastructure resources. What are those? We do not know. But they are heterogeneous, so there could be many different kinds. And if the architecture is to have impact over more than a few years, then it must accommodate resources that have not been invented yet. We do not know what these will look like.

What we do know is that it must be possible to describe these new virtual resources using a semantic model, and that the description will be a graph of elements and edges representing relationships among the elements. And if we have a stitching architecture that can propagate labels across edges, then the graph will be partitionable. And if the graph is partitionable, then it will be convenient to partition it into regions in order to allow the possibility that we might use the aggregate API to operate on different regions independently of other regions. But we can't say in advance what the regions might represent, or what the various type-specific sliver APIs might be (except for the ones we understand now).

In the past, some have seemed to argue that the impracticality of a one-size-fits-all sliver API undermines the whole dream of GENI. But the notion of subtyping has been proven in many other contexts and should be comfortable here as well.

The Boundary Between Software and Semantic Specifications

So: slivers are named typed partitions of a slice's virtual resource graph. They reside entirely within one aggregate, and their boundaries are chosen by that aggregate. The aggregate exports a type-specific API to operate on each sliver.

Another view of slivers might be "that which the API allows us to name and operate on". If we want to operate on a virtual resource element that isn't named through the API, then we must name it and operate on it in the rspec for its containing sliver or slice, or whatever the granularity of that rspec is. If we don't enable type-specific operations on a sliver through the sliver API, then these operations must be represented somehow as verbs in the rspec, or (worse) they won't be supported at all. Putting verbs in a semantic resource description is a bad idea: if we want to use a language for imperative programming, then we should use an imperative programming language.

These choices will drive the balance of focus on the API vs. declarative specifications. In one direction we have a system that uses a few simple API calls to pass around large resource descriptions that are diffed and acted upon in different ways at multiple aggregates. In the other direction we have a system that uses many calls to a diversity of APIs on a diversity of sliver objects, with each call carrying a small rspec document pertaining to the sliver object being operated on.

A Footnote on ORCA

We traveled this line of reasoning some time ago in developing the ORCA system. And yet ORCA has nothing that we call "slivers".

ORCA AM calls operate on objects called resource leases. Leases are time-bounded contracts for one or more units of typed virtual resources. The units in a lease must have the same type and parameters (e.g., sizes). These units are the closest analogue to slivers, so let us call them slivers. The canonical example of a resource lease request is something like "get me 20 large virtual machines for an hour". (But that is just an example.)

Leases have states and state machine transitions that are independent of the resource type. (E.g., initializing, active, closing, closed.) The resource-specific code (setup, teardown) is implemented in pluggable back-end handler scripts that interact with some underlying virtual sliver service, e.g., a cloud middleware system or a network provisioning system. An aggregate may have many such handlers for different sliver types: an ORCA aggregate is not limited to one type of virtual resource.

A key property of ORCA resource leases is that they expire if the client does not renew them. That property is important for GENI, but is out of scope for this discussion. The set of slivers in an ORCA lease may be changed in various ways when the lease is renewed (extended). This is one way to grow and shrink slices in ORCA. However, I now believe that the idea of multiple slivers per lease was a mistake. It complicated the code and caused a lot of unnecessary debugging effort (in 2005), is useless for networks, and makes it impossible to change some slivers independently of other slivers if they are in the same lease. In GENI we always use ORCA with one sliver per lease. Used in this way, an ORCA lease is a pretty close analogue of a sliver. One can grow slices by adding leases (slivers), and shrink slices by closing leases or allowing them to expire.
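The lease behavior described in the last few paragraphs can be sketched as a small state machine. This is an illustration, not ORCA's implementation: the class name, the handler table keyed by sliver type, and the use of plain integers for time are assumptions. It models one sliver per lease, with type-independent states and pluggable setup and teardown handlers.

```python
class Lease:
    """A time-bounded lease on one sliver, with type-independent state transitions."""

    NEXT_STATE = {"initializing": "active", "active": "closing", "closing": "closed"}

    def __init__(self, sliver_type, end_time, handlers):
        # handlers maps sliver type -> (setup, teardown) callables: the pluggable back end.
        self.sliver_type = sliver_type
        self.end_time = end_time
        self.setup, self.teardown = handlers[sliver_type]
        self.state = "initializing"

    def _advance(self):
        self.state = self.NEXT_STATE[self.state]

    def activate(self):
        if self.state != "initializing":
            raise RuntimeError("cannot activate a lease in state " + self.state)
        self.setup()
        self._advance()          # initializing -> active

    def renew(self, new_end_time):
        if self.state != "active":
            raise RuntimeError("cannot renew a lease in state " + self.state)
        self.end_time = new_end_time

    def close(self):
        if self.state != "active":
            raise RuntimeError("cannot close a lease in state " + self.state)
        self._advance()          # active -> closing
        self.teardown()
        self._advance()          # closing -> closed

    def tick(self, now):
        """Expire the lease if it is active and its term has ended."""
        if self.state == "active" and now >= self.end_time:
            self.close()

# Example handler table with a single sliver type; events records handler calls.
events = []
handlers = {"vm": (lambda: events.append("setup vm"), lambda: events.append("teardown vm"))}
```

Growing a slice then amounts to creating more Lease objects, and shrinking it amounts to closing some of them or letting them expire.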

Recently people have started saying that ORCA does not have the UpdateSliver function. An ORCA slice can have many slivers at the same AM, and can create and release them independently, so ORCA has the function of UpdateSliver to grow or shrink a slice at an aggregate. Also, a caller can change certain sliver parameters at lease extension time, which may cover other planned functions of UpdateSliver. ORCA defines another operation on a lease (sliver), called Modify, which has never yet been fully implemented. Modify is intended as a hook for pluggable type-specific actions on the slivers in a lease. One might think of it as sort of a kitchen-sink ioctl. But this seems different from the UpdateSliver planned for the AM-API.
