[[PageOutline]]

These are numbered questions sent to exogeni-design@geni.net; discussions are captured to track resolution. Questions are crossed out when answered. The person closing a question should add his/her name to the question with a brief explanation. Any document that is referred to by URL is attached to the page for historical reference.

The following initials are used to attribute notes:
{{{
AS: Adam Slagell
CE: Chip Elliot
IB: Ilia Baldine
JS: Josh Smift
NR: Niky Riga
AH: Aaron Helsinger
CG: Chaos Golubitsky
JC: Jeff Chase
LN: Luisa Nevers
VT: Vic Thomas
BV: Brad Viviano
HD: Heidi Dempsey
JM: Jonathan Mills
NB: Nick Bastin
TU: Tim Upthegrove
}}}

= GPO ExoGENI Questions =

 * 1. ~~Does ExoSM speak GENI API?~~ (nriga)
{{{
Yes, ExoSM is just like any Orca SM running in a rack and can be thought of as a GENI AM that can make reservations in racks as well as provide the network connecting resources on different racks.

Per Ilia's comments, ExoSM can also give an experimenter resources from only one rack by making a bound request that binds all resources to a specific rack.

My understanding is also that all topology information is available to the experimenter through the GENI API (listresources) only through the ExoSM and not through the rack-local Orca SMs.
}}}
 * 2. ~~Can you describe the ExoGENI software stack a bit more in the teleconf (Figure 7)?~~ (ahelsing)
   * 2a. ~~Is the AltG API the same as the Orca XMLRPC API at the SM?~~ (ahelsing)
{{{
Yes.
}}}
   * 2b. ~~Can you draw the software stack for the worker nodes in the same style as Figure 7 for comparison?~~ (ahelsing)
{{{
Worker nodes are either turned off (booted and installed using xCAT when needed) or run CentOS 6.1 with OpenStack worker node configuration. The cloud worker nodes also need a cloud node manager installed. This requires minor modifications for NEuca. This is the thing that lets us create multiple interfaces on VMs and stitch them to other VLANs.
Do we understand what controls how many nodes are bare-metal and how many are available for VMs? Can this be adjusted on the fly? By whom?

The allocation question is a policy question; the mechanism should be defined later. Both are postponed. Josh to get the software stack for the worker node.
}}}
 * ~~3. Is Eucalyptus or OpenStack used for the compute resources?~~ (chaos)
{{{
OpenStack
}}}
   * ~~3a. If OpenStack is being used, what testing or analysis convinced you to choose OpenStack?~~ (chaos)
{{{
We've done a performance comparison. OpenStack instances boot significantly faster (orders of magnitude) due to the use of COW (copy-on-write) for boot images. We can also see a path to making VM migration work between ExoGENI sites with OpenStack (could not figure it out with Eucalyptus).
}}}
 * ~~4. When will ExoGENI racks support xCAT-based bare-metal node allocation?~~ (chaos)
{{{
My hope by GEC13
}}}
 * ~~5. How do bare-metal images get vetted?~~ (hpd)
{{{
TBD

Same as S.27, closing this one, Adam will follow up.
}}}
   * ~~5a. Given that VM images are unvetted, why vet bare-metal images?~~
{{{
Security concerns. Also bare-metal images are harder to prepare. Mistakes will mean users occupy fairly limited bare-metal resources just learning how to boot them.

Same as S.27, closing this one, Adam will follow up.
}}}
 * ~~6. Can we have more detail about disk images:~~ (jbs)
   * ~~6a. How are central images selected? Is there a central repository?~~ (jbs)
{{{
For vetted images for bare metal. There may also be a small informal repo for sample VM images.

For 6a-6b-6c. ImageProxy can fetch images given a URL. So people can put their images anywhere. We have software (pod) to make it easy for users to upload images and share them. The user interface is a little rough, and it is not quite deploy-ready, but it could be used. Ilia's concern (I think) is that we don't have budget to run and manage a repository server with a lot of disk. But GPO could certainly host it.

13) Building new VM images takes work.
(Q6) You have to add NEuca and maybe something OpenStack? This was hard with Eucalyptus, but maybe it is easier with OpenStack? Their answer that this is documented elsewhere isn't terribly reassuring (since the Eucalyptus documentation wasn't enough for Tom). Do we want to check their list of images? Have them add 1 or 2? Have them collect/edit documentation for this process?

Well, OpenStack may be a lot better for this --- we simply don't know.

IMHO, the right answer to "this is documented elsewhere" is, "great, then it should be easy for you to make a wiki page pointing to usable procedures elsewhere".

Well, actually: I said that, and then thought about the RENCI integration of external and internal documentation for ORCA/NEuca, which I have not found all that readable when I've tried to use it. So maybe we'd rather have them duplicate the steps that an experimenter would use to create an image? I'm not sure here.

I predict that we aren't planning to host a repository server. Are we? If not, do we think that someone else is? Do we want to push RENCI to do that?

I've also rephrased some of the questions a bit from their original forms.

6a. How are central images selected? Is there a central repository?

Answer: RENCI has a central repository at http://geni-images.renci.org/images/, which ExoGENI will use too (or maybe a subdirectory, or some such). Images for that repository must be reviewed by RENCI, the GPO, or our delegate. All vetted bare-metal images will live here, and a small number of commonly-used VM images could be hosted here too. RENCI or GPO will put together a nicer index page (it's currently just an Apache DirectoryIndex listing, with no comments or explanations).

This is correct, although as far as the bare-metal nodes are concerned the images will be cached in each rack and the booting will happen from there. I have put together a small page listing available VM images: https://geni-orca.renci.org/trac/wiki/neuca-images
}}}
   * ~~6b. Are there default images hosted at RENCI? What are they?~~ (jbs)
{{{
We have a few on http://geni-images.renci.org/images/

I've also rephrased some of the questions a bit from their original forms.

6b. Are there default images hosted at RENCI? What are they?

Answer: The exact images haven't been specified, but there isn't in principle any reason why we can't publish any images that we decide we want, within disk space limitations. We (GPO) will presumably use our rack to come up with some, probably focusing on modern and stable versions of Ubuntu and Fedora/CentOS.

Yes and we encourage multiple locations (URLs/web servers) from which the images are served. A directory listing them can be stored in one place.
}}}
   * ~~6c. Will RENCI also store some user images?~~ (jbs)
{{{
Only a few.

I've also rephrased some of the questions a bit from their original forms.

6c. Will RENCI also store some user images?

Answer: RENCI will only store experimenter-created images if they've been reviewed (see 6a), but ImageProxy can fetch and use an image from any experimenter-supplied URL, and RENCI has software that makes it easy for experimenters to upload images and share them, although it's not quite deployment-ready yet.

Yes. Duke team is working on POD (Persistent Object Depository) that can fulfill this role. I repeat that this is an optional component - a user can create an image and serve it from *any* web server.
}}}
   * ~~6d. Will there be instructions for building custom images?~~ (jbs)
{{{
For VMs yes, although basically OpenStack, Eucalyptus and Amazon have pretty extensive guides on how to do that.

I've also rephrased some of the questions a bit from their original forms.

6d. Will there be instructions for building custom images?

Answer: RENCI will publish instructions for building VM images, and there are good general docs available from OpenStack, Eucalyptus, and Amazon too.

Here are the current instructions: https://geni-orca.renci.org/trac/wiki/NEuca-guest-configuration
}}}
   * ~~6f. Must the experimenter add NEuca?~~ (hpd)
{{{
NEuca-py tools *should* be added to the image so that post-boot configuration (IP address assignment to interfaces and post-boot scripts) is done. Without them, bare interfaces will still be created based on the NEuca INI script generated by ORCA for the desired topology, and the user would have to configure them manually.
}}}
 * ~~10. Can we have more information about how the IP Address proxy options in the table on p. 4 work? Do the proxies expose all ports or just ssh?~~ (jbs)
{{{
Right now only SSH. The plan is to add the ability for the user to ask to expose some port ranges in addition to that. It's on the todo list and is not complicated.

12) They plan to NAT access to VMs, meaning that experimenter resources are only available via SSH or maybe in future specifically requested port ranges. (Q10 from original list) I think we want to know more here, and clarify our concerns and desires. Perhaps those 'future plans' are enough, but we need to know more (like a schedule).

10. Can we have more information about how the IP Address proxy options in the table on p. 4 work? Do the proxies expose all ports or just ssh?

Ilia had said "Right now only SSH. The plan is to add the ability for the user to ask to expose some port ranges in addition to that. It's on the todo list and is not complicated."

That sounds good; is there a timeframe for that? Just to make sure the goal is clear, the idea is that experimenters may want to run TCP or UDP services on their VMs, and make it possible for users to connect to those services via the Internet.

Answer: The plan is to add this ability, it's on the to-do list, and it'll be done by the time the first non-GPO/RENCI racks ship in April.

(Which of the options in that table are you planning to go with? Or will this be a campus-by-campus decision?
If the latter, which will you recommend? We prefer (C), which seems safe enough if the racks are behind a campus firewall, which we assume they will be.)

This is a campus-by-campus decision. We can deal with either B or C. If there are not enough public IP addresses, we have a proxy solution. If there are enough, they can be used as is.

10. In the IP Address proxy options in the table in section 2.1, at the top of page 5, do the proxies expose all ports or just ssh? (Experimenters may want to run TCP or UDP services on their VMs, and allow users to connect to those services via the Internet.)

Answer: The plan is to add this ability, it's on the to-do list, and it'll be done by the time the first non-GPO/RENCI racks ship in April.

ISSUE: Which of the options in that table are you planning to go with? Or will this be a campus-by-campus decision? If the latter, which will you recommend? We prefer (C), which seems safe enough if the racks are behind a campus firewall, which we assume they will be.

This is a campus-by-campus decision. We can deal with either B or C. If there are not enough public IP addresses, we have a proxy solution. If there are enough, they can be used as is.

Ok, that sounds good. One other question about this: The ExoGENI racks will not expect that they have a dedicated IP subnet for these interfaces, which they need to route; but will instead expect that they'll connect to an existing IP subnet (or a newly-created one, I suppose), which the campus will route, right? (That sounds fine; I ask because it came up when we were deploying the starter racks in Chattanooga and Cleveland, so it may come up with campuses too.)

We don't require an entire subnet. A list of available IP addresses is enough.
}}}
   * ~~10a. Do all outbound connections work for all table options~~ (jbs)
{{{
Not clear about the question

I think the question is: "For all the options in Table N (don't have the number handy, but we should cite it), is it the case that there are no restrictions on outbound connections?"

The original 10a said:

10a. Do all outbound connections work for all table options

I clarified that what we were getting at here was: For all three options in that table, is it the case that there are no restrictions on outbound connections? We assume not, but wanted to check.

10a. For all three options in that table, is it the case that there are no restrictions on outbound connections?

Answer: Correct, there are no restrictions; all outbound connections are permitted. (Although some could be blocked if we needed to for some reason.)

We will not block any outgoing connections on the racks. We cannot say anything for the campus.
}}}
   * ~~10b. How does the proxy work for OpenFlow?~~ (jbs)
{{{
I don't think they are related. Proxied IP connections go through the management net, so they don't touch the OF switch.

I think our concern was: If the FlowVisor is reaching out to experimenter controllers through the proxy, does that raise any issues? (Relative to the alternative of "the FlowVisor connects to experimenter controllers directly" -- which may in fact be what happens, if it's on the head node.)

Our concern here was: If the FlowVisor is reaching out to experimenter controllers through the proxy, does that raise any issues? (Relative to the alternative of "the FlowVisor connects to experimenter controllers directly" -- which may in fact be what happens, if it's on the head node.)

The original 10b said:

10b. How does the proxy work for OpenFlow?
I clarified that what we were getting at here was: If the FlowVisor is reaching out to experimenter controllers through the proxy, does that raise any issues? If outbound connections are unrestricted, and performance of the proxy is good, then this is probably not an issue. But we wanted to raise the question because it's a situation where dataplane traffic uses the management network, so if the proxy was expected to only have to handle experimenter SSH, that might not be sufficient.

This is superseded by 10a and 10c: There are no proxy/firewall restrictions, and no performance issues, that are unique to OF/FV.

10b. If the FlowVisor is reaching out to experimenter controllers through the proxy, does that raise any issues?

Answers: If outbound connections are unrestricted, and performance of the proxy is good, then this is probably not an issue. We should make sure to test this carefully with the initial GPO and RENCI racks, since FlowVisor can generate a lot of control traffic.
}}}
   * ~~10c. What is the expected performance bottle-neck for proxying?~~ (jbs)
{{{
Packet forwarding is relatively cheap at reasonable rates. The bottleneck will be the connection to the campus network.

This gets to 10c:

10c. What is the expected performance bottle-neck for proxying?

Ilia had said "Packet forwarding is relatively cheap at reasonable rates. The bottleneck will be the connection to the campus network."

Just to put some numbers on this, the theory is that the connection to the campus network will not be more than 1 Gb/sec, and we think that the proxy can go at least that fast?

10c. What is the expected performance bottle-neck for proxying?

Answer: We expect the connection to the campus network to be 1 Gb/sec or less, and that the proxy can go at least that fast.

The answer is above - we don't think the head node will be the bottleneck. It will be the campus connection.
}}}
 * ~~11. Is ExoGENI software essentially Orca software? How do they differ?~~ (hpd)
{{{
Same

The software is ORCA (and associated stuff like ImageProxy and NEuca), but it is configured in a specific way, so we just say "ExoGENI" when we're talking about that configuration.
}}}
 * 12. ~~What happens to ExoGENI racks and/or rack functionality if RENCI suffers a network or service outage?~~ (ahelsing: watch this, but Ilia agreed to make deployment choices we wanted)
{{{
Should not be affected

For 12, 12b. Also, old actors cannot see new actors. Currently the actor registry uses SSL connections. If RENCI goes off the net then a site or SM will not be able to restart. AMs/SMs that are running will not be able to refresh their lists, so they won't accept any new actors. If the registry issued certs (easy with ABAC), then this problem would go away, but it would be harder to revoke...

This contradicts 12a.

1) Question 12: RENCI is a SPOF in your design, due to the RSpec conversion service and Actor Registry. It appears that a couple (minor?) changes would mitigate this risk. Let us know if we're off base here.
}}}
   * 12a. ~~Will the absence of the RSpec/NDL conversion service mean RSpec-related requests will not work?~~ (ahelsing: RSpec converter being duplicated on all racks)
{{{
Yes. We can host alternative translators in a number of places if it is a concern. We can host a translator on every rack if needed and configure its SM to talk to that translator. It is a simple stateless web-service.

2) RSpec conversion service is a SPOF. (Q 12a from GPO list) I think we'd like them to try running it elsewhere as well.
a) Make the URL a configuration item in racks
b) Test running it on the head node, to ensure no performance problems or library inconsistencies
c) Consider running a backup version of the service somewhere. GPO?

I think Ilia said this service is stateless and there's no issue running it on the individual racks. So I don't see any reason not to just run it on the individual racks, unless it's a serious resource hog.

This contradicts 12.
So that's not "no functionality will be affected". :^p (I don't think it's particularly important to call him on this, just mentioning it as a warning to us to keep our eyes open. :^)

I think we should ask them to have a translator for each SM, unless there's a significant cost to that (in which case we should ask them to clarify what the cost is).

a) Please install the RSpec conversion service on all racks, and make the URL for the conversion service be a configuration parameter. Be sure to test the load on the rack head node, once this and the OpenFlow pieces are running there.

No problems with 12 a or b - this is supported today and is a deployment-time decision.

12a&b is there now? (RSpec converter URL is a config param and actors on the rack communicate among themselves fine on restart)

Great. If you are comfortable with this deployment choice (run the RSpec converter on all racks), then please plan on it.

12a is there now because we have a way of statically specifying security associations between actors in a config file. The actor registry works on top of that, filling in whatever is missing. So we can configure the ORCA actors in a rack to know about each other statically without relying on the registry, and they will only learn from the registry about other racks.

Sounds great
}}}
   * 12b. ~~What impact will the lack of the ORCA Actor Registry have on racks?~~ (ahelsing: answered questions satisfactorily)
{{{
Everything will continue running. New actors will not be able to see old actors.

4) Actor registry is a SPOF. (Q 12b from GPO list) This is less worrisome. New actors would be cut off. Racks cannot restart successfully.

5) The actor registry shows topologies in NDL. Once Ad conversion works (GEC13 he says), we should ask them to include a link showing that in RSpec as well.

b) Please ensure that the 3 Orca actors on a rack can communicate with each other after rack reboot without re-talking to the Actor registry.
I.e., a rack should work as a stand-alone GENI AM even if RENCI is inaccessible.

No problems with 12 a or b - this is supported today and is a deployment-time decision.

12a&b is there now? (RSpec converter URL is a config param and actors on the rack communicate among themselves fine on restart)

Great. If you are comfortable with this deployment choice (run the RSpec converter on all racks), then please plan on it.

12b is trivial since the converter service can be run anywhere and its location is a configuration parameter for the rack SM.

Sounds great
}}}
   * 12c. ~~Any other impacts?~~ (ahelsing)
{{{
Can't think of any.
}}}
 * 13. ~~What would fail if the rack Orca XMLRPC interface were disabled? What does the Orca XMLRPC feature do? Is it critical to the rack functions or just another way to use it?~~ (ahelsing)
{{{
It is another way to use it. Nothing would fail, but we would like to keep it. It is integral to the actor (SM) so there is no way for it to fail independently.

We may plan to add some new management functions through the XMLRPC interface, so the answer to this question might change.
}}}
 * 14. ~~Define ORCA AM Delegation to a broker further---is it double delegated? How is it applied for local broker and ExoGENI broker?~~ (nriga)
{{{
Probably best to refer to https://geni-orca.renci.org/trac/wiki/orca-introduction

Double-delegated, but this would be site policy under site operator control. A site could reserve resources for local use by not delegating them. For example, they could buy more nodes and reserve them for local use.
}}}
   * ~~14a. If delegation to broker is a deployment time decision, what is the plan?~~ (nriga)
{{{
Delegation must occur for things to work; what is decided is how much to delegate. I'd say start with 50/50 for compute and probably 80/20 for vlans (local/global)

Resources at a local rack are delegated to *either* the local broker *or* to the ExoSM broker, i.e. the resources *are not* double delegated.
The original split will be 50-50, i.e. 50% of compute resources, vlan tags etc, will be delegated to the local broker and 50% to the ExoSM broker. The percentage is configurable and each admin can decide on a different split. The reconfiguration probably requires changes in a couple of configuration files and a restart of some (??) software. Tom believes that this might be more complicated since the broker has to address problems with existing tickets.

7) ExoSM owns half the racks. We may end up preferring to go direct to the racks. (Q 14a) We should have them document the process of changing that allocation, maybe even try it once, to be sure this isn't terribly disruptive or hard.
}}}
 * ~~15. What will the flowspace look like for ExoGENI OpenFlow slivers? If the flowspace is based on VLAN tags, will this still be doable if the OpenFlow switch runs in hybrid mode?~~ (hpd) (this is an adequate first cut answer and the use cases duplicate this - closing)
{{{
Here are FlowVisor commands (for two ports one vlan slice):

$ fvctl addFlowSpace 00:c8:08:17:f4:a6:6a:00 10 "in_port=23,dl_vlan=151" "Slice:ilia2=4"
$ fvctl addFlowSpace 00:c8:08:17:f4:a6:6a:00 10 "in_port=24,dl_vlan=151" "Slice:ilia2=4"

The term "hybrid mode" is not well-defined.

11) The use of OpenFlow vs VLANs, and the capabilities of the switches, seems an open and messy question. Josh/Niky/Nick/? need to follow up probably.
 - implications of hybrid mode
 - way to do an OpenFlow onramp
 - options instead of a NOX controller per VLAN
 - ways to use OpenVSwitch to do clever things
 - ...

This is what I expected, and should work fine, although we haven't personally tried it much. We could, when we get bamboo up and running again.

Does he mean that we haven't defined it well, or agreed on a definition, or something? Or that we don't know for sure what *IBM* is going to implement?
I think we have a good definition of "hybrid mode", although the definition on HP and NEC switches is different from what we think the IBM switches will do. But if he just means that we don't yet know for sure what IBM is going to do, then yes.

Do you mean that we haven't defined it well, or agreed on a definition, or something? Or that we don't know for sure what *IBM* is going to implement? I think we have a good definition of what we think we mean by "hybrid mode", although the definition on HP and NEC switches is different from what we think the IBM switches will do. But if you just mean that we don't yet know for sure what IBM is going to do, then I agree that this is an area with some question marks.

The definition is exactly the same, for this switch, between NEC and IBM (same hardware, same software). The fact that this switch is different from other NEC switches is beside the point.

In an OpenFlow world vendors are free to decide what they specifically mean by the word hybrid - all that hybrid means is that there is an openflow datapath instance and non-openflow instance on the same hardware, but how those instances interact is left to the vendor. The two most common implementations of a "hybrid" mode are:

 * VLAN-based hybrid mode - the switch handles all traffic tagged in a certain VLAN with instructions from the openflow instance. This often puts limitations on VLAN and QoS handling within the openflow instance.
 * Port-based hybrid mode - you literally just "slice" the switch as if it were more than one switch. Traffic is divided between non-openflow and openflow instance based on what port it comes in on or goes out on.

In both cases transitioning the boundary between the openflow and non-openflow datapaths is an implementation detail left to the vendor.

The IBM switch actually supports both modes currently, but the non-openflow datapath in hybrid mode is incapable of anything more than L2 MAC learning.
It's probably also worth mentioning that on this switch, while you can create your openflow instance with an id of 1 to 16, possibly leading you to believe that you could create up to 16 openflow instances, you cannot - you can only create 1 - if you make a new instance, it will replace the old one (per NEC).

We must be talking about different switches. The BNT switch that we tested and that is in our specs is a 10G/40G 48-port switch. NEC does NOT have it yet - I spoke to them about it. The NEC switch at GPO is not a BNT switch, since it is 1G/10G and BNT told me they do not have a 1G/10G implementation yet.

The BNT is an NEC PF 5820. My comments apply to that switch (not other NECs that implement different hybrid modes), and of course to hybrid mode in general (just to make sure we were all on the same page about what was generally available).

The BNT switch I tested cannot do VLAN based OpenFlow (yes, you actually configure one VLAN to be OpenFlow, but a VLAN is used only as a port-grouping mechanism; its tag has no meaning). The only mode the BNT switch supports is port-based separation (and right now all ports have to be on that vlan; hybrid mode is coming).

So, just to (try to) close the loop on this, all I originally wanted to address was Jeff's comment

15. The term "hybrid mode" is not well-defined.

and my belief is that we do in fact understand what "hybrid mode" means, even if we're still not entirely on the same page about whether the IBM switch is in fact a NEC PF 5820, or something else. Jeff, do you think there's still an open issue here about hybrid mode not being well-defined, or are you happy?

I was simply observing that there is no specification for "hybrid mode". I just want to be sure we define our terms. We're all in agreement about that, right? Nick said something about "different hybrid modes".

As I recall, the original question was:

15. What will the flowspace look like for ExoGENI OpenFlow slivers?
If the flowspace is based on VLAN tags, will this still be doable if the OpenFlow switch runs in hybrid mode?

I responded "hybrid mode is not well-defined" because I did not understand the second part of the question. If the question is still live, could I ask you to restate it more concretely? There seems to be some concern behind it, and I'm not sure what that concern is.

More broadly, we're still trying to figure out what the implications are of BNT's hybrid mode, and how to use it. I said something about that in my e-mail to this list on 1/12. I said:

With a better hybrid mode we might be able to stitch and use OpenFlow at the same time, but this kind of "real" (to me) hybrid mode is not in the foreseeable roadmaps for switch vendors.

The weak support for hybrid mode may turn out to be a pretty deep problem. We're still working through the implications. For example, Ilia has pointed out that we're not sure whether a controller can touch any traffic that enters and/or exits a non-OF-aware circuit provider or RON, since ports facing those networks weren't planning to be in OpenFlow mode. But that's a separate issue.

The answer in this case, given different hybrid modes, is generally thus:

 * If a switch implements a port-based hybrid mode (basically splitting the hardware into two switches along physical boundaries), then the openflow datapath is usually fully capable (with whatever the ASIC supports anyhow), which means that you can slice on VLAN tag in flowvisor.
 * If a switch implements a VLAN-based hybrid mode (where the discriminant to which software path controls the forwarding is the VLAN tag) then generally the openflow datapath is *not* permitted to match on or modify VLAN tags, which means that you cannot slice on VLAN tag in flowvisor.

For the switches we have picked, the hybrid mode is port-based, so we should have no problems slicing on VLAN tag. I tested it on the existing implementation (all ports had to be in OpenFlow mode).
And just to close the loop on this, this was exactly our concern; whether the FV will be able to see/modify VLAN tags when the switch is running in hybrid mode. It sounds like the switches that are provisioned for the racks will support port-based hybrid mode and thus will allow you to create slivers in the FV based on VLAN tags. However, it is probably good to keep this in mind when you are talking with the vendor to ensure that this will be possible in the new firmware that will support the hybrid mode.

Hmm, really? We haven't tried this, but I had thought that something like this would work: Say you've got a VM server, which provisions a VM to each of two experimenters, and gives each of them a virtual dataplane interface on a different VLAN, such that packets leaving the VM are tagged with that VLAN; but those virtual interfaces share a physical interface, which is connected to a hybrid-mode switch...

...Ah, ok, right: So if you want to slice by those VLAN IDs in FlowVisor, that port on the switch has to be an *access* port, in an OpenFlow-controlled VLAN, so it doesn't strip off the VLAN tags as they come in, and sends the tagged packets off to the FlowVisor. This is what you get for free with a pure OpenFlow switch; and you can do it on an NEC IP8800, but we think you *can't* do it on an HP. (And I don't think we've tested performance on the NEC, have we?)

Anyway, probably not relevant to ExoGENI, with port-based hybrid mode; apologies for the tangent. :^)

Just to continue this tangent for one email more... The HPs can do this, it's what they call aggregation mode.

Rephrasing the question: If the general model is that the flowspace is sliced based on VLAN tags, will this still work if the OpenFlow switch runs in hybrid mode?

Answer: Yes, this will work, because in port-based hybrid mode, each port will either be OpenFlow-controlled or not, and the OF-controlled ones will not be part of a VLAN, they'll just be part of a datapath.
(Note that this might *not* work, however, in VLAN-based hybrid mode, because then you've got VLAN tags within a VLAN. This can probably be made to work, but it may require additional/different configuration. Shouldn't be an issue, but we should keep it in mind in the unlikely event that something changes from what we expect about how the switch does hybrid mode.) Our BNT/IBM switches will support port-based hybrid mode. We think it will work. }}} * ~~ 16. In figure 1 (ExoGENI Rack overview), how do you expect the Dataplane link to Campus OpenFlow network to be set up and configured? Will it require manual setup? Are there any implications? Do we expect FOAM to be used to request and approve these OpenFlow connections? ~~ (hpd) (this is an adequate first cut answer and the use cases duplicate this - closing) {{{ Affirmative on FOAM. We assume it is a connection (10G or downconverted to 1G) to some OF-enabled campus switch. This link is optional, and it doesn't have to go to an OF network. It can be used to pipe campus VLANs into the switch, for interconnection with slices under OpenFlow control. I'll poke at the details of this more in my use case follow-up. }}} * ~~ 17.
Is Internet2 Dynamic Network System (DYNES) supported? ~~ (hpd) {{{ DYNES uses ION (dynamic circuit service on Internet2). ION is supported because we support OSCARS - the software behind ION. }}} * 19. Stitching support: * 19a. ~~Confirm ION/Sherpa/OSCARS all comes through the RENCI SM?~~ (ahelsing) {{{ Yes }}} * 19b. ~~Can I connect to racks without the RENCI SM?~~ (ahelsing: but see g) {{{ If you have external stitching tool }}} * 19c. ~~RENCI SM and other racks coordinate via ORCA private interfaces?~~ (ahelsing) {{{ Yes }}} * 19d. ~~What resources are allocated to the ExoSM?~~ (jbs) {{{ See 14a plus all the intermediate network providers (LEARN, BEN, NLR, I2, ANI, NOX etc) 'Allocated' is a strange term to use. "visible" would be a better term. I think there's a difference between "visible", in the sense of "the ExoSM is aware of them", and "allocated" or "delegated", in the sense of "ExoSM manages them, and the local SM doesn't". Side note: Is there any chance we can reconcile terminology here, or are we going to be talking about "GENI AM, by which we mean ORCA SM", and "ORCA AM, which is not a GENI AM at all", and so on, for the rest of this project? :^\ We should have them document the process of changing that allocation, maybe even try it once, to be sure this isn't terribly disruptive or hard. I think we can assume that they'll document this, plan to try it out, and pester them if they don't document it. I think we're set here. }}} * ~~19e. Can experimenter go to racks separately for compute and then to the ExoSM just for the links? ~~ (hpd) {{{ This mode is not supported This is not supported BECAUSE it means that an AM could be asked to operate on the same slice by two different SMs/controllers, and ORCA does not support this. (It is a limitation we probably should not have.) }}} * ~~ 19f. Who is writing the Internet2 AM? 
~~(hpd) (RENCI wrote the code already - closing) {{{ The code is already there, we need a physical connection (at StarLight would be best). }}} * 19g. ~~Does Single rack manifest expose external VLAN tag so I can stitch?~~ (ahelsing: they'll do this by GEC13) {{{ We'll work on it. We can add an external-facing port as part of an internal slice. 1) They do not currently support stitching resources you get from a single rack. (Q 19g) We should ask them to: a) expose the VLAN tag they have allocated in the manifest b) accept a VLAN tag allocated elsewhere in their request, and use it if it is available (else fail) I think this is something we want to get them to do. Definitely (a), even better (b). Ilia said they would work on it. I want to press for it to be done by GEC13 and in the initial racks. 19g) Please support stitching of rack resources by GENI tools for this year's racks - ideally by GEC13. Specifically: - More important: Support a request to a rack for compute resources and a VLAN out, where the resulting manifest specifies the allocated VLAN, such that this can be stitched to the next aggregate in the network. - Less important: Accept a VLAN tag in a request for a VLAN that the next aggregate in the network has allocated, and try to use it at that rack if it is available (failing the request if it is not available is expected at aggregates that do not support VLAN translation). 19g.1 will be available. Do you think you'll have this circa GEC13? April? September? 19g.1 We will try to get it by GEC13. It's not that much work. Sounds great 19g.2 less likely to be available soon (please show me an RSpec request for this). 
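The two behaviors requested in 19g can be sketched as a single allocation routine. This is an illustration only, not ORCA code: the function name, the exception, and the set-based pool of free tags are all assumptions, and the choice of the lowest free tag is arbitrary.

```python
from typing import Optional, Set


class VlanUnavailable(Exception):
    """Hypothetical error for a VLAN request that cannot be honored."""


def allocate_vlan(available: Set[int], requested: Optional[int] = None) -> int:
    """Choose the VLAN tag for a rack's external-facing link.

    19g.1: with no requested tag, pick a free one and return it so the
           caller can report it in the manifest.
    19g.2: with a requested tag (allocated by the next aggregate), use it
           if it is free; otherwise fail, as expected at aggregates that
           do not support VLAN translation.
    """
    if requested is None:
        if not available:
            raise VlanUnavailable("no free VLAN tags")
        tag = min(available)  # arbitrary choice: lowest free tag
    elif requested in available:
        tag = requested
    else:
        raise VlanUnavailable(f"VLAN {requested} is not available")
    available.remove(tag)  # mark the tag as in use
    return tag


if __name__ == "__main__":
    pool = {100, 101, 102}
    print(allocate_vlan(pool))        # rack picks a tag for the manifest
    print(allocate_vlan(pool, 102))   # tag suggested by the next aggregate
```

The returned tag is what a manifest would need to expose for an external stitching tool to hand to the next aggregate in the path.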
19g.2: The closest I have is the sample requests to the PG aggregate in gcf/examples/stitching/libstitch/samples: http://trac.gpolab.bbn.com/gcf/browser/examples/stitching/libstitch/samples/utah-equest.xml?rev=795c50b86faf82f0fa8696d80005424e0b2089af Assume you were specifying a VLAN tag to the PG AM to stitch to ION: Within the stitching extension, at hop 3, you would specify both vlanRangeAvailability and suggestedVLANRange of the allocated VLAN tag. Presumably you could do something additionally in the element within the As I said, this is lower priority. 19g.2 we'll see Sounds great }}} * ~~20. Authorization: Orca APIs use what? Same as GENI? To what extent is this not exactly the same policies as the GENI APIs?~~(see TM comment below) {{{ Almost same as GENI. Some validity checks are disabled. I think Ilia interpreted this as a question about "Alt-G". Maybe it was. As for the internal APIs: In the ExoGENI configuration every ORCA AM will trust every registry-approved SM to validate a request before passing it to the AM. AMs only check that the SM is registry-endorsed. 6) Orca authorization (for their private APIs) apparently use similar checks to GENI. (Q20) We should ask exactly what they changed, to be sure it isn't worrisome. I don't have any real worries here though. This item can be closed. The ORCA APIs will effectively provide the same authorization as the GENI APIs by accepting identity certificates from known and trusted certificate authorities, namely the GPO and ProtoGENI CAs. While the GENI AM API requires credentials, there is no impediment to getting those credentials for any registered user today. It is nonsensical to require ORCA to build out additional infrastructure to require and honor those credentials through their own APIs. 
Hmm, so I have a question about this: If having a GENI user certificate is all you need in order to get an ExoGENI sliver via the ORCA API, does that mean that you wouldn't necessarily have a GENI slice that contains your sliver? If that's not correct, and you do have a GENI slice: Where does that GENI slice come from? If that is correct, and you don't have a GENI slice: Is that ok, or will it cause other problems? (e.g. with things that assume that any allocated resource is part of a sliver, which is part of a slice, which is owned by a user -- if there's no slice, that chain may break down.) It is correct that in that case there is no slice as far as the SA is concerned. This is where we get into the 'what is a sliver and what is a slice' argument. What ORCA creates are in fact slices, not slivers. We just call them slivers when the GENI AM API is invoked. I would say that anyone who cares about this should use the GENI AM API on ORCA SMs, and these problems go away. You get weaker stitching in that case. Tradeoffs, as usual. We agree on a base principle: ExoGENI will allocate resources only to registered GENI users who have been granted rights by a GENI-approved trust root to allocate resources on GENI. Is that enough of an answer to close down this item? In the short term, if the only way to get proof that a given user is authorized to allocate resources on GENI is for that user to obtain CreateSliver rights to some slice (any slice) on GENI, then that is what we will do. Once we have that proof, we can use it in various ways. Ideally there would be better ways to get that proof, and so the answer may change if and when better ways become available. As for "GENI slice that contains your sliver", well...I think it's a long discussion what that means, exactly. Please, let's not have that discussion now. TO-ADD MORE LN }}} * ~~21. What is the Actor registry used for?
Is this an alternative non-GENI way to authenticate inter-rack communications?~~ (hpd) {{{ Yes, it is a way to manually approve actors joining into ExoGENI. There is no GENI way to authenticate inter-rack communications. }}} * 22. ~~OpenFlow rack as onramp: when will this be supported?~~ (jbs) {{{ Jeff Chase wants it yesterday. Realistically some time this year. There is an MS student (Ke Xu ... Jessie) working in this area at Duke. We are not sure what she can do yet. We might be asking her to try some stuff on the GENI OF resources. I think on-ramp is easy, but OpenFlow has to work. That's the hard part. I've heard the phrase "onramp" a couple of times, but don't know exactly what it means. Is it just use case 4? If not, is there a definition somewhere? On-ramp is a stitch between private links owned by two different slices, by mutual consent of both slices. It is the moral equivalent of slices peering their virtual networks. }}} * ~~ 22a. Are there conflicts between FOAM and Orca mechanisms to create FlowVisor rules? ~~ (hpd) This question is being replaced by the new GPO question 29. {{{ Hopefully FlowVisor will flag it. No. 9) Orca uses FlowVisor directly, opening up the possibility of conflicts between FOAM and Orca. The solution would be for Orca to use FOAM, once a pluggable API there exists, but it doesn't yet. ''We need to keep an eye on this.'' I'm pretty sure it won't, which is a point in favor of ORCA -> FOAM -> FV rather than ORCA -> FV + FOAM -> FV. I'm pretty sure FlowVisor won't -- in particular, there's nothing fundamentally wrong with creating flowspace rules in FV that describe overlapping flowspaces. But you usually don't get what you want, especially if "you" are multiple experimenters who aren't even aware of each other's slivers. To my mind, this is a point in favor of having ORCA talk to FOAM, rather than having both ORCA and FOAM talk directly to FlowVisor.
That said, it's certainly possible for both ORCA and FOAM to find out the flowspace on the FlowVisor, and to use that to avoid allowing people to have overlapping flowspace. But they have to actually do that explicitly. }}} * ~~ 23. A few monitoring items are marked incomplete: Dates& plans? ~~ (hdempsey) * ~~23a pubsub event feed to GMOC (is GMOC ok with your plan?): (Chaos is tracking in GST [ticket:3369])~~ {{{ GEC13 Per Jon-Paul, GMOC originally proposed the pubsub model and will support it, details TBD. FYI: GMOC has agreed to send someone to tomorrow's monitoring call who can talk about the pubsub proposal they made for RENCI. So we should know more by then about what has been proposed for slice monitoring data submission (question 23A), what the timeframe is, etc, and ideally that will lead to us having a more intelligent opinion about whether we like it. 23a. Submitting per-slice/sliver relational data to GMOC: This has been discussed a little bit on the monitoring@geni.net list. It sounds like both ExoGENI and GMOC can provide support to get an approach involving XMPP's pubsub protocol working. GPO will work with GMOC to make sure that the per-slice/sliver data which is stored can be used across experiments with ExoGENI and non-ExoGENI pieces. So i think we are all set on this question. I think we're all on the same page here. I want to follow up again on these to make sure we are working from the same assumptions. If ExoGENI and GMOC can negotiate an approach using pubsub or direct nagios communication, and both sides can do the respective work to get that interface running, that's fine with GPO. Our desire here is just to have some useful data be submitted from each rack to GMOC (or polled from each rack by GMOC) reasonably often. However, the interface which already exists is the XML-over-HTTPS data submission API developed by GMOC. 
This API is active and usable for time-series (operational measurement) data right now, and GPO and GMOC are working to make sure it will be ready for relational data (e.g. slice metadata) within the next couple of months. ExoGENI racks will need to interface with these APIs as a minimum offering. Again, if ExoGENI and GMOC do the work between you to support something you both like better, that's very likely to be a fine substitute. Otherwise, the XML-based API is a "least common denominator" solution, and RENCI should submit data using it. My main concern is GMOC's work so far has been outside of the GENI I&M framework which may or may not be a good idea. I'm attempting to bring everything under one roof by using the XMPP bus that will also be used for GENI I&M to submit GMOC-relevant data and see if it flies. Ilia, This is not going to work. The GENI I&M framework is not fully implemented yet, and may not be for considerably longer than it takes for us to field the racks this year. The GMOC interface predates the I&M framework, and has already been in use in the mesoscale for well over a year. No GMOC-relevant operations data should be submitted via I&M, which is specifically for experimenter-relevant data. The GMOC has been part of the I&M project in order to make sure that it was possible to distribute operations data into the I&M framework if there was demand for that among experimenters. It may be that the GMOC evolves their interface to be more like the I&M interface for simplicity and ease of programming, especially for the aggregate providers. However, we can't count on that for Spiral 4. To clarify: we have no objection to the use of XMPP per se --- GMOC's generic interface for data submission uses XML sent via HTTPS, but, if data is collected or sent some other way, that's fine. As Heidi said, we just want to see operational data transmission from each rack to GMOC's operational monitoring database during this spiral. 
I believe using the existing data submission API is the most straightforward way to do that this year. However, if the anticipated benefits of XMPP outweigh the extra work, and the work can be done this spiral, that's fine. There are enough commonalities between what GMOC wants and what the experimenters want that their work and I&M will converge. By March we should have an XMPP bus with GENI authn/authz available (as part of our IMF project with Harry) via which we should be able to make data available to GMOC and at the same time make it available to anyone else with proper GENI credentials. I can think of two concerns about using this approach in the short term: 1. "Proper GENI credentials" sounds to me like an experimenter being able to get access to his own experiment's data. That's not the same thing as operational monitoring data [1], which isn't going to be per-sliver, but rather might contain some amount of metadata about all slivers, and information which is not per se about slivers. You're assuming experimenters only need access to data from their own slices. I disagree. There are circumstances where an experiment in GENI means looking at other experiments. Do you have an implementation for allowing a non-experimenter operations group like GMOC read-only access to broader monitoring data using a GENI credential? Working on it. 2. On the GMOC end, there needs to be code to use this XMPP interface, acquire data, and put it into some operational database at http://gmoc-db.grnoc.iu.edu/. If this is going to be done using a new interface rather than an existing one, someone will need to write that code. We have example code that can get data from Pub/Sub. }}} * ~~23b thing with U Alaska from individual VMs~~: VMI would be nice to have, but is not critical for rack design, so i think we are satisfied with what we know here. (Chaos) {{{ Need to ask them Question 23B says "thing with U Alaska from individual VMs". What does this question mean, and who asked it? 
I am pretty sure it was not me, though a lot can happen in two days I believe that RENCI wants to use the U. Alaska "virtual machine introspection" software to provide monitoring of what's happening inside individual VMs (based on my quick read of the design doc). Ah, thanks Chip. That's helpful, and, indeed, i see this in section 4.3 of their proposal: <> To me, this sounds useful if they can do it, and not critical if they can't. Who put it on our list of design review questions, and what is your concern about it? Aaron put it on our list and Ilia had no more information about it at this point--I think Aaron was just pointing out that the information about it was incomplete in the document, which is true. I've talked to Brian from U of A a few times about the VM introspection software and think he would be a good collaborative addition to the team if he and Ilia work this out. As you say, it won't be critical if they don't. I put it on the list, and I just wanted a date. They said, in not so many words, 'we want to use this'. So I wondered when. The VMI project has a milestone to demonstrate VMI in a Eucalyptus cluster environment at the March GEC. They are working with Renci on getting this technology into Eucalyptus clusters that federate using the Orca framework (to be demonstrated in July). }}} * 23c ~~Nagios interface to GMOC (is GMOC ok with your plan?)~~: (Chaos, GST [ticket:3369]) {{{ I don't know if GMOC is fully OK with it, but we prefer Nagios to the homegrown solution. We don't yet know what will be doable for GMOC. Mitch McCracken, the GMOC staffer who maintains the time-series data submission API, will be on our monitoring call tomorrow afternoon. If you or Jonathan Mills or someone else from RENCI would like to be on that call and talk in more detail about what you'd like to do, what GMOC would need to do to support it, and why you prefer it, that would be a good next step. 
Mitch is new to maintaining the API and coming up to speed, so we won't make any decisions on the spot, but it would be a good forum for sharing information and understanding a bit better what you'd like to do. Let me know if you need more information about the call --- i know at least Jonathan has attended it before. With GMOC's Mitch's permission, i asked RENCI to send someone to the Friday call to talk about operational time-series/event? data submission (question 23C). Mitch's short answer was that he probably wants something more centralized than whatever RENCI is proposing, but he's interested in understanding more about what RENCI actually wants, and so am i, so hopefully they will show up and talk about it, and, again, we can use that to figure out whether we like their answers. Jonathan Mills (who was present at the review) also attends the monitoring calls. He is in charge of modifying Nagios to our needs. Yes, and I am planning to attend the next ExoGENI monitoring call. At the monitoring call, Jonathan and Mitch agreed that it would be easy for RENCI to submit data from Nagios via the GMOC time-series data submission API. They have started working on this on monitoring@geni.net. I am satisfied that there is general agreement about what to do. 23c. Nagios interface to GMOC: This has also been discussed on monitoring@geni.net, and the consensus here is that ExoGENI and GMOC will work together to write a stub which sits on each rack's Nagios aggregator, and submits information from Nagios's status.dat file to GMOC via some data exchange format (probably the GMOC data submission API, but if ExoGENI and GMOC prefer something different, that is fine). This sounds like it should not be a lot of work beyond what has already been done to get Nagios working on the racks, and it's already being worked on. So that's great too. I would suggest using the same XMPP pubsub mechanism as is used for manifests. 
They will also have access to the browser interface in Nagios. As long as each rack is submitting its own operational time-series data directly to GMOC, i think whatever mechanism is easiest for Jonathan and Mitch is fine. Since operational data might be used to help during an outage, we do want to make sure as much data submission as possible continues to work during an outage. I have one thing to add, as an alternative way of getting the information out of Nagios itself, which is to directly query the Livestatus broker. Livestatus is a Nagios broker module which can be queried in various ways. Broker modules are loaded into Nagios when the daemon launches, and thus they have direct access to its internal memory tables. Queries in this manner are the fastest because no intermediate action occurs (for instance, writing to status.dat is not necessary; neither is writing to a SQL db with NDO). Because it is reading object status directly from Nagios's memory, the results are always 100% up to date, with no time delay. The broker module is already installed on any Nagios installation that I set up, because it is a required component of Check_MK. It can be queried from either a TCP or Unix socket. Details can be found here: http://mathias-kettner.de/checkmk_livestatus.html While this method of "getting at the data" has lots of upsides, it could require a rethinking of how the pubsub model would fit. It necessarily shifts us from parsing/translating a file on disk (status.dat) to having to actively query something. Querying/polling won't be a problem. This sounds like an interesting approach. I want to follow up again on these to make sure we are working from the same assumptions. If ExoGENI and GMOC can negotiate an approach using pubsub or direct nagios communication, and both sides can do the respective work to get that interface running, that's fine with GPO.
Our desire here is just to have some useful data be submitted from each rack to GMOC (or polled from each rack by GMOC) reasonably often. However, the interface which already exists is the XML-over-HTTPS data submission API developed by GMOC. This API is active and usable for time-series (operational measurement) data right now, and GPO and GMOC are working to make sure it will be ready for relational data (e.g. slice metadata) within the next couple of months. ExoGENI racks will need to interface with these APIs as a minimum offering. Again, if ExoGENI and GMOC do the work between you to support something you both like better, that's very likely to be a fine substitute. Otherwise, the XML-based API is a "least common denominator" solution, and RENCI should submit data using it. My main concern is GMOC's work so far has been outside of the GENI I&M framework which may or may not be a good idea. I'm attempting to bring everything under one roof by using the XMPP bus that will also be used for GENI I&M to submit GMOC-relevant data and see if it flies. Ilia, This is not going to work. The GENI I&M framework is not fully implemented yet, and may not be for considerably longer than it takes for us to field the racks this year. The GMOC interface predates the I&M framework, and has already been in use in the mesoscale for well over a year. No GMOC-relevant operations data should be submitted via I&M, which is specifically for experimenter-relevant data. The GMOC has been part of the I&M project in order to make sure that it was possible to distribute operations data into the I&M framework if there was demand for that among experimenters. It may be that the GMOC evolves their interface to be more like the I&M interface for simplicity and ease of programming, especially for the aggregate providers. However, we can't count on that for Spiral 4. 
To clarify: we have no objection to the use of XMPP per se --- GMOC's generic interface for data submission uses XML sent via HTTPS, but, if data is collected or sent some other way, that's fine. As Heidi said, we just want to see operational data transmission from each rack to GMOC's operational monitoring database during this spiral. I believe using the existing data submission API is the most straightforward way to do that this year. However, if the anticipated benefits of XMPP outweigh the extra work, and the work can be done this spiral, that's fine. There are enough commonalities between what GMOC wants and what the experimenters want that their work and I&M will converge. By March we should have an XMPP bus with GENI authn/authz available (as part of our IMF project with Harry) via which we should be able to make data available to GMOC and at the same time make it available to anyone else with proper GENI credentials. I can think of two concerns about using this approach in the short term: 1. "Proper GENI credentials" sounds to me like an experimenter being able to get access to his own experiment's data. That's not the same thing as operational monitoring data [1], which isn't going to be per-sliver, but rather might contain some amount of metadata about all slivers, and information which is not per se about slivers. You're assuming experimenters only need access to data from their own slices. I disagree. There are circumstances where an experiment in GENI means looking at other experiments. Do you have an implementation for allowing a non-experimenter operations group like GMOC read-only access to broader monitoring data using a GENI credential? Working on it. 2. On the GMOC end, there needs to be code to use this XMPP interface, acquire data, and put it into some operational database at http://gmoc-db.grnoc.iu.edu/. If this is going to be done using a new interface rather than an existing one, someone will need to write that code. 
We have example code that can get data from Pub/Sub. }}} * 24. Rspec support questions: * ~~24a. When will you support GENI v3 RSpecs - part of the GEC13 completion? ~~(hpd) {{{ That's the goal. The differences are not that significant. }}} * 24b. When will you support what RSpec conversions? Can you send sample manifests and advertisements? When can we test? {{{ Will send separately. Testing can be done now (Luisa has). 3) Ilia offered sample manifests. We should ask for those - to start checking that they include what we expect. (Q 24b) }}} * 25. Have you tested performance of a single management node with a full load of running software (FV and OpenStack/Euca head and GENI services and monitoring etc.? Or Is FV on a separate VM? {{{ Everything but the OpenFlow components. It's not much of a load. Supporting FV still an open question in terms of performance needed. To Nick: Do you have enough info yet to know whether FV on a VM in ExoGENI rack will be OK at this point? Have you given Ilia any more info about what FV needs for acceptable performance in GENI? I have no further information from Ilia, and he has not requested any information from me in reference to this. My understanding is that he is aware that they have not characterized the FlowVisor workload and still need to do so. (I also have other concerns about software interaction and compatibility as expressed on the mailing list). }}} * 26. On layer 2 dataplane connectivity testing: Do you envision a long running slice where we can allocate VLANs to test as needed? What happens if the AM is unreachable/down? {{{ If AM is unreachable, you can't provision a VLAN. When the VLAN is up it should stay up regardless of AM status. We should bang on this a little more, and understand whether our monitoring stuff will in fact be in a slice, or a non-GENI thing. (I don't feel strongly about it, and don't recall now if we concluded that we preferred one or the other.) 26. 
Dataplane reachability testing: We think it would be a good idea to have two types of tests to go with the two types of VLANs: * Where the ExoGENI AM is used to provision a VLAN, we'd like to see a test which stands up a VLAN, verifies that it can be used, and reports to monitoring on whether that entire system (which includes the AM, of course) is healthy. I believe you discussed doing something like this already: does what i just said sound similar to what you have in mind? I think so. A simple reachability test would not be difficult to do, but currently is not a high priority. * Where an ExoGENI rack is going to be connected to a static (long-standing) VLAN outside of the rack, e.g. to the shared mesoscale VLANs or to a longstanding L2 connection to non-rack resources at a particular site, we'd like to see a static test interface on each VLAN which could be used to verify connectivity. It would be ideal if the test interface were non-OpenFlow-controlled on the rack, so that it could be used entirely to test "is this link up?". Does this seem reasonable? Yes, but not with the current version of the switch, which is OpenFlow all-or-nothing. When we have the hybrid mode towards the end of the year this should be possible. Good point --- if everything on the dataplane switch is OpenFlow-controlled, then a non-OF-controlled testpoint is not possible. However, i think it would still be possible to place a static test interface on a static VLAN which reports to e.g. FOAM. The external VLAN connection can be established and tested regardless of the hybrid-ness of the switch. If the goal is merely to establish whether a link on an external interface is up or down, this can be done within or without openflow, regardless of the state of the switch (the ability of a switch to determine port up/down status, electrical, protocol, or administrative, is independent of the openflow implementation). If you are truly determined to require a non-openflow port to test (electrical?)
connectivity, you can do that with the current BNT firmware as well (ports can be configured to be non-openflow, they just have no real features beyond that of a standard L2 learning switch, but those would suffice for this purpose), but there's no particular reason why this port can't be openflow controlled. I think Chaos wanted to have an interface with an assigned IP address internal to the switch that can be used for L3 reachability testing. L2 reachability testing, really. Sorry if i've been unclear: the goal here is to have some tools to detect problems with shared VLANs. If an ExoGENI rack participates in a core VLAN, but can't reach other things on that VLAN, then an experimenter might want to know something is wrong before trying to provision something attached to that VLAN on that rack. In practice, you don't want to provision too many different test resources, and there is a tradeoff between having a test which is most similar to an experiment ("if this test works, it is very likely that this resource is healthy enough for an experimenter to use") vs. having a test which gives you more information about what is wrong ("this test does not depend on OpenFlow, so, if this test fails, it tells us there is a connectivity problem caused by something other than OF on this rack"). I think it's fine with us to have a test which runs in a sliver, but if we're testing a static VLAN to which an experimenter would connect by e.g. reserving resources using FOAM, the resource test should use a FOAM sliver. An advantage to setting up a test interface which doesn't depend on a local sliver, is that people can use that interface for reachability testing without having to maintain that local sliver. But, in fact, all of our core testing uses slivers somewhere --- there's no good way around that. It just has to be feasible for that sliver to be used for frequent/automated testing. 
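A minimal sketch of the kind of frequent, automated check discussed above, assuming each static VLAN has a test interface with an IP address that a monitoring host on that VLAN can ping. The endpoint names and addresses are made up, and the ping flags are the Linux iputils ones (-c count, -W timeout in seconds); none of this is an existing ExoGENI, FOAM, or GMOC tool.

```python
import subprocess
from typing import Callable, Dict, List, Optional, Sequence


def build_ping_command(address: str, timeout_s: int = 2) -> List[str]:
    """Build a single-packet ping command (Linux iputils flags)."""
    return ["ping", "-c", "1", "-W", str(timeout_s), address]


def run_quietly(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit status."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def check_vlan_endpoints(
    endpoints: Dict[str, str],
    run: Optional[Callable[[Sequence[str]], int]] = None,
) -> Dict[str, bool]:
    """Probe each test interface once and report which ones answered.

    `endpoints` maps a hypothetical label (e.g. a VLAN name) to the IP
    address of the test interface on that VLAN. `run` can be replaced to
    exercise the logic without sending any packets.
    """
    runner = run if run is not None else run_quietly
    return {
        name: runner(build_ping_command(address)) == 0
        for name, address in endpoints.items()
    }


if __name__ == "__main__":
    # Hypothetical addresses of test interfaces on two shared VLANs.
    print(check_vlan_endpoints({"vlan-a": "10.42.1.1", "vlan-b": "10.42.2.1"}))
```

A probe like this says only that the far end answered. As noted in the thread, it cannot by itself separate an OpenFlow problem from another connectivity problem unless the test interface is outside OpenFlow control.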
The easiest way to test L2 reachability is to configure L3 addresses on two endpoints of a VLAN. With traditional switches you can configure an interface on a VLAN and assign it an IP address (internal to the switch). I thought that is what you meant. We do it here periodically (manually) between our switches.

Sure, I would use FlowVisor or FOAM for the first test (no sliver required to know what ports are up, or whether the datapath seems to be available at all), and SNMP for the second (also no sliver required). If you want this information centrally (via GMOC or something) we should probably offer a read-only FOAM monitoring API, since getting the information via the existing admin API seems like a bad idea, but that's a trivial problem.

Sorry i never got back to this. This kind of approach is fine with us for testing of static VLANs, and is indeed what we had in mind.
}}}
 * ~~27. We want to share the final ExoGENI rack parts list and rack diagram (when you finish it) on the GENI web site. OK with you?~~ (Luisa Nevers, see notes below)
{{{
It's available here: https://docs.google.com/document/d/1hzleT6TNmiDb0YkkgqjXxPFJ37P4O6qApLmXgKJHBZQ/edit

Collected remaining information for the parts list. See: http://groups.geni.net/syseng/wiki/GENI-Infrastructure-Portal/GENIRacks#ExoGENISpecifications

Checked on wiring rack diagram and found from Brad Viviano that the diagram will be available after the GPO rack is assembled. Email exchange on Feb 13 to gpo-infra included a wiring diagram which is attached as file named Rack-diagram-wiring.xls
}}}
 * 28. How does a site admin control resources which have been allocated to ExoSM and are controlled centrally by RENCI?
{{{
ORCA configuration files. There is an actor configuration file (XML) and a resource description file (scary NDL-OWL).
}}}
 * ~~29. In the design review, ORCA indicated that they started a NOX instance to communicate with FlowVisor to communicate ExoGENI OpenFlow requests.
Nick Bastin said he would like to replace this with an API to FOAM. Nick and Ilia promised to follow up. This question should also address conflicting FlowVisor requests captured in Q 22a.~~ (jbs)
{{{
Then there was 22a:

22a. Are there conflicts between FOAM and Orca mechanisms to create FlowVisor rules?

I had said "To my mind, this is a point in favor of having ORCA talk to FOAM, rather than having both ORCA and FOAM talk directly to FlowVisor." Nick also liked this idea; have you guys talked about it any further?

The implementation we have today talks to FlowVisor. We prefer this method because it bypasses the need to create OF RSpec. Also, based on discussions with Nick, he is unwilling currently to modify FOAM RSpec to what we need.

Hmm, I think there may be some confusion about this: I don't think that Nick was proposing that ORCA should write and submit rspecs to FOAM, but rather that ORCA would talk directly to a FOAM API. He plans to write a plugin API, which others (or he) can then use to write plugins to talk to FOAM via arbitrary custom APIs; most immediately, he says he'd be happy to write a custom FOAM API for ORCA. But rspecs don't enter into it at all in any case.

The custom ORCA - FOAM API could look pretty much however you want. For that matter, it could be identical to the FlowVisor XMLRPC API -- but going through FOAM means that you can be aware of other FOAM slivers, and that FOAM is aware of ORCA-created slivers, for free.

29. Will ORCA talk directly to FlowVisor, or to FlowVisor via FOAM?

require some FOAM development work, but Nick is eager to work on this.

ISSUE: Nick and the ORCA folks should talk about timeframes, to make sure that Nick can do what the ORCA side needs, in time for them to use it, but he doesn't think it would be a problem to get it done very quickly.

The implementation we have today talks to FlowVisor. We prefer this method because it bypasses the need to create OF RSpec.
Also, based on discussions with Nick, he is unwilling currently to modify FOAM RSpec to what we need. That's not to say I'm blaming Nick - he has his reasons. FlowVisor at this time presents what appears to me the most stable and easy to program interface. As the code and RSpec evolve in the future we can revisit this question.

Hmm, I think there may be some confusion about this: I don't think that Nick was proposing that ORCA should write and submit rspecs to FOAM, but rather that ORCA would talk directly to a FOAM API. He plans to write a plugin API, which others (or he) can then use to write plugins to talk to FOAM via arbitrary custom APIs; most immediately, he says he'd be happy to write a custom FOAM API for ORCA. But rspecs don't enter into it at all in any case.

I may be mistaken about the current state of FOAM. I thought it supported GENI AM API, and that requires RSpec. Is there another interface and what is it?

The point here is there *can* be another interface (there are already 4 API interfaces, soon to be 5 when we add GENI AM API v2), so there's not really any problem adding one for exogeni (or, alternatively, just making one that looks like the flowvisor XMLRPC interface).

I don't have a problem with that when it becomes available.

The custom ORCA - FOAM API could look pretty much however you want. For that matter, it could be identical to the FlowVisor XMLRPC API -- but going through FOAM means that you can be aware of other FOAM slivers, and that FOAM is aware of ORCA-created slivers, for free.

I expect a fairly hard partitioning between label spaces that FOAM operates in and ORCA operates in, so this may not be a serious issue, but it may be worth discussing.

The intention would be for FOAM to be aware of the sliver URNs if at all possible. Obviously this wouldn't be possible out of the box if we merely emulated the FV XML-RPC API, but if we added an extra parameter or two to CreateSlice it would be relatively easy information to provide.
I don't think I followed that. Which sliver URNs?

We (GPO) think this would be worth spending a ten minute phone call to talk about, and we'd like to help facilitate that (whether you end up going this route or not -- we just want to promote communication); any chance you'd be available later this afternoon? Or maybe Monday?

That's fine. Next week is better. Please use Doodle.

The URNs for the slivers which have associated FlowVisor slices. (Such that if one asked FOAM, it would know about all the slivers which had resources allocated…failing that, at least user URNs).

I've created http://www.doodle.com/r58q8esy4vawn7q8 as a Doodle poll suggesting times this afternoon and tomorrow; I think Ilia and Nick are essential, and anyone else who's interested could listen in. (And anyone else who Ilia thinks is essential from the ExoGENI team -- Ilia, let me know if you have anyone else in mind.)

Note: the message below is in response to a much earlier comment from Ilia which stated: "I don't have a problem with that when it becomes available."

When which becomes available? We're looking for some input here - is the path of least resistance to emulate the FV XML-RPC API, or should we develop something more specialized for exogeni?

To Nick: I've seen several emails on the exogeni-design list, and it sounds like you, Josh and Ilia are planning a teleconf this week. Do you think you'll be able to put enough effort into the discussions to work out a rough agreement for a solution this week?

I believe this is conflating two issues:
1) They have a separate software stack (their AM, not NOX) which communicates with FlowVisor outside the visibility of FOAM to allocate virtualized resources
2) They have suggested using NOX to provide baseline control of their openflow resources for non-openflow experimenters. I think many people (myself included) believe this is a bad idea, and we should explore precisely what they are trying to accomplish and how to best execute that.
We are planning on discussing issue (1) this week, but there has been no further mention of issue (2). Mm, I'd forgotten about that part. Perhaps because I feel like "they're planning to provide a service that we think is a bad idea" seems like less of a problem (we can ask them to turn off that service) than "they're not planning to do something that we think they'll need to do". Nick, should we be more concerned about this than we are? I believe so. My general understanding is that because they can't run this switch in an "acceptable" hybrid mode (for varying values of whatever that is) they have identified a need to still be able to provide their non-openflow network service to experimenters, so they're going to run a controller to manage these slices and they've chosen NOX. This is at least my understanding. Ah, that sounds plausible. (I had lost the context here, and was thinking that they were talking about an even more optional service, which people who wanted to use OpenFlow could use if they didn't want to run their own controller; but in fact I think you're right, what they're talking about is something so that people who don't care at all about OpenFlow don't have to touch it at all.) However, I believe they should run in this mode all the time - using hybrid datapaths only creates problems and limits functionality for the openflow portion of the network. Could be; that sounds like a conversation we can have with them when the switch firmware allows hybrid mode. If the experiment with pure OF mode has gone well enough until then, it might be an easy sell -- and so that's some incentive to see the pure-OF way work well. 
That being said, I definitely don't think they should be running NOX to provide transport for non-openflow users - not particularly because this has anything to do with NOX so much as the fact that often when people say "run NOX" they mean run one of the NOX sample applications, which are not production applications (and certainly don't provide the functionality we would desire). Given Ilia's apparent lack of interest in writing any code to work with the openflow side of their rack, I highly doubt that they're intending to write a custom app for NOX to facilitate their use case.

That all sounds likely to me. This isn't something that we have a lot of experience with, because we've mostly been focused on supporting experimenters who (a) want to do nifty things with OpenFlow, and are thus writing their own controllers; or (b) just want a learning-switch controller, but are doing things on such a small scale that NOX 'switch' or 'pyswitch' is good enough for their needs. You mentioned "production applications"; do you have any insight into what *would* fit that bill, but not cost a lot of money? (Or time spent convincing a vendor (like BigSwitch or NEC) to donate a production controller, or whatever.) Is there in fact a better off-the-shelf solution than NOX 'switch'?

Note: Comment below after meeting with Ilia, Josh and Nick.

We did! And the main thing that we concluded is that Ilia doesn't think there's time before GEC 13 to change how ORCA talks to FlowVisor, so we're not going to try to throw together an ExoGENI-specific API to FOAM immediately. Instead:
 * Nick will continue to work towards the planned API plug-in layer for FOAM, which he thinks will be done by GEC 13.
 * The first two racks (RENCI and BBN) will have ORCA talking directly to FlowVisor, with the understanding that this may have some issues.
 * We'll aim to shift to a model where ORCA will talk directly to FOAM, after GEC 13.
(Or, talk more between now and then about whether this is a good idea -- Jeff raised some questions about this, and we generally agreed that what we all really want is for FlowVisor to have only one administrative master, but that it isn't fundamentally important whether that master is FOAM or ORCA... So if ORCA can do everything we need it to do -- including managing flowspace for non-GENI resources that aren't part of the ExoGENI rack -- then perhaps it makes sense to not run FOAM in an ExoGENI rack at all. But we think that this isn't a short-term solution, because it would require a way for those non-GENI resources to interact with ORCA to tell it what flowspace they wanted, and I don't think we have even an idea about what that would be.)

So, with that, I think question 29 from the original list is answered: In the first two ExoGENI racks (RENCI and BBN), ORCA and FOAM will both talk directly to FlowVisor; we'll continue to discuss between now and GEC 13 about how to narrow this down to having only one of them do that; and we'll aim to implement a single-master solution soon after GEC 13 (and definitely before any additional racks ship). Sound right? Anything else I missed or otherwise got wrong?
}}}
 * 30. External resources (like the mesoscale) have to be manually configured to be available. We should make a list of the resources that could connect and we want connected, and get them to build those in advance. Like the mesoscale.
{{{
Configuration for meso-scale is worth pursuing; need answers to Josh's use cases first.

There will be a hard partitioning between resources controlled by the ExoGENI SM and other resources. Changing the partitioning is pretty straightforward and not very disruptive.

We are initially planning to hand 10 VLANs that we have provisioned to our FrameNet endpoint and 1 VLAN that is provisioned to our ION endpoint to the ExoGENI team. Initially, they will control these VLANs with the ExoGENI SM.
We can give them more VLANs later, assuming it is easy. We only plan on provisioning a single OpenFlow VLAN to the ExoGENI rack in our lab (1750) to start.

Currently we are thinking about provisioning extra special use OpenFlow VLANs from each mesoscale campus. We will have to let the ExoGENI team know how to reach these VLANs once we actually provision them. This should be as simple as provisioning VLANs down to the rack and letting the rack know the VLAN IDs.

We are also thinking about having mesoscale campuses have a set of non-OF controlled VLANs. I think we should just be able to tell the ExoGENI team the VLAN ID and the endpoint (FrameNet, ION, etc) that the mesoscale campus uses, and then the ExoGENI SM should be able to connect to that VLAN.
}}}
= Nick Bastin's Questions =
Network:
 * ~~ B-1. Why not use FOAM everywhere?~~ (hpd)
 * B-2. Why not run pure OpenFlow and slice on VLAN in FlowVisor w/translation at the rack edge?
 * ~~ B-3. How is IP space managed within the rack environment - can experimenters request more / specific IP space? ~~ (hpd) (Duplicate of question 10b)
 * B-4. The OpenFlow control channel looks to be extremely throughput constrained.
 * B-5(1). Does the switch not support the ENQUEUE action at all, or does it just not support all the openflow packet-queue structures?
{{{
B-5. Is there an IPMI connection from the head node to the management switch? If so I think that makes for 45 management switch ports used.

 o Worker node. IBM x3650 with Virtual Media Key. 1 port for vKVM/IPMI/etc, 2 ports for 1GbE traffic. Total of 30 (assuming 10 worker nodes)
 o Head node. IBM x3650 with Virtual Media Key. 1 port for vKVM/IPMI/etc, 8 ports for 1GbE traffic. Total of 9.
 o iSCSI enclosure. Redundant controllers, each with 2 ports. Total of 4.
 o Juniper VPN appliance. 1 WAN port, 1 LAN port.
 o PDU. 1 port (For 208V based PDU's)

How many ports in total get used on the management switch will depend on the connectivity from each campus.
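[Editorial sketch, not part of the original thread.] The per-device counts listed above can be tallied as follows. This is only a cross-check under stated assumptions: the 48-port switch size and the campus_uplinks parameter are hypothetical inputs for illustration, and the device names are just labels.

```python
# Management-switch port tally, using the per-device counts listed above.
# Assumptions (not taken from the design document): a 48-port management
# switch, and a caller-supplied number of campus uplinks.

WORKER_NODES = 10  # "assuming 10 worker nodes"

PORTS_PER_DEVICE = {
    "worker nodes": WORKER_NODES * (1 + 2),  # 1 vKVM/IPMI + 2 x 1GbE each
    "head node": 1 + 8,                      # 1 vKVM/IPMI + 8 x 1GbE
    "iSCSI enclosure": 2 * 2,                # 2 controllers x 2 ports
    "Juniper VPN appliance": 1 + 1,          # 1 WAN + 1 LAN
    "PDU": 1,                                # 208V based PDUs only
}


def ports_used(campus_uplinks: int = 0) -> int:
    """Total management-switch ports in use, including campus uplinks."""
    return sum(PORTS_PER_DEVICE.values()) + campus_uplinks


def ports_free(switch_ports: int = 48, campus_uplinks: int = 0) -> int:
    """Ports left over on a switch of the given (assumed) size."""
    return switch_ports - ports_used(campus_uplinks)


if __name__ == "__main__":
    for device, count in PORTS_PER_DEVICE.items():
        print(f"{device:>22}: {count}")
    print("in-rack total:", ports_used())
    print("with one campus uplink:", ports_used(campus_uplinks=1))
    print("free ports (48-port switch assumed):", ports_free(48, 1))
```

Changing WORKER_NODES or campus_uplinks shows how quickly the remaining headroom on the switch disappears.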
If for example we can ONLY get 1 1GbE connection from campus the total will be 47 (46 from above, plus campus into the management switch). That would be the worst case situation and leaves us 1 open 1GbE port on the switch. Ok, I was just working off of the table on page 4 of the design document that has 44 ports used on the management switch. It's a little hard to reconcile that table with figures 1 and 2, as well as the text. Figure 1 has a red line connecting the management switch "to campus layer 3 network", and figure 2 has a line connecting the management switch to the Juniper SSG5 (which is not in figure 1), and no other connection to an outside L3 resource. The text in 2.1 states "The connections to the commodity Internet via the campus network is expected to serve management access by staff as well as experimenters" - I read this to mean that all control-plane access (management and experimenter) would be coming in over the SSG5. So, I guess the new question is, is there a direct campus L3 connection to the management switch, as well as a connection to the SSG5? Also, do you really mean that the SSG5 is connected twice to the management switch? (I understand how that would work, I'm just trying to figure out if that's what you mean) }}} Rack Configuration: * B-5(2). Is there an IPMI connection from the head node to the management switch? If so I think that makes for 45 management switch ports used. * B-6 I am concerned that the head node is under provisioned for all the services it needs to run - 12GB of ram seems low. {{{ We don't have empirical evidence that 12GB of memory won't be enough. We felt it was a safe starting value, but ensured there are free DIMM slots if we need to expand to 24GB or 36GB. Although the cost of 2GB DIMM's vs 4GB's isn't significant, when multiplied out to 12-14 sites it was enough that we decided to start with 12GB. If we decide later to move to 24GB, we'll expand future racks so they come from IBM that way. 
15) They haven't tested the head node, when the FlowVisor and FOAM are getting actively used, to check for performance problems. It's unclear if there is an issue here or not, but the only real solution appears to be to double the RAM - which they can do later if necessary. They can do it later if necessary, but why not do it sooner? I'm curious what the actual numbers involved here are: my personal experience has been that RAM is (a) cheap, and (b) always the thing you're short of. I know they said they could send a tech on-site, but, for a new installation, 12GB of RAM should be something like $150. There's one head node per rack, and how many racks the first year? Again, i don't know the actual numbers or tradeoffs, but i think it's very likely that this is cheap and may solve a real problem. So IMHO they should just do it while it's early enough to never have to think about it again. Fair enough. I would also ask that we revisit the plan for all the software on the management node to be installed in the same OS instance - I really think this should be a virtualized environment (particularly because both FOAM and FlowVisor do not currently have RPM package builds). This will put significant constraints on the software to use the same JVM versions, etc., or create an integration challenge to create separate environments for the software to run in }}} * ~~B-7. How is the head node configured - do the services run in their own VMs, or do they need to co-exist on the same OS instance?~~ (jbs) {{{ The VM option remains open, however currently we are not seeing any software conflicts that would require that. VMs will take some performance overhead and they may make it more difficult to communicate between some elements of the software stack. We have already built most of the components on our OS of choice - CentOS 6.2 and we're not seeing any conflicts. 
Despite the fact the CentOS/RedHat is not always officially supported, there are usually instructions for advanced users on how to build the software that seem to work. Aha, ok. It might mean that you have to do more ongoing work to track updates to those components, if new versions don't build as cleanly, but, as you say, we can revisit if it turns out to be a problem. I think it's probably fine to call this closed, but since it was originally on Nick's list, I wanted to give him (or anyone else with contrary opinions) a chance to chime in before I crossed it off. B-7. How is the head node configured - do the services run in their own VMs, or do they need to co-exist on the same OS instance? ISSUE: We (GPO) think it would be better if the head node ran VMs, so that the various software that needs to run there can run in a more isolated environment, on its preferred OS; but it sounds like that's not how RENCI is planning to do it at this point. If you prefer the all-in-one-OS approach, can you talk more (maybe fork off a separate thread) about why? The VM option remains open, however currently we are not seeing any software conflicts that would require that. VMs will take some performance overhead and they may make it more difficult to communicate between some elements of the software stack. Aha, ok. It might mean that you have to do more ongoing work to track updates to those components, if new versions don't build as cleanly, but, as you say, we can revisit if it turns out to be a problem. I think it's probably fine to call this closed, but since it was originally on Nick's list, I wanted to give him (or anyone else with contrary opinions) a chance to chime in before I crossed it off. We have already built most of the components on our OS of choice - CentOS 6.2 and we're not seeing any conflicts. Despite the fact the CentOS/RedHat is not always officially supported, there are usually instructions for advanced users on how to build the software that seem to work. 
OK

At the very least we're likely to run into the need to move common services (like SNMP) to custom ports, but I'm also concerned about finding ourselves in a situation where we have conflicts in required JVMs or similar (FlowVisor already trips over some known issues in commonly distributed JVMs) or Python versions.

We use JREs downloaded from Oracle site, not shipped with the distro. CentOS 6.2 seems to be reasonably up to date with python (2.6.6 is the stock version). Which components have SNMP interfaces on them?

My summary is that Ilia is optimistic that there won't be any issues, Nick is pessimistic that there will be, Ilia has said that they'll revisit if there are, and that this is fine with us for now.

Both FlowVisor and FOAM will have SNMP interfaces in the medium term. The suggested use case for most installations would be that they would disable the FV interface and just use the FOAM one if they were running both, but that will be more difficult if FOAM doesn't know detailed information about everything in FlowVisor. Also, I'm not saying there are necessarily any problems right *now* with JVM/Python versions etc, but this will be an ongoing software qualification concern when individual components become available with new versions.
}}}
 * ~~ B-8. PDUs are also useful for remote management if a node gets completely bricked (such that IPMI is useless) - I would think that the marginal cost would be more than worth it.~~ (hpd) (we're helping RENCI to work on in the first couple of rack integration efforts. Ticket #3354)
{{{
IBM doesn't offer switched PDU's with 120V on their standard Bill of Material. The 208V units on their standard BoM are switched and monitored. For the first 2 racks (RENCI and BBN) we are sticking with IBM's standard BoM because to use non-standard BoM parts means it can't be assembled in the factory and has to go to the "Integration Center" which increases the lead time. So for the BBN rack, we won't have switching.
We hope for other sites that can only support 120V power we will be able to identify with IBM a reasonable switched PDU they can install.

I've forgotten, can we take a 208V unit? If so, then if that would get us a switched PDU, then it might be worth doing.
}}}
Resources:
 * ~~ B-9. Why not allow arbitrary bare-metal images? Is this any more dangerous than arbitrary VM images? ~~ (hpd) (Duplicate of question S.27)
{{{
As discussed briefly in the concall, the reason to not allow custom bare metal images is twofold. 1) The decrease in security because users will have direct access to the bare metal network interface which connects to the management switch. 2) The complexity of creating a bare metal image means the user would have to have a system identical to the one inside the ExoGENI racks so they could load all the hardware drivers, etc. I don't think we've ruled out the possibility 100% and if a user provides a compelling reason for why they need it, then we can consider it.
But I think we have enough on our plates with the initial deployment without adding this level of complexity on day one. }}} * ~~B-10. Where is the storage for the running instances - on the worker nodes?~~ (hpd) {{{ We will have the ability to provide storage either on the running worker or via NFS from the head. Long term plans include being able to provision raw iSCSI luns from the iSCSI unit with a slice and make those available as well. }}} * B-11. What are the average IOPS available for each VM on a fully loaded (max running VMs) worker node? {{{ Each worker has 2 hard drives. 1 146GB 10K RPM SAS and 1 600GB 10K RPM SAS. In the case of a VM worker, the OS (CentOS 6) will be installed on the 146GB drive and all the VM's storage will be installed on the 600GB drive. In a bare metal install the user would have access to both and could use them as they saw fit. The "standard" rating for a single 10K RPM SAS spindle is 180 IOPS. There are 6 drive slots on each worker, we can add more spindles, but for each spindle we add, we remove 1 worker because of the cost (i.e. 9 2.5" 600GB SAS spindles = about $4000, or the cost of a worker). In all the infrastructure designs it was a delicate balancing act between available funds and performance. Our goal being to build something that was usable today but extensible for the future. The first 2 racks are our on the job training. We fully expect that after these first 2 racks we will tweak the hardware configurations with IBM and hopefully have a smooth flow from IBM's integration center to the other sites for the remaining 10-12. This seems optimistic - the latency of a 10k rpm spindle with a 2.5" platter is 3ms, and the IBM 5433 (the 600GB drive in question) has a 4.2ms average read seek time (writes are slower, but we'll be optimistic here for the purposes of this discussion), which makes for ~139 IOPS (1 / 0.0072). 
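[Editorial sketch, not part of the original thread.] The estimate above treats one random I/O as costing one average rotational latency plus one average seek. A minimal version of that arithmetic is below; the count of 10 running VMs is purely a hypothetical figure for illustration, since the thread does not state the maximum number of VMs per worker.

```python
# Single-spindle random IOPS estimate, following the arithmetic above:
# one I/O costs (average rotational latency + average seek time).

def spindle_iops(rotational_latency_ms: float, seek_ms: float) -> float:
    """Approximate random IOPS for one spindle."""
    return 1000.0 / (rotational_latency_ms + seek_ms)


def iops_per_vm(total_iops: float, running_vms: int) -> float:
    """Even split of a spindle's IOPS across VMs (ignores caching and workload mix)."""
    return total_iops / running_vms


if __name__ == "__main__":
    iops = spindle_iops(rotational_latency_ms=3.0, seek_ms=4.2)
    print(f"one 10K RPM spindle: ~{iops:.0f} IOPS")
    # HYPOTHETICAL VM count, for illustration only.
    print(f"per VM with 10 VMs: ~{iops_per_vm(iops, 10):.1f} IOPS")
```

An even split is the crudest possible model; it only gives a rough floor for B-11 until the actual workload is known.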
Of course, neither of these numbers are particularly useful if we don't have an idea of the workload - more on this below.

I've been doing some math on the back of some napkins and I think that might be a net positive tradeoff for total VM capacity based on a variety of workload calculations (although factoring bare metal into this makes that calculus more complicated). I still have some work to do on this, so I'll follow up later with my thoughts.
}}}
= Adam Slagell's Questions =
Software/Firmware Update
 * ~~S.1 What part of the software stack does ExoGENI take responsibility for maintaining updates? Is there anything they don't?~~ (chaos, based on adam's comment)
{{{
Sounds like VM/BM images and all the software that comes with the racks. I didn't see any gaps or buyer bewares.

We will take care of software updates. The only buyer-beware concerns the operation of FOAM - we don't want to be in the business of approving user slices in FOAM and think this needs to be done by GPO or GPO delegate.
}}}
 * S.2 Is there an automated update system? If so, how is integrity ensured?
{{{
Sounds like no.

Not at this time. The software is too diverse. Maybe for the system images some sort of integrity verification using digital signatures is feasible. Currently the images for VMs go through such a verification - the user submits a URL and a SHA-1 hash of the image they want booted. For bare-metal images if we add filesystem integrity verification, it can cover the images locally cached on the head node.

1. Auto update system: There are no plans for an autoupdate system for the GENI racks. With a large and complex software stack and many racks at many institutions, this could become problematic to keep up-to-date. The quickest way to a security incident is to have out of date software. BTW, isn't there a GENI project (by Justin Cappos I think) that is supposed to help make getting secure and reliable updates easy?
}}}
 * S.3 Is there a service guarantee for updates?
Say a FlowVisor vulnerability is found and a patch made. How quickly can you push out updates?
{{{
Since none of the GENI software I know runs as root, I think we can be relatively lax about this. I would say 72 hours if it is a straight-forward update that does not require significant reconfiguration and repackaging.

2. Vulnerability management: Any major system going out needs a plan for monitoring and investigating vulnerability impacts. The more complex the software stack and the more things that depart from a vanilla OS distribution, the harder this becomes. You need to (1) be aware of all potential vulnerabilities (challenging for a complex software stack), (2) test for exploitability, (3) determine impact, (4) test patch or mitigation, and (5) push out a solution all very rapidly. The previous comment in #1 really addresses just the last bit, and I see no vulnerability management plan into which you could insert it now.

There's been talk of several strategies, and no single solution will get it all done. We will all know what the state of OS patching looks like, since I have a Nagios/Check_MK plugin that essentially runs a 'yum check-update'. It does this with the security plugin enabled. The result is, for each host, we will know how many updates it needs, and how many of those updates are security-related. Of course, this only helps us with the base OS; it cannot address potential vulnerabilities in the GENI-ORCA-OpenStack-NEuca world. Ilia will have to comment on the latter. In terms of stopping SSH brute force attacks, I think denyhosts is a good way to go. But our sshd is tcpwrappered by default anyway (set up by kickstart). This kind of attack won't be an issue.

There's also the VM and bare metal images as well, right?

Regarding the VM images - since users are allowed to boot their own, the main weapon we have there is the ability to match resources to slices and shut down misbehaving resources.
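[Editorial sketch, not part of the original thread.] Returning to the base-OS patch check mentioned earlier in this answer (the Nagios/Check_MK plugin that runs 'yum check-update' with the security plugin): the actual plugin is not shown in this thread, so the following is only a guess at its general shape. The function names and the sample output are hypothetical; the parsing is deliberately simple and would miscount if yum wraps a long package name across lines or prints an "Obsoleting Packages" section.

```python
import subprocess


def count_updates(check_update_output: str) -> int:
    """Count package lines ("name.arch  version  repo") in check-update output."""
    count = 0
    for line in check_update_output.splitlines():
        fields = line.split()
        # A package line has exactly three columns, and the first is name.arch.
        if len(fields) == 3 and "." in fields[0]:
            count += 1
    return count


def run_check_update(security_only: bool = False) -> str:
    """Run 'yum check-update'; --security needs the yum security plugin installed."""
    cmd = ["yum", "-q", "check-update"]
    if security_only:
        cmd.insert(2, "--security")
    result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    # yum exits 100 when updates are available and 0 when there are none.
    if result.returncode not in (0, 100):
        raise RuntimeError("yum check-update failed with exit code %d" % result.returncode)
    return result.stdout


# Illustrative sample only; package names and versions are made up.
SAMPLE_OUTPUT = """
kernel.x86_64        2.6.32-220.4.1.el6    updates
openssl.x86_64       1.0.0-20.el6_2.2      updates
"""

if __name__ == "__main__":
    print("updates in sample:", count_updates(SAMPLE_OUTPUT))
    # On a real host, something like:
    #   total = count_updates(run_check_update())
    #   security = count_updates(run_check_update(security_only=True))
```

Running the check twice (with and without the security restriction) gives the two per-host numbers described above: total pending updates and security-related ones.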
Bare-metal images will be restricted to a small selection (size 1 initially). The problem with frequently changing/updating those is that it makes repeatable experimentation more difficult, e.g. if an experimenter expects a certain image with certain versions of kernel, drivers and software and we continuously move that mark. The GPO will need to weigh in on what is more important - repeatability or the potential impact on security, because this is an important tradeoff we're talking about here.

Good point. Also, I'd like to add, based on conversations yesterday, neither VMs nor bare-metal servers will have direct internet access. Our plan is to proxy all public IP traffic through the headnode at each site, using iptables. This gives us the opportunity to shutdown a site very quickly if there is a report of a problem, but keep the problem system running (VM or bare metal), so we can analyze what is going on and resolve the issue with the experimenter.

I'm not sure what you are saying exactly here. Are they private IPs that are NATed, are they going through an application layer gateway? What do you mean by not direct?

Hmm, my impression is that if we wanted to create a new bare-metal image, we wouldn't necessarily delete old one(s), but rather that the list would grow over time. Ah, but that may not have been what you meant: Indeed, if we update an existing image to fix security problems, that would potentially have an impact on repeatability. I think we'd need to at least identify that the image had changed (e.g. by changing its name), so an experimenter would be aware of that, and could re-validate that their experiment still produced the same results after the change. We could also devise some way for the experimenter to capture the vulnerable image, so they could run it somewhere else if they felt the need. (Or just boot it up on an isolated system of their own so that they could look at it, or whatever.)

I was assuming you would add new images that you support over time, but existing ones would get security patches as time goes by. Of course you'd want to enumerate them and specify how they differ, perhaps in /CHANGELOG.txt or something.

I don't know that we can guarantee that a particular 'security' patch will not affect the performance of one or other of the kernel subsystems thus affecting repeatability.

Ja; I think "track, notify, and archive" is the right approach here.

I'd turn this around -- we know that many changes will affect performance, sometimes in only minor ways, but sometimes in major ways. If space is not a problem, I'd plan to keep every old version of standard images around. The naming convention is just a detail, but OS-version-exogeni-current might be the name that gets you the latest patched/supported image, but the logs would show you the precise version you got (OS-version-exogeni-x.y or -yyyy-mm-dd). If an experimenter is running a slice that is on a closed (virtual) network, e.g. configured so that only a fixed set of well-known machines can reach it, then it is possible to bring up even old images with security vulnerabilities and repeat earlier test runs or collect new data using those older images. If that same experimenter wants to run on a slice that provides "service" to some larger, open set of users (on campuses or wherever), then they are going to appreciate having automatic support for getting the latest OS patches into the base images. I'm going to guess that we will see both sorts of use cases, but more "closed networks" first.

Sounds like a reasonable balance.

> The GPO will need to weigh in on what is more important - repeatability
> or the potential impact on security, because this is an important
> tradeoff we're talking about here.

So, my two cents: in our lab, we do try to apply OS updates to our experimental images, the same way we would to any other nodes we run.
I think having an update schedule which applies to experimental OS images for which standard patches are available, as well as for servers, is a good idea. If you can flag your images with metadata saying when they were last updated, so that experimenters know, so much the better. And if it's possible to keep old images around in case someone has a special-case need for one, again, that's a feature. I agree that it's a tradeoff, but I think doing periodic updates of images is the better bet.
}}}
 * S.4 Will there be someone actively monitoring for vulnerabilities on the entire software stack, or is it best effort (e.g., we update all the problems we are told about by someone else)?
{{{
At this point there is no dedicated person. However, our ACIS group (members of which are part of the operations staff) are usually aware of the latest vulnerabilities as part of their data center responsibilities.

It may be worth setting up Google Alerts on Bugtraq for all the software.

I'll ask our ACIS folks what they do today.
}}}

Log Collection & Management

 * S.5 What do you log and how?
{{{
ORCA actor state transitions and handler execution outputs. We will log entire manifests to make them available to GMOC. The manifest will be the main vehicle for correlating substrate to slivers. There are syslogs on individual hosts as well. Other elements (FlowVisor, FOAM) have their own logs.

4. Logging: I think remote logging is a must for integrity and availability. This should be for syslogs and AM transactions that are needed to maintain accountability of actions. Some additional integrity checking on the hosts is nice, but icing on the cake.

The remote logging infrastructure is mostly complete. There is a central server, in a protected VLAN deep in the heart of RENCI, running rsyslog on CentOS 6.2. It only accepts connections on a high-numbered port, using RELP, from control.exogeni.net. The latter is a forwarder for all logs.
We have a simple LogAnalyzer web interface to the central rsyslog box (which is syslog.exogeni.renci.org). This is protected with SSL, Apache basic auth, using LDAPS to authenticate to ldap.exogeni.net. What remains to be done here involves making all the nodes in each rack forward their messages appropriately. And lastly, if there are any non-standard logs we need to capture (for instance, OpenStack, Neuca, or ORCA logs), I'll need to create a template for handling them.
}}}
 * S.6 Are remote copies logged?
{{{
Not at this time.
}}}
 * S.7 Do you do anything special on the racks to maintain the integrity of the logs?
{{{
Not at this time.
}}}
 * S.7.B What about other file integrity checking for config files and critical system files?
{{{
Has not been considered so far, but I think it can be added.

Also useful is minimizing setuid programs or watching for changes to setuid bits.

Noted.

3. SetUIDs and configuration management: I think that it is good that most things don't need to run as root on the racks, but the number of setuid programs should be minimized too. Once you have the list, I think xCAT has decent configuration management utilities to make sure security hardening policies like that persist across upgrades and changes. If not, you should have a plan on how to make sure that updates don't move you to a less secure state by modifying configuration unintentionally.
}}}
 * S.8 Do you log enough to map timestamp/IP/port tuple to a particular slice?
{{{
Sounds like the information is there, though it may take some manual investigation, especially if NAT was involved.

8) They haven't really worked out logging, but mostly hope to just send everything to GMOC and be done. This is probably just fine. This is essentially 'racks will log to a remote Logging API' which is consistent with recent architecture group discussions.
We just need to (a) ensure we are asking for all the right bits of information, and (b) have them at least outline the algorithm for going through all those logs to get the information we really need (e.g., what slice ID used IP X Port Y at time Z?). We should check more specifically on what is stored on the racks in terms of logs, if anything.

Manifests have the information.
}}}
 * S.8.B What if you bridge some other device into GENI through your AM but hide it behind your NAT? For example, could there be some campus device causing a problem that shows up as one of the IPs on your rack but is not actually under your control? And in that case, could you determine from your logs the device and what slice it was a part of?
{{{
In the current architecture this is not really possible. The IP addresses given to the rack are used by rack resources only.
}}}
 * ~~S.9 Can you easily tell what slices are running on a given rack? How about each node on a rack?~~ (chaos, based on Adam's comment)
{{{
Sounds like that is not a problem.

Yes, although we need to do better with respect to making this information available in an easier form.
}}}
 * S.10 How long do you keep local copies of the logs?
{{{
Depends on the verbosity. Once manifests start getting published onto the XMPP bus, this will no longer be an issue, as a separate log repository can slurp them up and keep them in one place. The syslog logs probably should be configured to go to a central syslog server in addition to having a local copy.
}}}
 * S.11 Is there a mechanism that could be used to send allocation log information back to the clearinghouse for global policy verification for slices?
{{{
XMPP bus - we want to use it as the means to make this data available to multiple consumers.
}}}

Administrative Interfaces

 * S.12 What is the authentication mechanism for the VPN?
{{{
LDAP + possibly RADIUS slaved to LDAP (for switches).

LDAP would be for authorization, but what kind of credential would be used for authentication?
Maybe I am missing something.

LDAP stores usernames and passwords (as well as groups, which would be used to partition rights). RADIUS can read LDAP.
}}}
 * S.13 Does being on the VPN on one rack get you to the admin interfaces of all the others, or is this one way from RENCI?
{{{
One way from RENCI.
}}}
 * S.13.B How does one authenticate to the admin interface (separate from the VPN)? Is it root login?
{{{
Depends on the device (e.g. a switch vs. a compute node). We opt for sudo whenever possible.
}}}
 * S.14 Are the credentials used to authenticate to the admin interface different for each rack?
{{{
This has not been discussed or codified.

When architecting this, it would be good to strive for containment. So if one unscrupulous person with a GENI rack reverse engineers something, it doesn't give them the credentials they would need to do bad things to other racks. It can complicate initial setups but probably pays off in the long run.

Noted.
}}}
 * S.14.B What about within a rack, is the root or admin password the same for each node/device?
{{{
We tend to use the same password for all worker nodes currently.

I think within a rack, all nodes of the same type could be considered at the same level of trust and treated this way.

Noted.
}}}
 * S.15 Is authentication for admins the same whether they log in through the VPN or SSH into the head node?
{{{
LDAP will be the back end, so yes.

So again, I am confused. LDAP as I have seen it used is just for authorization. There are still SSH keys or passwords or OTP tokens for different accounts.

LDAP stores usernames and passwords. SSH uses PAM on the end hosts to talk to LDAP over an SSL channel. Switches use RADIUS that is slaved to LDAP directly.

OK, makes sense. Though I presume it is actually salt and hashes stored.

Yes, the passwords stored in LDAP are not plain-text. Typically an MD5 hash is used.
}}}
 * S.15.B Are the SSH credentials to the head node different for each rack or shared?
{{{
Same as S.14.
I don't know that these two questions are different.

So here I am talking about two separate racks installed at different institutions. Would a password or key that a local admin used to SSH into the head node at University X also let them do the same at University Y?

It is likely we will use LDAP groups to partition users such that users are limited to specific racks. Root logins will likely be disabled (and we may disallow 'sudo su -' for most users).
}}}
 * S.16 How is accountability of actions recorded if there is more than one admin, or is it just a shared root login?
{{{
We tend to use sudo, so some of the commands and privilege escalations are logged.
}}}
 * S.17 Does the KVM for console access have a network interface that gives remote console access?
{{{
Sounds like NO.

No.
}}}
 * S.18 What devices and interfaces can you see from the VPN interface?
{{{
All of them.
}}}
 * S.18.B Does this differ for those logging in through the head node?
{{{
No. Head node access is a redundant means to do the same.
}}}
 * S.19 Would the hosting organization have a different admin interface?
{{{
No, just a different set of logins with different credentials. Hosting organizations probably will not have VPN access.
}}}
 * S.20 Is the only authentication mechanism password based, or are two-factor auth or SSH keys used?
{{{
Right now based on LDAP passwords only.

Oh, so you are using LDAP to distribute something like the /etc/shadow file? So here, we use LDAP just to essentially distribute /etc/passwd, but authentication is done through PAM with Kerberos or OTP. Am I understanding this right, that LDAP does both for you, sort of like NIS?

Sort of. Except we don't distribute /etc/passwd - PAM talks to LDAP live and there usually is a caching daemon that caches the getpasswd entries temporarily.

5. Remote root access: It was not clear whether remote root login was allowed anywhere. I read that sudo was used when possible, but I would hope no sshd_config files allow remote root login.

Root SSH is disabled by default in our kickstarts.
}}}
 * S.21 If SSH keys are used anywhere, are they stored unencrypted on any of these racks?
{{{
I suspect yes with xCAT.

Yes.
}}}
 * S.22 If SSH keys are used, are they different for different racks?
{{{
We will probably generate different keys.
}}}
 * S.23 If passwordless SSH keys are used, can they be used multi-directionally? For example, if an xCAT process needs to use them to do something on a less trusted part of the system, that other piece should not be able to use the same key to ssh back into the xCAT manager.
{{{
xCAT uses only explicitly registered keys, so this can be avoided. However, we will disallow node-to-node logins as per: http://sourceforge.net/apps/mediawiki/xcat/index.php?title=Disable_node_to_node_root_passwordless_access
}}}
 * S.24 Do the admin interfaces need to connect back to anywhere, initiating outbound connections?
{{{
Not that I know of.
}}}
 * S.25 What is meant by "Since ExoGENI slices have management network access via the commodity Internet, this is the default behavior." on pg 13? (Perhaps you will have explained this by now and can ignore.)
{{{
This simply says that if you don't care about isolated connectivity between slivers, you always have the commodity Internet connecting them.
}}}

Isolation

 * ~~S.26 Are you tired yet? I am. :-)~~ (chaos, per Adam's comment)
 * S.27 What is the vetting process for bare metal nodes?
{{{
Sounds like no process yet, but there is recognition that we don't want bare metal hosts to be able to sniff in promiscuous mode and break the nice isolation properties.

26. Dataplane reachability testing: We think it would be a good idea to have two types of tests to go with the two types of VLANs:
 * Where the ExoGENI AM is used to provision a VLAN, we'd like to see a test which stands up a VLAN, verifies that it can be used, and reports to monitoring on whether that entire system (which includes the AM, of course) is healthy.
I believe you discussed doing something like this already: does what I just said sound similar to what you have in mind?
 * Where an ExoGENI rack is going to be connected to a static (long-standing) VLAN outside of the rack, e.g. to the shared mesoscale VLANs or to a longstanding L2 connection to non-rack resources at a particular site, we'd like to see a static test interface on each VLAN which could be used to verify connectivity. It would be ideal if the test interface were non-OpenFlow-controlled on the rack, so that it could be used entirely to test "is this link up?".
Does this seem reasonable?

No process yet.

7. Image vetting: I think a process, or maybe a set of criteria, is needed for vetting bare metal images. What are the requirements? Things such as an "inability to sniff traffic in promiscuous mode on the NICs" would fit into such a list.

Is the example you propose below an actual proposed requirement or just a for instance? I ask because the capabilities that come immediately to my mind as wanting bare metal seem likely to want to do exactly this.

It was being proposed and an example. I think it is desirable to prevent from a security perspective because it provides better isolation of slices.

Most of these are non-controversial (at least to security folks!) but I didn't quite understand a couple points, maybe because I joined the review late and will admit to not reading all the messages. 7. Image vetting: ... are GENI researchers going to be able to sudo root on bare metal images? (I would have presumed yes, but maybe that isn't the model.)

I did not presume so because there was talk about state being preserved between jobs/users. If they aren't wiping images between experiments and users have root access, then there is a whole other security issue.

> It was being proposed and an example. I think it is desirable to prevent
> from a security perspective because it provides better isolation of slices.

The ability to capture traffic from a promiscuous NIC on a bare-metal image has no impact on slice isolation. This is very much something that we should allow.

It depends. If it allows me to watch traffic on other slices and there is any expectation of privacy, then it does impact a form of isolation. If there is neither an expectation nor a promise of privacy, or switching would prevent one from seeing such traffic even if in promiscuous mode, then it isn't an issue. I don't know the answer to either of those questions, though.

The privacy question is a good one, and should be discussed, but isn't a factor here. If you have bare metal, you have exclusive access to the switch port and can't capture traffic that belongs to another slice.
}}}
 * S.28 Are the bare metal hosts diskless?
{{{
No, they have 146GB FS for OS and 600GB FS for data. However, they are wiped clean and reinstalled from a fresh vetted image between allocations. State is gone.

We're still debating whether we want stateful or stateless bare-metal nodes. Both options are open.

The nice thing about stateless is that it is more like a whitelist. If there is state left behind, you have to always wonder if you thought of everything that you need to clean up in between users.

This is still TBD and there are advantages to both.
}}}
 * S.29 What are the main isolation mechanisms between slices?
{{{
VM hypervisors or wiped bare-metal systems isolate experiments at system level. At the network level, this is done with VLANs. The same VLAN won't have slivers from multiple slices.

Yes. VLANs have QoS associated with them wherever possible (rate and buffer size limits).

6. Isolation between racks: Isolation between racks is important, especially since these are distributed across the country. Reverse engineering something at one rack should not result in some class-wide vulnerability that affects all racks.
Companies like IBM often like to install things with default keys and passwords, and you really need to make sure those are changed and individualized for different racks. Any password hash on a rack off-site is accessible and potentially crackable.

Most of these are non-controversial (at least to security folks!) but I didn't quite understand a couple points, maybe because I joined the review late and will admit to not reading all the messages. 6. ... "Any password hash on a rack off-site is accessible..." ... I thought all these racks were getting installed in "well-known" facilities. So while remote, they aren't exactly in physically unprotected locations, right?

I don't know how much we trust the administrators at the dozens and eventually hundreds of sites. Might students be admins of some of these racks? If it really is a small set of trusted admins with racks on data center floors, then it is less of an issue.
}}}

Miscellaneous

 * S.30 For each rack, could the aggregate operator give a concrete block of IP addresses unique to it?
{{{
Sounds like this is a policy issue and could be made a part of the configuration guidelines for each rack. It is helpful for the LLR to be able to tell quickly from the IP whether something is from a GENI rack and at which organization. A block or a list of addresses is fine.
}}}
 * S.31 Are any user credentials stored anywhere, even temporarily? If so, how are they protected and how long do they live?
{{{
You argue this is not applicable in the white paper, I think?

If ABAC is adopted in GENI, user certs may be cached on the head node as part of the authorization process. They, however, constitute public information and do not require confidentiality protection.
}}}
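
S.8 asks for an outline of the algorithm that answers "what slice ID used IP X Port Y at time Z?". The following is a minimal sketch only, assuming the manifests have already been reduced to one record per allocation. The `Allocation` record, its field names, and the sample values are hypothetical and are not the ORCA manifest format; the port range stands in for whatever NAT or proxy mapping the head node applies.

{{{
#!python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Allocation:
    """Hypothetical manifest-derived record: which slice held ip:ports, and when."""
    slice_id: str
    ip: str
    port_lo: int
    port_hi: int
    start: datetime
    end: datetime


def find_slices(records: List[Allocation], ip: str, port: int, when: datetime) -> List[str]:
    """Return the sorted slice IDs whose allocation covers (ip, port) at time `when`."""
    return sorted({
        r.slice_id
        for r in records
        if r.ip == ip
        and r.port_lo <= port <= r.port_hi
        and r.start <= when < r.end  # end is exclusive, so back-to-back allocations do not overlap
    })


# Made-up sample data using a documentation-only address range.
records = [
    Allocation("slice-a", "192.0.2.10", 20000, 20999,
               datetime(2012, 3, 1, 9, 0), datetime(2012, 3, 1, 17, 0)),
    Allocation("slice-b", "192.0.2.10", 21000, 21999,
               datetime(2012, 3, 1, 9, 0), datetime(2012, 3, 2, 9, 0)),
    Allocation("slice-c", "192.0.2.10", 20000, 20999,
               datetime(2012, 3, 1, 17, 0), datetime(2012, 3, 3, 9, 0)),
]

print(find_slices(records, "192.0.2.10", 20500, datetime(2012, 3, 1, 18, 30)))
}}}

Returning a list rather than a single ID keeps ambiguous or overlapping records visible instead of silently picking one, which matches the note that some manual investigation may still be needed when NAT is involved. All timestamps are assumed to be in the same time zone.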