wiki:GENIRacksHome/ExogeniOpenQuestions

Version 1 (modified by lnevers@bbn.com, 7 years ago)

--

These are numbered questions sent to exogeni-design@geni.net; discussions are captured to track resolution. Questions are crossed out when answered. The person closing a question should add his/her name to the question with a brief explanation. Any document that is referred to with a URL is attached to the page for historical reference. The note attributions are as follows:

AS: Adam Slagell           CE: Chip Elliot         IB: Ilia Baldine             JS: Josh Smift         NR: Niky Riga
AH: Aaron Helsinger        CG: Chaos Golubitsky    JC: Jeff Chase               LN: Luisa Nevers       VT: Vic Thomas
BV: Brad Viviano           HD: Heidi Dempsey       JM: Jonathan Mills           NB: Nick Bastin        TU: Tim Upthegrove

GPO ExoGENI Questions

  • 1. Does ExoSM speak GENI API? (nriga)
    <NR> Yes, ExoSM is just like any Orca SM running in a rack and can
         be thought of as a GENI AM that can make reservations in racks
         as well as provide the network connecting resources on different
         racks. Per Ilia's comments, ExoSM can also give an experimenter
         resources from only one rack by making a bound request that
         binds all resources to a specific rack. My understanding is
         also that all topology information is available to the
         experimenter through the GENI API (listresources) only through
         the ExoSM and not through the rack-local Orca SMs.
    
  • 2. Can you describe the ExoGENI software stack a bit more in the teleconf (Figure 7)? (ahelsing)
    • 2a. Is the AltG API the same as the Orca XMLRPC API at the SM? (ahelsing)
      <IB> Yes.
      
    • 2b. Can you draw the software stack for the worker nodes in the same style as Figure 7 for comparison? (ahelsing)
      <IB> Worker nodes are either turned off (booted and installed using
           xCAT when needed) or run CentOS 6.1 with an OpenStack worker node
           configuration.
      <JC> The cloud worker nodes also need a cloud node manager installed.
           This requires minor modifications for NEuca.  This is the thing
           that lets us create multiple interfaces on VMs and stitch them
           to other VLANs.
      <JS> Do we understand what controls how many nodes are bare-metal and how many
           are available for VMs? Can this be adjusted on the fly? By whom?
      <HD> The allocation question is a policy question; the mechanism should be defined later. Both are postponed.
      <HD> Josh to get software stack for Worker node.
      
      
  • 3. Is Eucalyptus or OpenStack used for the compute resources? (chaos)
    <IB> OpenStack
    
    • 3a. If OpenStack is being used, what testing or analysis convinced you to choose OpenStack? (chaos)
      <IB> We've done a performance comparison. OpenStack instances boot
           significantly faster (orders of magnitude) due to the use of
           COW for boot images. We can also see a path to making VM
           migration work between ExoGENI sites with OpenStack (could not
           figure it out with Eucalyptus).
      
  • 4. When will ExoGENI racks support xCAT-based bare-metal node allocation? (chaos)
    <IB> My hope by GEC13
    
  • 5. How do bare-metal images get vetted? (hpd)
    <IB> TBD
    <HD> Same as S.27, closing this one, Adam will follow up.
    
    • 5a. Given that VM images are unvetted, why vet bare-metal images?
      <IB> Security concerns. Also bare-metal images are harder to prepare.
           Mistakes will mean users occupy fairly limited bare-metal
           resources just learning how to boot them.
      <HD> Same as S.27, closing this one, Adam will follow up.
      
  • 6. Can we have more detail about disk images: (jbs)
    • 6a. How are central images selected? Is there a central repository? (jbs)
      <IB> For vetted images for bare metal. There may also be a small
           informal repo for sample VM images.
      <JC> For 6a-6b-6c.   ImageProxy can fetch images given a URL.  So
           people can put their images anywhere.  We have software (pod)
           to make it easy for users to upload images and share them.
           The user interface is a little rough, and it is not quite
           deploy-ready, but it could be used.   Ilia's concern (I think)
           is that we don't have budget to run and manage a repository
           server with a lot of disk.  But GPO could certainly host it.
      <AH> 13) Building new VM images takes work. (Q6)
           You have to add NEuca and maybe something OpenStack? This was hard with
           Eucalyptus, but maybe it is easier with OpenStack? Their answer that
           this is documented elsewhere isn't terribly re-assuring (since the
           Eucalyptus documentation wasn't enough for Tom). Do we want to check
           their list of images? Have them add 1 or 2? Have them collect/edit
           documentation for this process?
      <CG> Well, OpenStack may be a lot better for this --- we simply don't know.
           IMHO, the right answer to "this is documented elsewhere" is, "great,
           then it should be easy for you to make a wiki page pointing to usable
           procedures elsewhere".
           Well, actually: i said that, and then thought about the RENCI integration
           of external documentation and internal for ORCA/NEuca, which i have not
           found all that readable when i've tried to use.  So maybe we'd rather
           have them duplicate the steps that an experimenter would use to create
           an image?  I'm not sure here.
      <JS> I predict that we aren't planning to host a repository server. Are we? If
           not, do we think that someone else is? Do we want to push RENCI to do that?
      <JS> Answer: RENCI has a central repository at http://geni-images.renci.org/images/,
           which ExoGENI will use too (or maybe a subdirectory, or some such). Images
           for that repository must be reviewed by RENCI, the GPO, or our delegate.
           All vetted bare-metal images will live here, and a small number of
           commonly-used VM images could be hosted here too. RENCI or GPO will put
           together a nicer index page (it's currently just an Apache DirectoryIndex
           listing, with no comments or explanations)
      <IB> This is correct although as far as the bare-metal nodes are concerned
           the images will be cached in each rack and the booting will happen
           from there. I have put together a small page listing available VM
           images : https://geni-orca.renci.org/trac/wiki/neuca-images
      <JB> I've also rephrased some of the questions a bit from their original forms.
      
      
    • 6b. Are there default images hosted at RENCI? What are they? (jbs)
      <IB> We have a few on http://geni-images.renci.org/images/
      <JS> Answer: The exact images haven't been specified, but there isn't in
           principle any reason why we can't publish any images that we decide
           we want, within disk space limitations. We (GPO) will presumably use
           our rack to come up with some, probably focusing on modern and stable
           versions of Ubuntu and Fedora/CentOS.
      <IB> Yes and we encourage multiple locations (URLs/web servers) from which
           the images are served. A directory listing them can be stored in one
           place.
      
    • 6c. Will RENCI also store some user images? (jbs)
      <IB> Only a few.
      <JS> Answer: RENCI will only store experimenter-created images if they've
           been reviewed (see 6a), but ImageProxy can fetch and use an image
           from any experimenter-supplied URL, and RENCI has software that makes
           it easy for experimenters to upload images and share them, although
           it's not quite deployment-ready yet.
      <IB> Yes. Duke team is working on POD (Persistent Object Depository) that
           can fulfill this role. I repeat that this is an optional component -
           a user can create an image and serve it from *any* web server.
      
    • 6d. Will there be instructions for building custom images? (jbs)
      <IB> For VMs yes, although basically OpenStack, Eucalyptus and
           Amazon have pretty extensive guides on how to do that.
      <JS> Answer: RENCI will publish instructions for building VM images, and
           there are good general docs available from OpenStack, Eucalyptus, and
           Amazon too.
      <IB> Here are the current instructions:
           https://geni-orca.renci.org/trac/wiki/NEuca-guest-configuration
      
      
    • 6f. Must the experimenter add NEuca? (hpd)
      <IB> NEuca-py tools *should* be added to the image such that post
           boot configuration (IP address assignment to interfaces and
           post-boot scripts) would be done. Without it, bare interfaces
           will still be created based on NEuca INI script generated by
           ORCA for the desired topology and the user would have to
           manually configure them.
      
  • 10. Can we have more information about how the IP Address proxy options in the table on p. 4 work? Do the proxies expose all ports or just ssh? (jbs)
    <IB> Right now only SSH. The plan is to add the ability for the
         user to ask to expose some port ranges in addition to that.
         It's on the todo list and is not complicated.
    <AH> 12) They plan to NAT access to VMs, meaning that experimenter resources
         are only available via SSH or maybe in future specifically requested
         port ranges. (Q10 from original list)
         I think we want to know more here, and clarify our concerns and desires.
         Perhaps those 'future plans' are enough, but we need to know more (like
         a schedule).
    <JS> 10. Can we have more information about how the IP Address proxy options   
         in the table on p. 4 work? Do the proxies expose all ports or just ssh?
         Ilia had said "Right now only SSH. The plan is to add the ability for the
         user to ask to expose some port ranges in addition to that. It's on the
         todo list and is not complicated."
         That sounds good; is there a timeframe for that?
         Just to make sure the goal is clear, the idea is that experimenters may
         want to run TCP or UDP services on their VMs, and make it possible for
         users to connect to those services via the Internet.
    <JS> Answer: The plan is to add this ability, it's on the to-do list, and
         it'll be done by the time the first non-GPO/RENCI racks ship in April.
         (Which of the options in that table are you planning to go with? Or
         will this be a campus-by-campus decision? If the latter, which will
         you recommend? We prefer (C), which seems safe enough if the racks
         are behind a campus firewall, which we assume they will be.)
    <IB> This is a campus-by-campus decision. We can deal with either B or C.
         If there are not enough public IP addresses, we have a proxy
         solution. If there are enough, they can be used as is.
    <JB> 10. In the IP Address proxy options in the table in section 2.1, at the
         top of page 5 do the proxies expose all ports or just ssh? 
         (Experimenters may want to run TCP or UDP services on their VMs, and
         allow users to connect to those services via the Internet.)
         Answer: The plan is to add this ability, it's on the to-do list, and it'll
         be done by the time the first non-GPO/RENCI racks ship in April.
         ISSUE: Which of the options in that table are you planning to go with? Or
         will this be a campus-by-campus decision? If the latter, which will you
         recommend? We prefer (C), which seems safe enough if the racks are behind
         a campus firewall, which we assume they will be.
    <IB> This is a campus-by-campus decision. We can deal with either B or C. If 
         there are not enough public IP addresses, we have a proxy solution. If 
         there are enough, they can be used as is. 
    <JB> Ok, that sounds good.
         One other question about this: The ExoGENI racks will not expect that they
         have a dedicated IP subnet for these interfaces, which they need to route;
         but will instead expect that they'll connect to an existing IP subnet (or
         a newly-created one, I suppose), which the campus will route, right? (That
         sounds fine; I ask because it came up when we were deploying the starter
         racks in Chattanooga and Cleveland, so it may come up with campuses too.)
    <IB> We don't require an entire subnet. A list of available IP addresses is enough.
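
The choice Ilia describes in question 10 (use public addresses when the campus supplies enough of them, otherwise go through the proxy) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name, the proxy host name, and the starting proxy port are hypothetical, and the sketch does not reproduce the actual options B and C from the design document's table.

```python
def plan_access(vm_names, public_ips, proxy_host, first_proxy_port=2201):
    """Return {vm_name: (host, ssh_port)} for a set of VMs.

    Assumption: the site supplies a plain list of usable public IP
    addresses (no dedicated subnet is required). If the list is long
    enough, each VM gets its own address and SSH on port 22. Otherwise
    every VM is reached through one proxy host on a distinct port.
    """
    if len(public_ips) >= len(vm_names):
        # Enough addresses: use them as is.
        return {vm: (ip, 22) for vm, ip in zip(vm_names, public_ips)}
    # Not enough addresses: fall back to the proxy for all VMs.
    return {
        vm: (proxy_host, first_proxy_port + offset)
        for offset, vm in enumerate(vm_names)
    }


# Hypothetical example values (documentation address range 192.0.2.0/24).
direct = plan_access(["vm1", "vm2"], ["192.0.2.10", "192.0.2.11"], "head.example.org")
proxied = plan_access(["vm1", "vm2"], ["192.0.2.10"], "head.example.org")
```

A real deployment would also need the planned ability to expose extra port ranges, which this sketch leaves out.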
    
    
    • 10a. Do all outbound connections work for all table options (jbs)
      <IB> Not clear about the question
      <JS> I think the question is: "For all the options in Table N (don't have the 
           number handy, but we should cite it), is it the case that there are no
           restrictions on outbound connections?"
      <JS> The original 10a said:
           10a. Do all outbound connections work for all table options
           I clarified that what we were getting at here was:
           For all three options in that table, is it the case that there are no
           restrictions on outbound connections?
           We assume not, but wanted to check.
      <JS> Answer: Correct, there are no restrictions; all outbound connections
           are permitted. (Although some could be blocked if we needed to for
           some reason.)
      <IB> We will not block any outgoing connections on the racks. We cannot
           say anything for the campus.
      
    • 10b. How does the proxy work for OpenFlow? (jbs)
      <IB> I don't think they are related.
      <JC> Proxied IP connections go through the management net, so they
           don't touch the OF switch.
      <JS> I think our concern was: If the FlowVisor is reaching out to experimenter
           controllers through the proxy, does that raise any issues? (Relative to
           the alternative of "the FlowVisor connects to experimenter controllers
           directly" -- which may in fact be what happens, if it's on the head node.)
      <JS> The original 10b said:
           10b. How does the proxy work for OpenFlow?
           I clarified that what we were getting at here was:
           If the FlowVisor is reaching out to experimenter controllers through the
           proxy, does that raise any issues?
           If outbound connections are unrestricted, and performance of the proxy is
           good, then this is probably not an issue. But we wanted to raise the
           question because it's a situation where dataplane traffic uses the
           management network, so if the proxy was expected to only have to handle
           experimenter SSH, that might not be sufficient.
      <JS> This is superseded by 10a and 10c: There are no proxy/firewall
           restrictions, and no performance issues, that are unique to OF/FV.
      <JS> 10b. If the FlowVisor is reaching out to experimenter controllers
           through the proxy, does that raise any issues?
           Answers: If outbound connections are unrestricted, and performance of the
           proxy is good, then this is probably not an issue. We should make sure to
           test this carefully with the initial GPO and RENCI racks, since FlowVisor
           can generate a lot of control traffic.  
      
    • 10c. What is the expected performance bottle-neck for proxying? (jbs)
      <IB> Packet forwarding is relatively cheap at reasonable rates. The
           bottleneck will be the connection to the campus network.
      <JS> This gets to 10c:
           10c. What is the expected performance bottle-neck for proxying?
           Ilia had said "Packet forwarding is relatively cheap at reasonable rates.
           The bottleneck will be the connection to the campus network."
           Just to put some numbers on this, the theory is that the connection to the
           campus network will not be more than 1 Gb/sec, and we think that the proxy
           can go at least that fast?
      <JS> 10c. What is the expected performance bottle-neck for proxying?                 
           Answer: We expect the connection to the campus network to be 1 Gbit or
           less, and that the proxy can go at least that fast.
      <IB> The answer is above - we don't think the head node will be the bottleneck. It will be the campus connection.
      
  • 11. Is ExoGENI software essentially Orca software? How do they differ? (hpd)
    <IB> Same
    <JC> The software is ORCA (and associated stuff like ImageProxy and
         NEuca), but it is configured in a specific way, so we just say
         "ExoGENI" when we're talking about that configuration.
    
  • 12. What happens to ExoGENI racks and/or rack functionality if RENCI suffers a network or service outage? (ahelsing: watch this, but Ilia agreed to make deployment choices we wanted)
    <IB> Should not be affected
    <JC> For 12, 12b.  Also, old actors cannot see new actors.  Currently
         the actor registry uses SSL connections.  If RENCI goes off
         the net then a site or SM will not be able to restart.  AMs/SMs
         that are running will not be able to refresh their lists, so
         they won't accept any new actors.   If the registry issued
         certs (easy with ABAC), then this problem would go away, but
         it would be harder to revoke...
    <JS> This contradicts 12a.
    <AH> 1) Question 12: RENCI is a SPOF in your design, due to the RSpec   
         conversion service and Actor Registry.
         It appears that a couple (minor?) changes would mitigate this risk. Let
         us know if we're off base here.
    
    • 12a. Will the absence of the RSpec/NDL conversion service mean RSpec-related requests will not work? (ahelsing: RSpec converter being duplicated on all racks)
      <IB> Yes. We can host alternative translators in a number of places
           if it is a concern. We can host a translator on every rack if
           needed and configure its SM to talk to that translator. It is
           a simple stateless web-service.
      <AH> 2) RSpec conversion service is a SPOF. (Q 12a from GPO list)
           I think we'd like them to try running it elsewhere as well.
              a) Make the URL a configuration item in racks
              b) Test running it on the head node, to ensure no performance problems
              or library inconsistencies
              c) Consider running a backup version of the service somewhere. GPO?
      <CG> I think Ilia said this service is stateless and there's no issue running
           it on the individual racks.  So i don't see any reason not to just run
           it on the individual racks, unless it's a serious resource hog.
      <JS> This contradicts 12.
           So that's not "no functionality will be affected". :^p  (I don't think it's
           particularly important to call him on this, just mentioning it as a
           warning to us to keep our eyes open. :^)
      <JS> I think we should ask them to have a translator for each SM, unless
           there's a significant cost to that (in which case we should ask them to
           clarify what the cost is).
      <AH> a) Please install the RSpec conversion service on all racks, and make
           the URL for the conversion service be a configuration parameter. Be sure
           to test the load on the rack head node, once this and the OpenFlow
           pieces are running there.
      <IB> No problems with 12 a or b - this is supported today and is a deployment-time decision. 
      <AH> 12a&b is there now? (RSpec converter URL is a config param and actors
           communicate on the rack among themselves fine on restart) Great.
           If you are comfortable with this deployment choice (run the RSpec
           converter on all racks), then please plan on it.
      <IB> 12a is there now because we have a way of statically specifying security
           associations between actors in a config file. The actor registry works on
           top of that filling in whatever is missing. So we can configure the ORCA
           actors in a rack to know about each other statically without relying on the
           registry and they will only learn from the registry about other racks.
      <AH> Sounds great
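
Ilia's description above (statically configured associations between the actors in a rack, with the actor registry filling in whatever is missing) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary shape, and the use of OSError to signal an unreachable registry are hypothetical and are not taken from ORCA.

```python
def known_actors(static_peers, fetch_registry):
    """Merge statically configured peers with registry entries.

    static_peers: dict of actor name -> connection info from a local
        config file (hypothetical format).
    fetch_registry: callable returning a dict of the same shape, or
        raising OSError when the registry cannot be reached.
    """
    peers = {}
    try:
        # The registry only fills in actors that are not configured locally.
        peers.update(fetch_registry())
    except OSError:
        # Registry unreachable: the rack still knows its own actors.
        pass
    # Static entries take precedence over anything the registry reported.
    peers.update(static_peers)
    return peers


def registry_down():
    raise ConnectionError("registry unreachable")


# Hypothetical actor names and addresses.
rack_local = {"rack-am": "local:1", "rack-broker": "local:2", "rack-sm": "local:3"}
offline_view = known_actors(rack_local, registry_down)
online_view = known_actors(rack_local, lambda: {"other-rack-am": "remote:9", "rack-am": "stale:0"})
```

With the registry down, the three rack-local actors still see each other, which is the stand-alone behavior requested in this thread.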
      
    • 12b. What impact will the lack of the ORCA Actor Registry have on racks? (ahelsing: answered questions satisfactorily)
      <IB> Everything will continue running. New actors will not be able to see old actors.
      <AH> 4) Actor registry is a SPOF. (Q 12b from GPO list)
           This is less worrisome. New actors would be cut off. Racks cannot
           restart successfully.
      <AH> 5) The actor registry shows topologies in NDL. Once Ad conversion works
           (GEC13 he says), we should ask them to include a link showing that in
           RSpec as well.
      <AH> b) Please ensure that the 3 Orca actors on a rack can communicate with
           each other after rack reboot without re-talking to the Actor registry.
           I.e., a rack should work as a stand-alone GENI AM even if RENCI is inaccessible.
      <IB> No problems with 12 a or b - this is supported today and is a deployment-time decision. 
      <AH> 12a&b is there now? (RSpec converter URL is a config param and actors
           communicate on the rack among themselves fine on restart) Great.
           If you are comfortable with this deployment choice (run the RSpec
           converter on all racks), then please plan on it.
      <IB> 12b is trivial since the converter service can be run anywhere and its
           location is a configuration parameter for the rack SM.
      <AH> Sounds great
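
Since the converter's location is described above as a configuration parameter for the rack SM, and the service is said to be stateless, one way to picture the deployment choice is the Python sketch below. It is a sketch only: the section name `sm`, the option name `rspec_converter_urls`, the example URLs, and the idea of listing a backup after the rack-local converter are all assumptions for illustration, not ORCA's actual configuration format.

```python
import configparser

# Hypothetical SM configuration: rack-local converter first, backup second.
EXAMPLE_CONFIG = """
[sm]
rspec_converter_urls = http://localhost:8080/convert, http://converter.example.org/convert
"""


def converter_urls(config_text):
    """Read the ordered list of converter URLs from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("sm", "rspec_converter_urls")
    return [url.strip() for url in raw.split(",") if url.strip()]


def pick_converter(urls, is_reachable):
    """Return the first URL for which is_reachable(url) is true.

    Because the converter is stateless, any reachable instance is
    assumed to be interchangeable with any other.
    """
    for url in urls:
        if is_reachable(url):
            return url
    raise RuntimeError("no RSpec converter reachable")


urls = converter_urls(EXAMPLE_CONFIG)
# Simulate the rack-local converter being down.
chosen = pick_converter(urls, lambda url: "localhost" not in url)
```

The reachability check is passed in as a function so the selection logic can be exercised without any network access.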
      
    • 12c. Any other impacts? (ahelsing)
      <IB> Can't think of any.
      
  • 13. What would fail if the rack Orca XMLRPC interface were disabled? What does the Orca XMLRPC feature do? Is it critical to the rack functions or just another way to use it? (ahelsing)
    <IB> It is another way to use it. Nothing would fail, but we would
         like to keep it. It is integral to the actor (SM) so there is
         no way for it to fail independently.
    <JC> We may plan to add some new management functions through the
         XMLRPC interface, so the answer to this question might change.
    
  • 14. Define ORCA AM Delegation to a broker further---is it double delegated? How is it applied for local broker and ExoGENI broker? (nriga)
    <IB> Probably best to refer to
         https://geni-orca.renci.org/trac/wiki/orca-introduction
    <JC> Double-delegated, but this would be site policy under site
         operator control.  A site could reserve resources for local
         use by not delegating them.  For example, they could buy more
         nodes and reserve them for local use.
    
    • 14a. If delegation to broker is a deployment time decision, what is the plan? (nriga)
      <IB> Delegation must occur for things to work, what is decided is
           how much to delegate. I'd say start with 50/50 for compute and
           probably 80/20 for vlans (local/global)
      <NR> Resources at a local rack are delegated to *either* the local
           broker *or* to the ExoSM broker, i.e. the resources *are not*
           double delegated. The original split will be 50-50, i.e. 50%
           of compute resources, vlan tags etc, will be delegated to the
           local broker and 50% to the ExoSM broker. The percentage is
           configurable and each admin can decide on a different split.
           The reconfiguration probably requires changes in a couple of
           configuration files and a restart of some (??) software. Tom
           believes that this might be more complicated since the broker
           has to address problems with existing tickets.
      <AH> 7) ExoSM owns half the racks. We may end up preferring to go direct to
           the racks. (Q 14a)
           We should have them document the process of changing that allocation,
           maybe even try it once, to be sure this isn't terribly disruptive or hard.
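
The split discussed in 14a (each resource goes to either the local broker or the ExoSM broker, never both, with a configurable percentage) can be illustrated with a small Python sketch. The names below are hypothetical, and rounding down in favor of the ExoSM share is an arbitrary choice made for the example, not documented ORCA behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    local: int  # units delegated to the rack-local broker
    exosm: int  # units delegated to the ExoSM broker


def split_pool(total_units, local_percent):
    """Split a countable pool so every unit goes to exactly one broker."""
    if total_units < 0:
        raise ValueError("total_units must not be negative")
    if not 0 <= local_percent <= 100:
        raise ValueError("local_percent must be between 0 and 100")
    local = total_units * local_percent // 100
    return Delegation(local=local, exosm=total_units - local)


def split_vlan_tags(tags, local_percent):
    """Partition VLAN tags into (local, exosm) lists with no overlap."""
    ordered = sorted(tags)
    cut = split_pool(len(ordered), local_percent).local
    return ordered[:cut], ordered[cut:]


# 50/50 for compute and 80/20 (local/global) for VLANs, as suggested above.
compute = split_pool(10, 50)
local_tags, exosm_tags = split_vlan_tags(range(100, 110), 80)
```

Changing the split on a live rack would also have to account for tickets already issued against the old allocation, which this sketch does not attempt.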
      
  • 15. What will the flowspace look like for ExoGENI OpenFlow slivers? If the flowspace is based on VLAN tags, will this still be doable if the OpenFlow switch runs in hybrid mode? (hpd) (this is an adequate first cut answer and the use cases duplicate this - closing)
    <IB> Here are FlowVisor commands (for two ports one vlan slice):
           $ fvctl addFlowSpace 00:c8:08:17:f4:a6:6a:00 10 "in_port=23,dl_vlan=151" "Slice:ilia2=4"
           $ fvctl addFlowSpace 00:c8:08:17:f4:a6:6a:00 10 "in_port=24,dl_vlan=151" "Slice:ilia2=4"
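
The two commands above follow one pattern per port, so a slice spanning several ports on one VLAN could be scripted. The Python sketch below only builds the command strings in the same form as Ilia's example; the helper name is hypothetical, and the permission value 4 is simply copied from the commands above.

```python
def flowspace_commands(dpid, priority, ports, vlan, slice_name, permission=4):
    """Build one fvctl addFlowSpace command string per switch port."""
    return [
        (
            f"fvctl addFlowSpace {dpid} {priority} "
            f'"in_port={port},dl_vlan={vlan}" '
            f'"Slice:{slice_name}={permission}"'
        )
        for port in ports
    ]


# Reproduces the two-port, one-VLAN slice shown above.
commands = flowspace_commands("00:c8:08:17:f4:a6:6a:00", 10, [23, 24], 151, "ilia2")
for command in commands:
    print(command)
```

The strings are printed rather than executed, so they can be reviewed before being run against a FlowVisor.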
    <JC> The term "hybrid mode" is not well-defined.
    <AH> 11) The use of OpenFlow vs VLANs, and the capabilities of the switches,
         seems an open and messy question. Josh/Niky/Nick/? need to follow up
          probably.
             - implications of hybrid mode
             - way to do an OpenFlow onramp
             - options instead of a NOX controller per VLAN
             - ways to use OpenVSwitch to do clever things
             - ...
    <JS> This is what I expected, and should work fine, although we haven't
         personally tried it much. We could, when we get bamboo up and running again.
    <JS> Does he mean that we haven't defined it well, or agreed on a definition,
         or something? Or that we don't know for sure what *IBM* is going to implement?
         I think we have a good definition of "hybrid mode", although the
         definition is different on HP and NEC switches, than on what we think the
         IBM switches will do. But if he just means that we don't yet know for sure
         what IBM is going to do, then yes.
    <JS> Do you mean that we haven't defined it well, or agreed on a definition, or    
         something? Or that we don't know for sure what *IBM* is going to implement?
         I think we have a good definition of what we think we mean by "hybrid
         mode", although the definition is different on HP and NEC switches, than
         on what we think the IBM switches will do. But if you just mean that we
         don't yet know for sure what IBM is going to do, then I agree that this is
         an area with some question marks.
    <NB> The definition is exactly the same, for this switch, between NEC and
         IBM (same hardware, same software).  The fact that this switch is
         different from other NEC switches is beside the point.  In an
         OpenFlow world vendors are free to decide what they specifically mean
         by the word hybrid - all that hybrid means is that there is an
         openflow datapath instance and non-openflow instance on the same
         hardware, but how those instances interact is left to the vendor.  The
         two most common implementations of a "hybrid" mode are:
          * VLAN-based hybrid mode - the switch handles all traffic tagged in a
            certain VLAN with instructions from the openflow instance.  This often
            puts limitations on VLAN and QoS handling within the openflow instance.
          * Port-based hybrid mode - you literally just "slice" the switch so
            that it behaves as if it were more than one switch.  Traffic is
            divided between non-openflow and openflow instance based on what
            port it comes in on or goes out on.
         In both cases transitioning the boundary between the openflow and
         non-openflow datapaths is an implementation detail left to the vendor.
         The IBM switch actually supports both modes currently, but the
         non-openflow datapath in hybrid mode is incapable of anything more
         than L2 MAC learning.
         It's probably also worth mentioning that on this switch, while you can
         create your openflow instance with an id of 1 to 16 possibly leading
         you to believe that you could create up to 16 openflow instances, you
         cannot - you can only create 1 - if you make a new instance, it will
         replace the old one (per NEC).
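
The two hybrid modes Nick describes differ only in what decides which datapath handles a frame. The Python sketch below shows that distinction and nothing more: the names are hypothetical, the port-based case looks at the ingress port only (the description above also mentions the egress port), and it does not model any particular vendor's switch.

```python
from enum import Enum


class Datapath(Enum):
    OPENFLOW = "openflow"
    NON_OPENFLOW = "non-openflow"


def classify_vlan_hybrid(vlan_id, openflow_vlans):
    """VLAN-based hybrid: the frame's VLAN tag selects the datapath."""
    if vlan_id in openflow_vlans:
        return Datapath.OPENFLOW
    return Datapath.NON_OPENFLOW


def classify_port_hybrid(in_port, openflow_ports):
    """Port-based hybrid: the ingress port selects the datapath."""
    if in_port in openflow_ports:
        return Datapath.OPENFLOW
    return Datapath.NON_OPENFLOW


# The same frame (port 23, VLAN 151) can land on different datapaths
# depending on which hybrid mode the switch implements.
by_vlan = classify_vlan_hybrid(151, openflow_vlans={151})
by_port = classify_port_hybrid(23, openflow_ports={1, 2})
```

This is why a flowspace keyed on VLAN tags has to be checked against the mode a given switch actually supports.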
    <IB> We must be talking about different switches.
         The BNT switch we tested, which is in our specs, is a 10G/40G 48-port switch.
         NEC does NOT have it yet - I spoke to them about it.
         The NEC switch at GPO is not a BNT switch, since it is 1G/10G and BNT told
         me they do not have a 1G/10G implementation yet.
    <NB> The BNT is an NEC PF 5820.  My comments apply to that switch (not
         other NECs that implement different hybrid modes), and of course to
         hybrid mode in general (just to make sure we were all on the same page
         about what was generally available).
    <IB> The BNT switch I tested cannot do VLAN based OpenFlow (yes, you actually
         configure one VLAN to be OpenFlow, but a VLAN is used only as a port-grouping
         mechanism; its tag has no meaning). The only mode the BNT switch supports is
         port-based separation (and right now all ports have to be on that vlan; hybrid
         mode is coming).
    <JS> So, just to (try to) close the loop on this, all I originally wanted to
         address was Jeff's comment
    
           15.  The term "hybrid mode" is not well-defined.
    
         and my belief is that we do in fact understand what "hybrid mode" means,
         even if we're still not entirely on the same page about whether the IBM
         switch is in fact a NEC PF 5820, or something else.
         Jeff, do you think there's still an open issue here about hybrid mode not
         being well-defined, or are you happy?
    <JC> I was simply observing that there is no specification for "hybrid mode".
         I just want to be sure we define our terms.  We're all in agreement
         about that, right?  Nick said something about "different hybrid modes".
         As I recall, the original question was:
         15. What will the flowspace look like for ExoGENI OpenFlow slivers? If
         the flowspace is based on VLAN tags, will this still be doable if the
         OpenFlow switch runs in hybrid mode?
         I responded "hybrid mode is not well-defined" because I did not understand
         the second part of the question.  If the question is still live, could I ask
         you to restate it more concretely?  There seems to be some concern behind it,
         and I'm not sure what that concern is.
         More broadly, we're still trying to figure out what the implications are of
         BNT's hybrid mode, and how to use it.  I said something about that in my e-mail
         to this list on 1/12.  I said:
         With a better hybrid mode we might be able to stitch and use OpenFlow at the same
         time, but this kind of "real" (to me) hybrid mode is not in the foreseeable roadmaps
         for switch vendors.  The weak support for hybrid mode may turn out to be a pretty
         deep problem.  We're still working through the implications.   For example, Ilia has
         pointed out that we're not sure whether a controller can touch any traffic that enters
         and/or exits a non-OF-aware circuit provider or RON, since ports facing those networks
         weren't planning to be in OpenFlow mode.  But that's a separate issue.
    <NB> The answer in this case, given different hybrid modes, is generally thus:
         * If a switch implements a port-based hybrid mode (basically
         splitting the hardware into two switches along physical boundaries),
         then the openflow datapath is usually fully capable (with whatever the
         ASIC supports anyhow), which means that you can slice on VLAN tag in flowvisor
         * If a switch implements a VLAN-based hybrid mode (where the
         discriminant to which software path controls the forwarding is the
         VLAN tag) then generally the openflow datapath is *not* permitted to
         match on or modify VLAN tags, which means that you cannot slice on
         VLAN tag in flowvisor.
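
    A minimal sketch of the two hybrid modes described above, for illustration
    only. The names HybridMode, select_datapath and can_slice_on_vlan_tag are
    hypothetical and do not correspond to any switch or FlowVisor API; the rule
    encoded is only the general tendency stated in the thread, not a guarantee
    for any particular vendor.

```python
from enum import Enum
from typing import FrozenSet


class HybridMode(Enum):
    PORT_BASED = "port-based"
    VLAN_BASED = "vlan-based"


def select_datapath(mode: HybridMode, in_port: int, vlan_id: int,
                    of_ports: FrozenSet[int] = frozenset(),
                    of_vlans: FrozenSet[int] = frozenset()) -> str:
    """Return which instance handles a packet: 'openflow' or 'legacy'."""
    if mode is HybridMode.PORT_BASED:
        # Port-based: the ingress port alone decides the datapath.
        return "openflow" if in_port in of_ports else "legacy"
    # VLAN-based: the VLAN tag is the discriminant between the two paths.
    return "openflow" if vlan_id in of_vlans else "legacy"


def can_slice_on_vlan_tag(mode: HybridMode) -> bool:
    """General rule from the discussion (an assumption, not a vendor spec)."""
    # In VLAN-based mode the tag is already consumed to pick the datapath,
    # so the OpenFlow instance generally may not match on or modify it.
    return mode is HybridMode.PORT_BASED


if __name__ == "__main__":
    for mode in HybridMode:
        print(mode.value, "-> slice on VLAN tag:", can_slice_on_vlan_tag(mode))
```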
    <IB> For the switches we have picked, the hybrid mode is port-based, so we should
         have no problems slicing on VLAN tag. I tested it on the existing implementation
         (all ports had to be in OpenFlow mode).
    <NR> And just to close the loop on this, this was exactly our concern; whether the FV will be
         able to see/modify VLAN tags when the switch is running in hybrid mode.
         It sounds like the switches that are provisioned for the racks will support port-based
         hybrid mode and thus will allow you to create slivers in the FV based on VLAN tags.
         However, it is probably good to keep this in mind when you are talking with the vendor
         to ensure that this will be possible in the new firmware that will support the hybrid mode.
    <JS> Hmm, really? We haven't tried this, but I had thought that something like
         this would work: Say you've got a VM server, which provisions a VM to each
         of two experimenters, and gives each of them a virtual dataplane interface
         on a different VLAN, such that packets leaving the VM are tagged with that
         VLAN; but those virtual interfaces share a physical interface, which is
         connected to a hybrid-mode switch...
         ...Ah, ok, right: So if you want to slice by those VLAN IDs in FlowVisor,
         that port on the switch has to be an *access* port, in an OpenFlow-
         controlled VLAN, so it doesn't strip off the VLAN tags as they come in,
         and sends the tagged packets off to the FlowVisor. This is what you get
         for free with a pure OpenFlow switch; and you can do it on an NEC IP8800,
         but we think you *can't* do it on an HP. (And I don't think we've tested
         performance on the NEC, have we?)
         Anyway, probably not relevant to ExoGENI, with port-based hybrid mode;
         apologies for the tangent. :^)
    <NB> Just to continue this tangent for one email more...
         The HPs can do this, it's what they call aggregation mode.
    <JS> Rephrasing the question: If the general model is that the flowspace
         is sliced based on VLAN tags, will this still work if the OpenFlow
         switch runs in hybrid mode?
         Answer: Yes, this will work, because in port-based hybrid mode, each
         port will either be OpenFlow-controlled or not, and the OF-controlled
         ones will not be part of a VLAN, they'll just be part of a datapath.
         (Note that this might *not* work, however, in VLAN-based hybrid mode,
         because then you've got VLAN tags within a VLAN. This can probably be
         made to work, but it may require additional/different configuration.
         Shouldn't be an issue, but we should keep it in mind in the unlikely
         event that something changes from what we expect about how the switch
         does hybrid mode.)
    <IB> Our BNT/IBM switches will support port-based hybrid mode. We think it
         will work.
    
  • 16. In figure 1 (ExoGENI Rack overview), how do you expect the Dataplane link to Campus OpenFlow network to be set up and configured? Will it require manual setup? Are there any implications? Do we expect FOAM to be used to request and approve these OpenFlow connections? (hpd) (this is an adequate first cut answer and the use cases duplicate this - closing)
    <IB> Affirmative on FOAM. We assume it is a connection (10G or
         downconverted to 1G) to some OF-enabled campus switch.
    <JC> This link is optional, and it doesn't have to go to an OF
         network.  It can be used to pipe campus VLANs into the switch,
         for interconnection with slices under OpenFlow control.
    <JS> I'll poke at the details of this more in my use case follow-up.
    
  • 17. Is Internet2 Dynamic Network System (DYNES) supported? (hpd)
    <IB> DYNES uses ION (dynamic circuit service on Internet2). ION is
         supported because we support OSCARS - the software behind ION.
    
  • 19. Stitching support:
    • 19a. Confirm ION/Sherpa/OSCARS all comes through the RENCI SM? (ahelsing)
      <IB> Yes
      
    • 19b. Can I connect to racks without the RENCI SM? (ahelsing: but see g)
      <IB> If you have an external stitching tool.
      
    • 19c. RENCI SM and other racks coordinate via ORCA private interfaces? (ahelsing)
      <IB> Yes
      
    • 19d. What resources are allocated to the ExoSM? (jbs)
      <IB> See 14a plus all the intermediate network providers (LEARN,
           BEN, NLR, I2, ANI, NOX etc)
      <JC> 'Allocated' is a strange term to use.  "visible" would be a
           better term.
      <JS> I think there's a difference between "visible", in the sense of "the ExoSM
           is aware of them", and "allocated" or "delegated", in the sense of "ExoSM
           manages them, and the local SM doesn't".
      
           Side note: Is there any chance we can reconcile terminology here, or are
           we going to be talking about "GENI AM, by which we mean ORCA SM", and
           "ORCA AM, which is not a GENI AM at all", and so on, for the rest of this
           project? :^\
      <AH> We should have them document the process of changing that allocation,
           maybe even try it once, to be sure this isn't terribly disruptive or hard.
      <JS> I think we can assume that they'll document this, plan to try it out,
           and pester them if they don't document it. I think we're set here.
      
    • 19e. Can an experimenter go to racks separately for compute and then to the ExoSM just for the links? (hpd)
      <IB> This mode is not supported
      <JC> This is not supported BECAUSE it means that an AM could be
           asked to operate on the same slice by two different SMs/controllers,
           and ORCA does not support this.  (It is a limitation we probably
           should not have.)
      
    • 19f. Who is writing the Internet2 AM? (hpd) (RENCI wrote the code already - closing)
      <IB> The code is already there, we need a physical connection (at
           StarLight would be best).
      
    • 19g. Does Single rack manifest expose external VLAN tag so I can stitch? (ahelsing: they'll do this by GEC13)
      <IB> We'll work on it. We can add an external-facing port as part
           of an internal slice.
      <AH> 1) They do not currently support stitching resources you get from a
           single rack. (Q 19g)
           We should ask them to:
             a) expose the VLAN tag they have allocated in the manifest
             b) accept a VLAN tag allocated elsewhere in their request, and use it if
             it is available (else fail)
           I think this is something we want to get them to do. Definitely (a),
           even better (b). Ilia said they would work on it. I want to press for it
           to be done by GEC13 and in the initial racks.
      <AH> 19g) Please support stitching of rack resources by GENI tools for this 
           year's racks - ideally by GEC13.
           Specifically:
           - More important: Support a request to a rack for compute resources and
           a VLAN out, where the resulting manifest specifies the allocated VLAN,
           such that this can be stitched to the next aggregate in the network.
           - Less important: Accept a VLAN tag in a request for a VLAN that the
           next aggregate in the network has allocated, and try to use it at that
           rack if it is available (failing the request if it is not available is
           expected at aggregates that do not support VLAN translation).
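
      A small sketch of the two behaviours requested in 19g, under stated
      assumptions: the rack tracks its free VLAN tags as a set, and the
      manifest is modelled as a plain dict. The names allocate_vlan,
      build_manifest and VlanUnavailable are hypothetical and are not part
      of ORCA or the GENI AM API.

```python
from typing import Dict, Optional, Set


class VlanUnavailable(Exception):
    """Raised when no suitable VLAN tag can be allocated."""


def allocate_vlan(free_tags: Set[int], requested: Optional[int] = None) -> int:
    """Allocate a VLAN tag, honouring a requested tag if one is given."""
    if requested is not None:
        # Case (b): a tag allocated elsewhere is requested; use it or fail,
        # since no VLAN translation is assumed at this aggregate.
        if requested not in free_tags:
            raise VlanUnavailable(f"VLAN {requested} is not available")
        tag = requested
    else:
        # Case (a): pick any free tag; it is later exposed in the manifest.
        if not free_tags:
            raise VlanUnavailable("no free VLAN tags")
        tag = min(free_tags)
    free_tags.remove(tag)
    return tag


def build_manifest(sliver_id: str, vlan_tag: int) -> Dict[str, object]:
    """Expose the allocated tag so the next aggregate can stitch to it."""
    return {"sliver": sliver_id, "vlan_tag": vlan_tag}


if __name__ == "__main__":
    pool = {1001, 1002, 1003}
    print(build_manifest("example-sliver", allocate_vlan(pool)))
```

      Failing the request when the tag is taken mirrors the expectation in
      the thread for aggregates that do not support VLAN translation.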
      <IB> 19g.1 will be available.
      <AH> Do you think you'll have this circa GEC13? April? September?
      <IB> 19g.1 We will try to get it by GEC13. It's not that much work.
      <AH> Sounds great
      
      <IB> 19g.2 less likely to be available soon (please show me an RSpec request for this).
      <AH> 19g.2: The closest I have is the sample requests to the PG aggregate in
            gcf/examples/stitching/libstitch/samples:
             http://trac.gpolab.bbn.com/gcf/browser/examples/stitching/libstitch/samples/utah-equest.xml?rev=795c50b86faf82f0fa8696d80005424e0b2089af
      
            Assume you were specifying a VLAN tag to the PG AM to stitch to ION:
            Within the stitching extension, at hop 3, you would specify both
            vlanRangeAvailability and suggestedVLANRange of the allocated VLAN tag.
            Presumably you could do something additionally in the <interface>
            element within the <node client_id="left">
            As I said, this is lower priority.
      <IB> 19g.2 we'll see
      <AH> Sounds great
      
  • 20. Authorization: Orca APIs use what? Same as GENI? To what extent is this not exactly the same policies as the GENI APIs? (see TM comment below)
    <IB> Almost same as GENI. Some validity checks are disabled.
    <JC> I think Ilia interpreted this as a question about "Alt-G".
         Maybe it was.  As for the internal APIs: In the ExoGENI
         configuration every ORCA AM will trust every registry-approved
         SM to validate a request before passing it to the AM.  AMs
         only check that the SM is registry-endorsed.
    <AH> 6) Orca authorization (for their private APIs) apparently uses similar
         checks to GENI. (Q20)
         We should ask exactly what they changed, to be sure it isn't worrisome.
         I don't have any real worries here though.
    <TM> This item can be closed. The ORCA APIs will effectively provide the same 
         authorization as the GENI APIs by accepting identity certificates from 
         known and trusted certificate authorities, namely the GPO and ProtoGENI 
         CAs. While the GENI AM API requires credentials, there is no impediment 
         to getting those credentials for any registered user today. It is nonsensical 
         to require ORCA to build out additional infrastructure to require and honor 
         those credentials through their own APIs.
    <JS> Hmm, so I have a question about this: If having a GENI user certificate is
         all you need in order to get an ExoGENI sliver via the ORCA API, does that
         mean that you wouldn't necessarily have a GENI slice that contains your sliver?
         If that's not correct, and you do have a GENI slice: Where does that GENI
         slice come from?
         If that is correct, and you don't have a GENI slice: Is that ok, or will
         it cause other problems? (e.g. with things that assume that any allocated
         resource is part of a sliver, which is part of a slice, which is owned by
         a user -- if there's no slice, that chain may break down.)
    <IB> It is correct that in that case there is no slice as far as SA is concerned. 
         This is where we get into the 'what is a sliver and what is a slice' argument. 
         What ORCA creates are in fact slices, not slivers. We just call them slivers 
         when GENI AM API is invoked. 
         I would say that anyone who cares about this, should use GENI AM API on ORCA SMs
         and these problems go away. You get weaker stitching in that case. Tradeoffs, as usual.
    <JC> We agree on a base principle: ExoGENI will allocate resources only to registered 
         GENI users who have been granted rights by a GENI-approved trust root to allocate 
         resources on GENI.
         Is that enough of an answer to close down this item?
         In the short term, if the only way to get proof that a given user is authorized to 
         allocate resources on GENI is for that user to obtain CreateSliver rights to some 
         slice (any slice) on GENI, then that is what we will do.  Once we have that proof, 
         we can use it in various ways.
         Ideally there would be better ways to get that proof, and so the answer may change 
         if and when better ways become available.
         As for "GENI slice that contains your sliver", well...I think it's a long discussion 
         what that means, exactly.  Please, let's not have that discussion now. 
    TO-ADD MORE LN
    
  • 21. What is the Actor registry used for? Is this an alternative non GENI way to authenticate inter-rack communications? (hpd)
    <IB> Yes, it is a way to manually approve actors joining into ExoGENI. There is no GENI way to authenticate inter-rack communications.
    
  • 22. OpenFlow rack as onramp: when will this be supported? (jbs)
    <IB> Jeff Chase wants it yesterday. Realistically some time this year.
    <JC> There is an MS student (Ke Xu ... Jessie) working in this area
         at Duke.  We are not sure what she can do yet.  We might be
         asking her to try some stuff on the GENI OF resources.   I
         think on-ramp is easy, but OpenFlow has to work.  That's the
         hard part.
    <JS> I've heard the phrase "onramp" a couple of times, but don't know exactly
         what it means. Is it just use case 4? If not, is there a definition somewhere?
    <JC> On-ramp is a stitch between private links owned by two different
         slices, by mutual consent of both slices. It is the moral equivalent
         of slices peering their virtual networks.
    
    • 22a. Are there conflicts between FOAM and Orca mechanisms to create FlowVisor rules? (hpd) This question is being replaced by the new GPO question 29.
      <IB> Hopefully FlowVisor will flag it.
      <JC> No.
      <AH> 9) Orca uses FlowVisor directly, opening up the possibility of conflicts
           between FOAM and Orca.
           The solution would be for Orca to use FOAM, once a pluggable API there
           exists, but it doesn't yet. ''We need to keep an eye on this.''
      <AH> I'm pretty sure it won't, which is a point in favor of ORCA -> FOAM -> FV
           rather than ORCA -> FV + FOAM -> FV.
      <JS> I'm pretty sure FlowVisor won't -- in particular, there's nothing
           fundamentally wrong with creating flowspace rules in FV that describe
           overlapping flowspaces. But you usually don't get what you want,
           especially if "you" are multiple experimenters who aren't even aware of
           each other's slivers.
      
           To my mind, this is a point in favor of having ORCA talk to FOAM, rather
           than having both ORCA and FOAM talk directly to FlowVisor.
      
           That said, it's certainly possible for both ORCA and FOAM to find out the
           flowspace on the FlowVisor, and to use that to avoid allowing people to
           have overlapping flowspace. But they have to actually do that explicitly.
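
      The explicit check mentioned above could look roughly like the following
      sketch. It assumes a simplified flowspace model (a dict mapping a match
      field to a set of allowed values, with a missing field or None meaning
      wildcard). This is not FlowVisor's or FOAM's actual data model, and the
      names overlaps and find_conflicts are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional

# Simplified flowspace: field name -> allowed values (None = wildcard).
FlowSpace = Dict[str, Optional[FrozenSet[int]]]


def overlaps(a: FlowSpace, b: FlowSpace) -> bool:
    """Two flowspaces overlap unless some field has disjoint value sets."""
    for field in set(a) | set(b):
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a wildcard on either side matches anything
        if not (va & vb):
            return False  # disjoint on this field, so no packet matches both
    return True


def find_conflicts(existing: Dict[str, FlowSpace],
                   candidate: FlowSpace) -> List[str]:
    """Return names of existing slivers whose flowspace overlaps candidate."""
    return sorted(name for name, fs in existing.items()
                  if overlaps(fs, candidate))


if __name__ == "__main__":
    installed = {
        "orca-slice": {"vlan": frozenset({100})},
        "foam-sliver": {"vlan": frozenset({200}), "in_port": frozenset({1, 2})},
    }
    request = {"vlan": frozenset({200}), "in_port": frozenset({2})}
    print(find_conflicts(installed, request))
```

      Either ORCA or FOAM would need to read the current flowspace before
      each reservation for a check like this to be useful.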
      
  • 23. A few monitoring items are marked incomplete: Dates & plans? (hdempsey)
    • 23a pubsub event feed to GMOC (is GMOC ok with your plan?): (Chaos is tracking in GST 3369)
      <IB> GEC13
      <CG> Per Jon-Paul, GMOC originally proposed the pubsub model and
           will support it, details TBD.
      <CG> FYI: GMOC has agreed to send someone to tomorrow's monitoring call who
           can talk about the pubsub proposal they made for RENCI.  So we should
           know more by then about what has been proposed for slice monitoring
           data submission (question 23A), what the timeframe is, etc, and ideally
           that will lead to us having a more intelligent opinion about whether we
           like it.
      <CG> 23a. Submitting per-slice/sliver relational data to GMOC:
           This has been discussed a little bit on the monitoring@geni.net list.
           It sounds like both ExoGENI and GMOC can provide support to get an
           approach involving XMPP's pubsub protocol working.  GPO will work with
           GMOC to make sure that the per-slice/sliver data which is stored can
           be used across experiments with ExoGENI and non-ExoGENI pieces.
           So i think we are all set on this question.
      <IB> I think we're all on the same page here.
      <CG> I want to follow up again on these to make sure we are working from   
           the same assumptions.  If ExoGENI and GMOC can negotiate an approach
           using pubsub or direct nagios communication, and both sides can do the
           respective work to get that interface running, that's fine with GPO.
           Our desire here is just to have some useful data be submitted from each
           rack to GMOC (or polled from each rack by GMOC) reasonably often.
           However, the interface which already exists is the XML-over-HTTPS data
           submission API developed by GMOC.  This API is active and usable for
           time-series (operational measurement) data right now, and GPO and GMOC
           are working to make sure it will be ready for relational data (e.g. slice
           metadata) within the next couple of months.  ExoGENI racks will need to
           interface with these APIs as a minimum offering.
           Again, if ExoGENI and GMOC do the work between you to support something
           you both like better, that's very likely to be a fine substitute.
           Otherwise, the XML-based API is a "least common denominator" solution,
           and RENCI should submit data using it.
      <IB> My main concern is GMOC's work so far has been outside of the GENI I&M 
           framework which may or may not be a good idea. I'm attempting to bring 
           everything under one roof by using the XMPP bus that will also be used for 
           GENI I&M to submit GMOC-relevant data and see if it flies.
      <HD> Ilia,
           This is not going to work.  The GENI I&M framework is not fully implemented 
           yet,  and may not be for considerably longer than it takes for us to field 
           the racks this year.  The GMOC interface predates the I&M framework, and has
           already been in use in the mesoscale for well over a year.  No GMOC-relevant 
           operations data should be submitted via I&M, which is specifically for 
           experimenter-relevant data.  The GMOC has been part of the I&M project in order
           to make sure that it was possible to distribute operations data into the I&M 
           framework if there was demand for that among experimenters.
           It may be that the GMOC evolves their interface to be more like the I&M interface 
           for simplicity and ease of programming, especially for the aggregate providers.  
           However, we can't count on that for Spiral 4.
      <CG> To clarify: we have no objection to the use of XMPP per se --- GMOC's
           generic interface for data submission uses XML sent via HTTPS, but, if
           data is collected or sent some other way, that's fine.  As Heidi said, we
           just want to see operational data transmission from each rack to GMOC's
           operational monitoring database during this spiral.  I believe using
           the existing data submission API is the most straightforward way to do
           that this year.  However, if the anticipated benefits of XMPP outweigh
           the extra work, and the work can be done this spiral, that's fine.
      <IB> There are enough commonalities between what GMOC wants and what the 
           experimenters want that their work and I&M will converge.
           By March we should have an XMPP bus with GENI authn/authz available 
           (as part of our IMF project with Harry) via which we should be able 
           to make data available to GMOC and at the same time make it available 
           to anyone else with proper GENI credentials.
      <CG> I can think of two concerns about using this approach in the short term:
       1. "Proper GENI credentials" sounds to me like an experimenter being
          able to get access to his own experiment's data.  That's not the
          same thing as operational monitoring data [1], which isn't going
          to be per-sliver, but rather might contain some amount of metadata
          about all slivers, and information which is not per se about slivers.
      <IB> You're assuming experimenters only need access to data from their own 
           slices. I disagree. There are circumstances where an experiment in 
           GENI means looking at other experiments.
      <CG> Do you have an implementation for allowing a non-experimenter
          operations group like GMOC read-only access to broader monitoring
          data using a GENI credential?
      <IB> Working on it.
      <CG>  2. On the GMOC end, there needs to be code to use this XMPP interface,
          acquire data, and put it into some operational database at
          http://gmoc-db.grnoc.iu.edu/.  If this is going to be done using
          a new interface rather than an existing one, someone will need to
          write that code.
      <IB> We have example code that can get data from Pub/Sub. 
      
    • 23b thing with U Alaska from individual VMs: VMI would be nice to have, but is not critical for rack design, so i think we are satisfied with what we know here. (Chaos)
      <IB> Need to ask them
      <CG> Question 23B says "thing with U Alaska from individual VMs".  What does
           this question mean, and who asked it?  I am pretty sure it was not me,
           though a lot can happen in two days
      <CE> I believe that RENCI wants to use the U. Alaska "virtual machine
           introspection" software to provide monitoring of what's happening
           inside individual VMs (based on my quick read of the design doc).
      <CG> Ah, thanks Chip.  That's helpful, and, indeed, i see this in section
           4.3 of their proposal: <<Proposal text not enclosed here>>
           To me, this sounds useful if they can do it, and not critical if they
           can't.  Who put it on our list of design review questions, and what is
           your concern about it?
      <HD> Aaron put it on our list and Ilia had no more information about it at
           this point--I think Aaron was just pointing out that the information
           about it was incomplete in the document, which is true.   I've talked
           to Brian from U of A a few times about the VM introspection software
           and think he would be a good collaborative addition to the team if he
           and Ilia work this out.  As you say, it won't be critical if they don't.
      <AH> I put it on the list, and I just wanted a date. They said, in not so
           many words, 'we want to use this'. So I wondered when.
      <VT> The VMI project has a milestone to demonstrate VMI in a Eucalyptus
           cluster environment at the March GEC.  They are working with Renci on
           getting this technology into Eucalyptus clusters that federate using
           the Orca framework (to be demonstrated in July).
      
    • 23c Nagios interface to GMOC (is GMOC ok with your plan?): (Chaos, GST 3369)
      <IB> I don't know if GMOC is fully OK with it, but we prefer Nagios to the homegrown solution.
      <CG> We don't yet know what will be doable for GMOC.  Mitch McCracken,
           the GMOC staffer who maintains the time-series data submission API,
           will be on our monitoring call tomorrow afternoon.  If you or Jonathan
           Mills or someone else from RENCI would like to be on that call and talk
           in more detail about what you'd like to do, what GMOC would need to do
           to support it, and why you prefer it, that would be a good next step.
           Mitch is new to maintaining the API and coming up to speed, so we won't
           make any decisions on the spot, but it would be a good forum for sharing
           information and understanding a bit better what you'd like to do.
           Let me know if you need more information about the call --- i know at
           least Jonathan has attended it before.
      <CG> With GMOC's Mitch's permission, i asked RENCI to send someone to the
           Friday call to talk about operational time-series/event? data submission
           (question 23C).  Mitch's short answer was that he probably wants something
           more centralized than whatever RENCI is proposing, but he's interested
           in understanding more about what RENCI actually wants, and so am i,
           so hopefully they will show up and talk about it, and, again, we can
           use that to figure out whether we like their answers.
      <IB> Jonathan Mills (who was present at the review) also attends the monitoring calls.
           He is in charge of modifying Nagios to our needs.
      <JM> Yes, and I am planning to attend the next ExoGENI monitoring call.
      <CG> At the monitoring call, Jonathan and Mitch agreed that it would
           be easy for RENCI to submit data from Nagios via the GMOC
           time-series data submission API.  They have started working
           on this on monitoring@geni.net.  I am satisfied that there is
           general agreement about what to do.
      <CG> 23c. Nagios interface to GMOC:                                               
           This has also been discussed on monitoring@geni.net, and the consensus
           here is that ExoGENI and GMOC will work together to write a stub
           which sits on each rack's Nagios aggregator, and submits information
           from Nagios's status.dat file to GMOC via some data exchange format
           (probably the GMOC data submission API, but if ExoGENI and GMOC prefer
           something different, that is fine).  This sounds like it should not be
           a lot of work beyond what has already been done to get Nagios working
           on the racks, and it's already being worked on.  So that's great too.
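
      As a rough illustration of the parsing half of such a stub, the sketch
      below reads blocks from a status.dat-style file. It assumes the usual
      Nagios layout of "blocktype {" followed by key=value lines and a
      closing "}". The function name parse_status_dat is hypothetical, and
      the submission step to GMOC is deliberately left out because the data
      exchange format had not been settled in this thread.

```python
from typing import Dict, Iterable, List, Tuple

Block = Tuple[str, Dict[str, str]]


def parse_status_dat(lines: Iterable[str]) -> List[Block]:
    """Parse 'type { key=value ... }' blocks into (type, fields) tuples."""
    blocks: List[Block] = []
    block_type = None
    fields: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if block_type is None:
            if line.endswith("{"):
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            blocks.append((block_type, fields))
            block_type = None
        elif "=" in line:
            # Split on the first '=' only; values may contain '='.
            key, _, value = line.partition("=")
            fields[key] = value
    return blocks


if __name__ == "__main__":
    sample = [
        "hoststatus {",
        "\thost_name=worker1",
        "\tcurrent_state=0",
        "\t}",
    ]
    for kind, data in parse_status_dat(sample):
        print(kind, data)
```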
      <IB> I would suggest using the same XMPP pubsub mechanism as is used for
           manifests. They will also have access to the browser interface in Nagios.
      <CG> As long as each rack is submitting its own operational time-series data
           directly to GMOC, i think whatever mechanism is easiest for Jonathan
           and Mitch is fine.  Since operational data might be used to help during
           an outage, we do want to make sure as much data submission as possible
           continues to work during an outage.
      <JM> I have one thing to add, as an alternative way of getting the information
           out of Nagios itself, which is to directly query the Livestatus broker.
           Livestatus is a Nagios broker module which can be queried in various ways.
           Broker modules are loaded into Nagios when the daemon launches, and thus
           they have direct access to its internal memory tables.  Queries in this
           manner are the fastest because no intermediate action occurs (for instance,
           writing to status.dat is not necessary; neither is writing to a SQL db with NDO).
           Because it is reading object status directly from Nagios's memory, the results
           are always 100% up to date......no time delay.  The broker module is already
           installed on any Nagios installation that I set up, because it is a required
           component of Check_MK.  It can be queried from either a TCP or Unix socket.
           Details can be found here:  http://mathias-kettner.de/checkmk_livestatus.html
           While this method of "getting at the data" has lots of upsides, it could require
           a rethinking of how the pubsub model would fit.  It necessarily shifts us from
           parsing/translating a file on disk (status.dat) to having to actively query something.
      <IB> Querying/polling won't be a problem. This sounds like an interesting approach.
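The Livestatus approach <JM> describes can be sketched as follows. This is a minimal illustration, not code from any rack: it assumes the broker is listening on a Unix socket, and the socket path in the usage note is a hypothetical placeholder that varies by installation. The request format (a `GET` line, header lines, then a blank line) follows the Livestatus query language; the helper names `build_query` and `query_livestatus` are made up for this sketch.

```python
import json
import socket


def build_query(table, columns):
    """Build a Livestatus request for the given table, asking for JSON output."""
    lines = [
        "GET " + table,
        "Columns: " + " ".join(columns),
        "OutputFormat: json",
    ]
    # A blank line terminates the request.
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, table, columns):
    """Send one query over the Livestatus Unix socket and decode the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(build_query(table, columns).encode("utf-8"))
        # Half-close the connection so the broker knows the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # With OutputFormat: json the reply is a list of rows, one list per object.
    return json.loads(b"".join(chunks).decode("utf-8"))
```

A poller on the rack might call `query_livestatus("/path/to/nagios/rw/live", "services", ["host_name", "description", "state"])` on a timer and forward the rows to whichever submission mechanism is agreed on; the path shown is an assumption.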
      <CG> I want to follow up again on these to make sure we are working from   
           the same assumptions.  If ExoGENI and GMOC can negotiate an approach
           using pubsub or direct nagios communication, and both sides can do the
           respective work to get that interface running, that's fine with GPO.
           Our desire here is just to have some useful data be submitted from each
           rack to GMOC (or polled from each rack by GMOC) reasonably often.
           However, the interface which already exists is the XML-over-HTTPS data
           submission API developed by GMOC.  This API is active and usable for
           time-series (operational measurement) data right now, and GPO and GMOC
           are working to make sure it will be ready for relational data (e.g. slice
           metadata) within the next couple of months.  ExoGENI racks will need to
           interface with these APIs as a minimum offering.
           Again, if ExoGENI and GMOC do the work between you to support something
           you both like better, that's very likely to be a fine substitute.
           Otherwise, the XML-based API is a "least common denominator" solution,
           and RENCI should submit data using it.
      <IB> My main concern is GMOC's work so far has been outside of the GENI I&M 
           framework which may or may not be a good idea. I'm attempting to bring 
           everything under one roof by using the XMPP bus that will also be used for 
           GENI I&M to submit GMOC-relevant data and see if it flies.
      <HD> Ilia,
           This is not going to work.  The GENI I&M framework is not fully implemented 
           yet,  and may not be for considerably longer than it takes for us to field 
           the racks this year.  The GMOC interface predates the I&M framework, and has
           already been in use in the mesoscale for well over a year.  No GMOC-relevant 
           operations data should be submitted via I&M, which is specifically for 
           experimenter-relevant data.  The GMOC has been part of the I&M project in order
           to make sure that it was possible to distribute operations data into the I&M 
           framework if there was demand for that among experimenters.
           It may be that the GMOC evolves their interface to be more like the I&M interface 
           for simplicity and ease of programming, especially for the aggregate providers.  
           However, we can't count on that for Spiral 4.
      <CG> To clarify: we have no objection to the use of XMPP per se --- GMOC's
           generic interface for data submission uses XML sent via HTTPS, but, if
           data is collected or sent some other way, that's fine.  As Heidi said, we
           just want to see operational data transmission from each rack to GMOC's
           operational monitoring database during this spiral.  I believe using
           the existing data submission API is the most straightforward way to do
           that this year.  However, if the anticipated benefits of XMPP outweigh
           the extra work, and the work can be done this spiral, that's fine.
      <IB> There are enough commonalities between what GMOC wants and what the 
           experimenters want that their work and I&M will converge.
           By March we should have an XMPP bus with GENI authn/authz available 
           (as part of our IMF project with Harry) via which we should be able 
           to make data available to GMOC and at the same time make it available 
           to anyone else with proper GENI credentials.
      
      <CG> I can think of two concerns about using this approach in the short term:
       1. "Proper GENI credentials" sounds to me like an experimenter being
          able to get access to his own experiment's data.  That's not the
          same thing as operational monitoring data [1], which isn't going
          to be per-sliver, but rather might contain some amount of metadata
          about all slivers, and information which is not per se about slivers.
      <IB> You're assuming experimenters only need access to data from their own 
           slices. I disagree. There are circumstances where an experiment in 
           GENI means looking at other experiments.
      <CG> Do you have an implementation for allowing a non-experimenter
          operations group like GMOC read-only access to broader monitoring
          data using a GENI credential?
      <IB> Working on it.
      <CG>  2. On the GMOC end, there needs to be code to use this XMPP interface,
          acquire data, and put it into some operational database at
          http://gmoc-db.grnoc.iu.edu/.  If this is going to be done using
          a new interface rather than an existing one, someone will need to
          write that code.
      <IB> We have example code that can get data from Pub/Sub.
      
  • 24. Rspec support questions:
    • 24a. When will you support GENI v3 RSpecs - part of the GEC13 completion? (hpd)
      <IB> That's the goal. The differences are not that significant.
      
    • 24b. When will you support what RSpec conversions? Can you send sample manifests and advertisements? When can we test?
      <IB> Will send separately. Testing can be done now (Luisa has).
      <AH> 3) Ilia offered sample manifests. We should ask for those - to start
           checking that they include what we expect. (Q 24b)
      
  • 25. Have you tested performance of a single management node with a full load of running software (FV and OpenStack/Euca head and GENI services and monitoring, etc.)? Or is FV on a separate VM?
    <IB> Everything but the OpenFlow components. It's not much of a
         load. Supporting FV is still an open question in terms of performance
         needed.
    <HD> To Nick: Do you have enough info yet to know whether FV on a VM in 
         ExoGENI rack will be OK at this point?  Have you given Ilia any 
         more info about what FV needs for acceptable performance in GENI?
    <NB> I have no further information from Ilia, and he has not requested any
         information from me in reference to this.  My understanding is that he
         is aware that they have not characterized the FlowVisor workload and
         still need to do so.  (I also have other concerns about software
         interaction and compatibility as expressed on the mailing list).
    
  • 26. On layer 2 dataplane connectivity testing: Do you envision a long running slice where we can allocate VLANs to test as needed? What happens if the AM is unreachable/down?
    <IB> If AM is unreachable, you can't provision a VLAN. When the
         VLAN is up it should stay up regardless of AM status.
    <JS> We should bang on this a little more, and understand whether our
        monitoring stuff will in fact be in a slice, or a non-GENI thing. (I don't
        feel strongly about it, and don't recall now if we concluded that we
        preferred one or the other.)
    <CG> 26. Dataplane reachability testing:
       We think it would be a good idea to have two types of tests to go
       with the two types of VLANs:
       * Where the ExoGENI AM is used to provision a VLAN, we'd like to see
         a test which stands up a VLAN, verifies that it can be used,
         and reports to monitoring on whether that entire system (which
         includes the AM, of course) is healthy.  I believe you discussed
         doing something like this already: does what i just said sound
         similar to what you have in mind?
    <IB> I think so. A simple reachability test would not be difficult to do,
         but currently is not a high priority.
    
    <CG> * Where an ExoGENI rack is going to be connected to a static
         (long-standing) VLAN outside of the rack, e.g. to the shared
         mesoscale VLANs or to a longstanding L2 connection to non-rack
         resources at a particular site, we'd like to see a static test
         interface on each VLAN which could be used to verify connectivity.
         It would be ideal if the test interface were non-OpenFlow-controlled
         on the rack, so that it could be used entirely to test "is this
         link up?".  Does this seem reasonable?
    <IB> yes, but not with the current version of the switch which is OpenFlow
         all-or-nothing. When we have the hybrid mode towards the end of the year
         this should be possible.
    <CG> Good point --- if everything on the dataplane switch is
         OpenFlow-controlled, then a non-OF-controlled testpoint is not possible.
         However, i think it would still be possible to place a static test
         interface on a static VLAN which reports to e.g. FOAM.
    <NB> The external VLAN connection can be established and tested regardless
         of the hybrid-ness of the switch.  If the goal is merely to establish
         whether a link on an external interface is up or down, this can be done
         within or without openflow, regardless of the state of the switch (the
         ability of a switch to determine port up/down status, electrical,
         protocol, or administrative, is independent of the openflow
         implementation).  If you are truly determined to require a
         non-openflow port to test (electrical?) connectivity, you can do that
         with the current BNT firmware as well (ports can be configured to be
         non-openflow, they just have no real features beyond that of a
         standard L2 learning switch, but those would suffice for this
         purpose), but there's no particular reason why this port can't be
         openflow controlled.
    <IB> I think Chaos wanted to have an interface with an assigned IP address
         internal to the switch that can be used for L3 reachability testing.
    <CG> L2 reachability testing, really.  Sorry if i've been unclear: the goal
         here is to have some tools to detect problems with shared VLANs.  If an
         ExoGENI rack participates in a core VLAN, but can't reach other things
         on that VLAN, then an experimenter might want to know something is wrong
         before trying to provision something attached to that VLAN on that rack.
         In practice, you don't want to provision too many different test
         resources, and there is a tradeoff between having a test which is most
         similar to an experiment ("if this test works, it is very likely that
         this resource is healthy enough for an experimenter to use") vs. having
         a test which gives you more information about what is wrong ("this test
         does not depend on OpenFlow, so, if this test fails, it tells us there is
         a connectivity problem caused by something other than OF on this rack").
         I think it's fine with us to have a test which runs in a sliver, but
         if we're testing a static VLAN to which an experimenter would connect
         by e.g. reserving resources using FOAM, the resource test should use a
         FOAM sliver.  An advantage to setting up a test interface which doesn't
         depend on a local sliver, is that people can use that interface for
         reachability testing without having to maintain that local sliver.  But,
         in fact, all of our core testing uses slivers somewhere --- there's no
         good way around that.  It just has to be feasible for that sliver to be
         used for frequent/automated testing.
    <IB> The easiest way to test l2 reachability is to configure l3 addresses on
         two endpoints of a Vlan. With traditional switches you can  configure an
         interface on a Vlan and assign it IP address (internal to the switch).
         I thought that is what you meant. We do it here periodically (manually)
         between our switches.
    <NB> Sure, I would use FlowVisor or FOAM for the first test (no sliver
         required to know what ports are up, or whether the datapath seems to
         be available at all), and SNMP for the second (also no sliver
         required).
         If you want this information centrally (via GMOC or something) we
         should probably offer a read-only FOAM monitoring API, since getting
         the information via the existing admin API seems like a bad idea, but
         that's a trivial problem.
    <CG> Sorry i never got back to this.  This kind of approach is fine with us         
         for testing of static VLANs, and is indeed what we had in mind.
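The static-VLAN check <IB> describes (configure L3 addresses on two endpoints of a VLAN and see whether one can reach the other) could be automated roughly as below. This is a sketch under stated assumptions: it relies on the Linux `ping` command with its `-c` (count) and `-W` (reply timeout in seconds) options, and the VLAN-to-address mapping passed to `check_vlans` is hypothetical. The function names are invented for illustration.

```python
import subprocess


def build_ping_command(address, count=3, timeout_s=2):
    """Return the argument list for a Linux ping of the far-end test address."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), address]


def vlan_reachable(address, count=3, timeout_s=2):
    """True if the far-end test address on the VLAN answers ping."""
    result = subprocess.run(
        build_ping_command(address, count, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with status 0 when at least one reply was received.
    return result.returncode == 0


def check_vlans(test_points):
    """Map each VLAN ID to a reachability result.

    test_points: dict of VLAN ID -> far-end test address (hypothetical values).
    """
    return {vlan: vlan_reachable(addr) for vlan, addr in test_points.items()}
```

A failed check only says that the far end did not answer; as <CG> notes, whether the cause is OpenFlow-related depends on whether the test interface is OpenFlow-controlled.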
    

  • 27. We want to share the final ExoGENI rack parts list and rack diagram (when you finish it) on the GENI web site. OK with you? (Luisa Nevers, see notes below)
    <IB> It's available here:
         https://docs.google.com/document/d/1hzleT6TNmiDb0YkkgqjXxPFJ37P4O6qApLmXgKJHBZQ/edit
    <LN> Collected remaining information for the parts list.  See:
         http://groups.geni.net/syseng/wiki/GENI-Infrastructure-Portal/GENIRacks#ExoGENISpecifications
         Checked on wiring rack diagram and found from Brad Viviano that the diagram will be
         available after the GPO rack is assembled.
    <LN> Email exchange on Feb 13 to gpo-infra included a wiring diagram which is attached as file
         named Rack-diagram-wiring.xls
  • 28. How does a site admin control resources which have been allocated to ExoSM and are controlled centrally by RENCI?
    <IB> ORCA configuration files. There is an actor configuration file
         (XML) and a resource description file (scary NDL-OWL).
    
  • 29. In the design review, ORCA indicated that they started a NOX instance to communicate with FlowVisor to communicate ExoGENI OpenFlow requests. Nick Bastin said he would like to replace this with an API to FOAM. Nick and Ilia promised to follow up. This question should also address conflicting FlowVisor requests captured in Q 22a. (jbs)
    <JS> Then there was 22a:                                             
         22a. Are there conflicts between FOAM and Orca mechanisms to create
         FlowVisor rules?
         I had said "To my mind, this is a point in favor of having ORCA talk to
         FOAM, rather than having both ORCA and FOAM talk directly to FlowVisor."
         Nick also liked this idea; have you guys talked about it any further?
    <IB> The implementation we have today talks to FlowVisor. We prefer this
         method because it bypasses the need to create OF RSpec. Also, based
         on discussions with Nick, he is unwilling currently to modify FOAM
         RSpec to what we need.
    <JS> Hmm, I think there may be some confusion about this: I don't think that
         Nick was proposing that ORCA should write and submit rspecs to FOAM, but
         rather that ORCA would talk directly to a FOAM API. He plans to write a
         plugin API, which others (or he) can then use to write plugins to talk to
         FOAM via arbitrary custom APIs; most immediately, he says he'd be happy to
         write a custom FOAM API for ORCA. But rspecs don't enter into it at all in
         any case.
         The custom ORCA - FOAM API could look pretty much however you want. For
         that matter, it could be identical to the FlowVisor XMLRPC API -- but
         going through FOAM means that you can be aware of other FOAM slivers, and
         that FOAM is aware of ORCA-created slivers, for free.
    <JB> 29. Will ORCA talk directly to FlowVisor, or to FlowVisor via FOAM?          
         require some FOAM development work, but Nick is eager to work on this.
         ISSUE: Nick and the ORCA folks should talk about timeframes, to make sure
         that Nick can do what the ORCA side needs, in time for them to use it, but
         he doesn't think it would be a problem to get it done very quickly.
    <IB> The implementation we have today talks to FlowVisor. We prefer this method 
         because it bypasses the need to create OF RSpec. Also, based on discussions 
         with Nick, he is unwilling currently to modify FOAM RSpec to what we need. 
    <IB> That's not to say I'm blaming Nick - he has his reasons. FlowVisor at this 
         time presents what appears to me the most stable and easy to program interface. 
         As the code and RSpec evolve in the future we can revisit this question.
    <JS> Hmm, I think there may be some confusion about this: I don't think that
         Nick was proposing that ORCA should write and submit rspecs to FOAM, but
         rather that ORCA would talk directly to a FOAM API. He plans to write a
         plugin API, which others (or he) can then use to write plugins to talk to
         FOAM via arbitrary custom APIs; most immediately, he says he'd be happy to
         write a custom FOAM API for ORCA. But rspecs don't enter into it at all in
         any case.
    <IB> I may be mistaken about the current state of FOAM. I thought it supported 
         GENI AM API, and that requires RSpec. Is there another interface and what is it? 
    <NB> The point here is there *can* be another interface (there are already
         4 API interfaces, soon to be 5 when we add GENI AM API v2), so there's
         not really any problem adding one for exogeni (or, alternatively, just
         making one that looks like the flowvisor XMLRPC interface).
    <IB> I don't have a problem with that when it becomes available. 
    <JS> The custom ORCA - FOAM API could look pretty much however you want. For
         that matter, it could be identical to the FlowVisor XMLRPC API -- but
         going through FOAM means that you can be aware of other FOAM slivers, and
         that FOAM is aware of ORCA-created slivers, for free.
    <IB> I expect a fairly hard partitioning between label spaces that FOAM operates in 
         and ORCA operates in, so this may not be a serious issue, but it may be worth discussing.
    <NB> The intention would be for FOAM to be aware of the sliver URNs if at
         all possible.  Obviously this wouldn't be possible out of the box if
         we merely emulated the FV XML-RPC API, but if we added an extra
         parameter or two to CreateSlice it would be relatively easy
         information to provide.
    <IB> I don't think I followed that. Which sliver URNs? 
    
    <JS> We (GPO) think this would be worth spending a ten minute phone call to
         talk about, and we'd like to help facilitate that (whether you end up
         going this route or not -- we just want to promote communication); any
         chance you'd be available later this afternoon? Or maybe Monday?  
    <IB> That's fine. Next week is better. Please use Doodle.                            
    <NB> The URNs for the slivers which have associated FlowVisor slices.
         (Such that if one asked FOAM, it would know about all the slivers
          which had resources allocated…failing that, at least user URNs).
    <JS> I've created http://www.doodle.com/r58q8esy4vawn7q8 as a Doodle poll
         suggesting times this afternoon and tomorrow; I think Ilia and Nick are
         essential, and anyone else who's interested could listen in. (And anyone
         else who Ilia thinks is essential from the ExoGENI team -- Ilia, let me
         know if you have anyone else in mind.)
    
    Note: The message below is in response to a much earlier comment from Ilia,
    which stated:  <IB> I don't have a problem with that when it becomes available.
    
    <NB> When which becomes available?  We're looking for some input here - is
         the path of least resistance to emulate the FV XML-RPC API, or should
         we develop something more specialized for exogeni?
    <HD> To Nick: I've seen several emails on the exogeni-design list, and it 
         sounds like you, Josh and Ilia are planning a teleconf this week.  
         Do you think you'll be able to put enough effort into the discussions 
         to work out a rough agreement for a solution this week?
    <NB> I believe this is conflating two issues:
         1) They have a separate software stack (their AM, not NOX) which
         communicates with FlowVisor outside the visibility of FOAM to allocate
         virtualized resources
         2) They have suggested using NOX to provide baseline control of their
         openflow resources for non-openflow experimenters.  I think many
         people (myself included) believe this is a bad idea, and we should
         explore precisely what they are trying to accomplish and how to best
         execute that.
       
         We are planning on discussing issue (1) this week, but there has been
         no further mention of issue (2).
    <JS> Mm, I'd forgotten about that part. Perhaps because I feel like "they're
         planning to provide a service that we think is a bad idea" seems like less
         of a problem (we can ask them to turn off that service) than "they're not
         planning to do something that we think they'll need to do".
         Nick, should we be more concerned about this than we are?
    <NB> I believe so.
         My general understanding is that because they can't run this switch in
         an "acceptable" hybrid mode (for varying values of whatever that is)
         they have identified a need to still be able to provide their
         non-openflow network service to experimenters, so they're going to run
         a controller to manage these slices and they've chosen NOX.  This is
         at least my understanding.
    <JS> Ah, that sounds plausible. (I had lost the context here, and was thinking
         that they were talking about an even more optional service, which people
         who wanted to use OpenFlow could use if they didn't want to run their own
         controller; but in fact I think you're right, what they're talking about
         is something so that people who don't care at all about OpenFlow don't
         have to touch it at all.)
    <NB> However, I believe they should run in this mode all the time - using
         hybrid datapaths only creates problems and limits functionality for
         the openflow portion of the network.  
    <JS> Could be; that sounds like a conversation we can have with them when the
         switch firmware allows hybrid mode. If the experiment with pure OF mode
         has gone well enough until then, it might be an easy sell -- and so that's
         some incentive to see the pure-OF way work well.
    <NB> That being said, I definitely
         don't think they should be running NOX to provide transport for
         non-openflow users - not particularly because this has anything to do
         with NOX so much as the fact that often when people say "run NOX" they
         mean run one of the NOX sample applications, which are not production
         applications (and certainly don't provide the functionality we would
         desire).  Given Ilia's apparent lack of interest in writing any code
         to work with the openflow side of their rack, I highly doubt that
         they're intending to write a custom app for NOX to facilitate their
         use case.
    <JS> That all sounds likely to me. This isn't something that we have a lot of
         experience with, because we've mostly been focused on supporting
         experimenters who (a) want to do nifty things with OpenFlow, and are thus
         writing their own controllers; or (b) just want a learning-switch
         controller, but are doing things on such a small scale that NOX 'switch'
         or 'pyswitch' is good enough for their needs.
         You mentioned "production applications"; do you have any insight into what
         *would* fit that bill, but not cost a lot of money? (Or time spent
         convincing a vendor (like BigSwitch or NEC) to donate a production
         controller, or whatever.) Is there in fact a better off-the-shelf solution
         than NOX 'switch'?
    
    Note: Comment below after meeting with Ilia, Josh and Nick.
    <JS> We did! And the main thing that we concluded is that Ilia doesn't think
         there's time before GEC 13 to change how ORCA talks to FlowVisor, so we're
         not going to try to throw together an ExoGENI-specific API to FOAM
         immediately. Instead:
    
         * Nick will continue to work towards the planned API plug-in layer for
           FOAM, which he thinks will be done by GEC 13.
         
         * The first two racks (RENCI and BBN) will have ORCA talking directly to
           FlowVisor, with the understanding that this may have some issues.
         
         * We'll aim to shift to a model where ORCA will talk directly to FOAM,
           after GEC 13. (Or, talk more between now and then about whether this is
           a good idea -- Jeff raised some questions about this, and we generally
           agreed that what we all really want is for FlowVisor to have only one
           administrative master, but that it isn't fundamentally important whether
           that master is FOAM or ORCA... So if ORCA can do everything we need it
           to do -- including managing flowspace for non-GENI resources that aren't
           part of the ExoGENI rack -- then perhaps it makes sense to not run FOAM
           in an ExoGENI rack at all. But we think that this isn't a short-term
           solution, because it would require a way for those non-GENI resources to
           interact with ORCA to tell it what flowspace they wanted, and I don't
           think we have even an idea about what that would be.)
         
         So, with that, I think question 29 from the original list is answered: In
         the first two ExoGENI racks (RENCI and BBN), ORCA and FOAM will both talk
         directly to FlowVisor; we'll continue to discuss between now and GEC 13
         about how to narrow this down to having only one of them do that; and
         we'll aim to implement a single-master solution soon after GEC 13 (and
         definitely before any additional racks ship).
    
         Sound right? Anything else I missed or otherwise got wrong?
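The "single administrative master" idea, combined with <NB>'s suggestion of adding an extra parameter to CreateSlice so that sliver URNs are known, can be illustrated with a small sketch. Everything here is hypothetical: the class, its method names, and the backend callable are invented for illustration and do not reproduce the FlowVisor XML-RPC API or any FOAM interface.

```python
class SliceRegistry:
    """Hypothetical single front end for FlowVisor slice creation.

    All slice requests pass through one object, which forwards them to a
    backend and remembers which sliver URN (if any) each slice belongs to.
    """

    def __init__(self, backend_create):
        # backend_create: callable(slice_name, controller_url) that performs
        # the real slice creation; supplied by the caller (assumption).
        self._backend_create = backend_create
        self._sliver_by_slice = {}

    def create_slice(self, slice_name, controller_url, sliver_urn=None):
        """Create a slice once and record its optional sliver URN."""
        if slice_name in self._sliver_by_slice:
            raise ValueError("slice already exists: " + slice_name)
        self._backend_create(slice_name, controller_url)
        self._sliver_by_slice[slice_name] = sliver_urn

    def slivers(self):
        """Return a copy of the slice name -> sliver URN mapping."""
        return dict(self._sliver_by_slice)
```

Because every request goes through the same registry, a second caller asking for an existing slice name is rejected rather than silently overwriting rules, which is the kind of conflict raised in Q 22a.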
    
    
  • 30. External resources (like the mesoscale) have to be manually configured to be available. We should make a list of the resources that could connect and we want connected, and get them to build those in advance. Like the mesoscale.
    <HD> Configuration for meso-scale is worth pursuing; need answers to Josh's use cases first.
    <IB> There will be a hard partitioning between resources controlled by the ExoGENI SM and 
    other resources.  Changing the partitioning is pretty straightforward and not very disruptive.
    <TU> We are initially planning to hand 10 VLANs that we have provisioned to our FrameNet 
    endpoint and 1 VLAN that is provisioned to our ION endpoint to the ExoGENI team.  Initially, 
    they will control these VLANs with the ExoGENI SM.  We can give them more VLANs later, 
    assuming it is easy.  We only plan on provisioning a single OpenFlow VLAN to the ExoGENI rack 
    in our lab (1750) to start. 
    <TU> Currently we are thinking about provisioning extra special use OpenFlow VLANs from each 
    mesoscale campus.  We will have to let the ExoGENI team know how to reach these VLANs once we 
    actually provision them.  This should be as simple as provisioning VLANs down to the rack and 
    letting the rack know the VLAN IDs.
    <TU> We are also thinking about having mesoscale campuses have a set of non-OF controlled 
    VLANs.  I think we should just be able to tell the ExoGENI team the VLAN ID and the endpoint 
    (FrameNet, ION, etc) that the mesoscale campus uses, and then the ExoGENI SM should be able to 
    connect to that VLAN.
    

Nick Bastin's Questions

Network:

  • B-1. Why not use FOAM everywhere? (hpd)
  • B-2. Why not run pure OpenFlow and slice on VLAN in FlowVisor w/translation at the rack edge?
  • B-3. How is IP space managed within the rack environment - can experimenters request more / specific IP space? (hpd) (Duplicate of question 10b)
  • B-4. The OpenFlow control channel looks to be extremely throughput constrained.
  • B-5(1). Does the switch not support the ENQUEUE action at all, or does it just not support all the openflow packet-queue structures?
    <BV> B-5. Is there an IPMI connection from the head node to the
         management switch? If so I think that makes for 45 management
         switch ports used.
         o Worker node.  IBM x3650 with Virtual Media Key.  1 port for
           vKVM/IPMI/etc, 2 ports for 1GbE traffic.  Total of 30 (assuming
           10 worker nodes)
         o Head node.  IBM x3650 with Virtual Media Key.  1 port for
           vKVM/IPMI/etc, 8 ports for 1GbE traffic.  Total of 9.
         o iSCSI enclosure.  Redundant controllers, each with 2 ports.
           Total of 4.
         o Juniper VPN appliance.  1 WAN port, 1 LAN port.
         o PDU.  1 port (For 208V based PDU's)
        How many ports in total get used on the management switch will
        depend on the connectivity from each campus.  If for example
        we can ONLY get 1 1GbE connection from campus the total will
        be 47 (46 from above, plus campus into the management switch).
        That would be the worst case situation and leaves us 1 open
        1GbE port on the switch.
    <NB> Ok, I was just working off of the table on page 4 of the design
         document that has 44 ports used on the management switch.  It's a
         little hard to reconcile that table with figures 1 and 2, as well as
         the text.  Figure 1 has a red line connecting the management switch
         "to campus layer 3 network", and figure 2 has a line connecting the
         management switch to the Juniper SSG5 (which is not in figure 1), and
         no other connection to an outside L3 resource.  The text in 2.1 states
         "The connections to the commodity Internet via the campus network is
         expected to serve management access by staff as well as experimenters"
         - I read this to mean that all control-plane access (management and
         experimenter) would be coming in over the SSG5.  So, I guess the new
         question is, is there a direct campus L3 connection to the management
         switch, as well as a connection to the SSG5?  Also, do you really mean
         that the SSG5 is connected twice to the management switch?  (I
         understand how that would work, I'm just trying to figure out if
         that's what you mean)
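<BV>'s management switch port count can be checked with a short tally. The per-device numbers are taken from the list in the <BV> comment; the 48-port switch size is an assumption inferred from the statement that 47 used ports leave 1 open.

```python
# Port counts per device type, from the <BV> comment.
MGMT_PORTS = {
    "worker nodes": 10 * (1 + 2),   # 10 nodes: 1 vKVM/IPMI + 2 x 1GbE each
    "head node": 1 + 8,             # 1 vKVM/IPMI + 8 x 1GbE
    "iSCSI enclosure": 2 * 2,       # 2 redundant controllers, 2 ports each
    "Juniper VPN appliance": 1 + 1, # 1 WAN + 1 LAN
    "PDU": 1,
}
CAMPUS_UPLINKS = 1   # worst case: a single 1GbE connection from campus
SWITCH_PORTS = 48    # assumption, inferred from "leaves us 1 open port"


def ports_used(ports, campus_uplinks):
    """Total management switch ports in use, including campus uplinks."""
    return sum(ports.values()) + campus_uplinks


used = ports_used(MGMT_PORTS, CAMPUS_UPLINKS)
print(used, SWITCH_PORTS - used)  # prints "47 1"
```

The device subtotal is 46, which matches "46 from above" in the comment rather than the 44 that <NB> read from the design document table.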
    

Rack Configuration:

  • B-5(2). Is there an IPMI connection from the head node to the management switch? If so I think that makes for 45 management switch ports used.
  • B-6. I am concerned that the head node is underprovisioned for all the services it needs to run - 12GB of RAM seems low.
    <BV> We don't have empirical evidence that 12GB of memory won't be
         enough. We felt it was a safe starting value, but ensured there
         are free DIMM slots if we need to expand to 24GB or 36GB.
         Although the cost of 2GB DIMM's vs 4GB's isn't significant,
         when multiplied out to 12-14 sites it was enough that we decided
         to start with 12GB.  If we decide later to move to 24GB, we'll
         expand future racks so they come from IBM that way.
    <AH> 15) They haven't tested the head node, when the FlowVisor and FOAM are
         getting actively used, to check for performance problems. It's unclear
         if there is an issue here or not, but the only real solution appears to
         be to double the RAM - which they can do later if necessary.
    <CG> They can do it later if necessary, but why not do it sooner?  I'm curious
         what the actual numbers involved here are: my personal experience has been
         that RAM is (a) cheap, and (b) always the thing you're short of.  I know
         they said they could send a tech on-site, but, for a new installation,
         12GB of RAM should be something like $150.  There's one head node per
         rack, and how many racks the first year?  Again, i don't know the actual
         numbers or tradeoffs, but i think it's very likely that this is cheap
         and may solve a real problem.  So IMHO they should just do it while it's
         early enough to never have to think about it again.
    <NB> Fair enough.  I would also ask that we revisit the plan for all the
         software on the management node to be installed in the same OS
         instance - I really think this should be a virtualized environment
         (particularly because both FOAM and FlowVisor do not currently have
         RPM package builds).  A shared OS instance will put significant
         constraints on the software to use the same JVM versions, etc., or
         create an integration challenge to create separate environments for
         the software to run in.
    
  • B-7. How is the head node configured - do the services run in their own VMs, or do they need to co-exist on the same OS instance? (jbs)
    <IB> The VM option remains open, however currently we are not seeing any
         software conflicts that would require that. VMs will take some
         performance overhead and they may make it more difficult to
         communicate between some elements of the software stack.
    
         We have already built most of the components on our OS of choice -
         CentOS 6.2 and we're not seeing any conflicts. Despite the fact that
         CentOS/RedHat is not always officially supported, there are usually
         instructions for advanced users on how to build the software that seem
         to work.
    <JS> Aha, ok. It might mean that you have to do more ongoing work to track
         updates to those components, if new versions don't build as cleanly, but,
         as you say, we can revisit if it turns out to be a problem. I think it's
         probably fine to call this closed, but since it was originally on Nick's
         list, I wanted to give him (or anyone else with contrary opinions) a
         chance to chime in before I crossed it off.
    <JS> B-7. How is the head node configured - do the services run in their own            
         VMs, or do they need to co-exist on the same OS instance?
         ISSUE: We (GPO) think it would be better if the head node ran VMs, so that
         the various software that needs to run there can run in a more isolated
         environment, on its preferred OS; but it sounds like that's not how RENCI
         is planning to do it at this point. If you prefer the all-in-one-OS
         approach, can you talk more (maybe fork off a separate thread) about why?
    <IB> OK
    <NB> At the very least we're likely to run into the need to move common
         services (like SNMP) to custom ports, but I'm also concerned about
         finding ourselves in a situation where we have conflicts in required
         JVMs or similar (FlowVisor already trips over some known issues in
         commonly distributed JVMs) or Python versions.
    <IB> We use JREs downloaded from the Oracle site, not shipped with the distro. CentOS 6.2
         seems to be reasonably up to date with Python (2.6.6 is the stock version).
         Which components have SNMP interfaces on them?
    <JS> My summary is that Ilia is optimistic that there won't be any issues,
         Nick is pessimistic that there will be, Ilia has said that they'll
         revisit if they are, and that this is fine with us for now.
    <NB> Both FlowVisor and FOAM will have SNMP interfaces in the medium term.
         The suggested use case for most installations would be that they would
         disable the FV interface and just use the FOAM one if they were
         running both, but that will be more difficult if FOAM doesn't know
         detailed information about everything in FlowVisor.  Also, I'm not
         saying there are necessarily any problems right *now* with JVM/Python
         versions etc, but this will be an ongoing software qualification
         concern when individual components become available with new versions.
    
  • B-8. PDUs are also useful for remote management if a node gets completely bricked (such that IPMI is useless) - I would think that the marginal cost would be more than worth it. (hpd) (We're helping RENCI work on this in the first couple of rack integration efforts. Ticket #3354)
    <BV> IBM doesn't offer switched PDU's with 120V on their standard
         Bill of Material.  The 208V units on their standard BoM are
         switched and monitored.  For the first 2 racks (RENCI and BBN)
         we are sticking with IBM's standard BoM because to use
         non-standard BoM parts means it can't be assembled in the
         factory and has to go to the "Integration Center" which increases
         the lead time.  So for the BBN rack, we won't have switching.
         We hope for other sites that can only support 120V power we
         will be able to identify with IBM a reasonable switched PDU
         they can install.
    <JS> I've forgotten, can we take a 208V unit? If so, then if that would get us
         a switched PDU, then it might be worth doing.
    

Resources:

  • B-9. Why not allow arbitrary bare-metal images? Is this any more dangerous than arbitrary VM images? (hpd) (Duplicate of question S.27)
    <BV> As discussed briefly in the concall, the reason to not allow
         custom bare metal images is twofold.  1) The decrease in
         security because users will have direct access to the bare
         metal network interface which connects to the management switch.
         2) The complexity of creating a bare metal image means the
         user would have to have a system identical to the one inside
         the ExoGENI racks so they could load all the hardware drivers,
         etc.  I don't think we've ruled out the possibility 100% and
         if a user provides a compelling reason for why they need it,
         then we can consider it.  But I think we have enough on our
         plates with the initial deployment without adding this level
         of complexity on day one.
    
  • B-10. Where is the storage for the running instances - on the worker nodes? (hpd)
    <BV> We will have the ability to provide storage either on the
         running worker or via NFS from the head.  Long term plans
         include being able to provision raw iSCSI LUNs from the iSCSI
         unit with a slice and make those available as well.
    
  • B-11. What are the average IOPS available for each VM on a fully loaded (max running VMs) worker node?
    <BV> Each worker has 2 hard drives.  1 146GB 10K RPM SAS and 1 600GB
         10K RPM SAS.  In the case of a VM worker, the OS (CentOS 6)
         will be installed on the 146GB drive and all the VM's storage
         will be installed on the 600GB drive.  In a bare metal install
         the user would have access to both and could use them as they
         saw fit.  The "standard" rating for a single 10K RPM SAS spindle
         is 180 IOPS.  There are 6 drive slots on each worker, we can
         add more spindles, but for each spindle we add, we remove 1
         worker because of the cost (i.e. 9 2.5" 600GB SAS spindles =
         about $4000, or the cost of a worker).
    
         In all the infrastructure designs it was a delicate balancing
         act between available funds and performance.  Our goal being
         to build something that was usable today but extensible for
         the future.  The first 2 racks are our on the job training.
         We fully expect that after these first 2 racks we will tweak
         the hardware configurations with IBM and hopefully have a
         smooth flow from IBM's integration center to the other sites
         for the remaining 10-12.
    
    <NB> This seems optimistic - the latency of a 10k rpm spindle with a 2.5"
         platter is 3ms, and the IBM 5433 (the 600GB drive in question) has a
         4.2ms average read seek time (writes are slower, but we'll be
         optimistic here for the purposes of this discussion), which makes for
         ~139 IOPS (1 / 0.0072).  Of course, neither of these numbers is
         particularly useful if we don't have an idea of the workload - more on
         this below.
    <NB> I've been doing some math on the back of some napkins and I think that
         might be a net positive tradeoff for total VM capacity based on a
         variety of workload calculations (although factoring bare metal into
         this makes that calculus more complicated).  I still have some work to
         do on this, so I'll follow up later with my thoughts.
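
The per-spindle estimate in B-11 can be reproduced with a few lines of arithmetic. This is only a sketch of the simple model used in the thread (service time = average rotational latency + average seek time). The function names and the even split across VMs are illustrative assumptions; the thread does not state the maximum number of VMs per worker, so any VM count passed in is the caller's guess.

```python
def spindle_iops(rpm: float, avg_seek_ms: float) -> float:
    """Estimate random IOPS for one spindle from rotation speed and seek time."""
    # Average rotational latency is half a revolution: (60,000 ms / rpm) / 2.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    service_time_ms = rotational_latency_ms + avg_seek_ms
    return 1000.0 / service_time_ms


def iops_per_vm(rpm: float, avg_seek_ms: float, vm_count: int) -> float:
    """Idealised even split of one spindle's IOPS across vm_count VMs."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return spindle_iops(rpm, avg_seek_ms) / vm_count


# 10K RPM spindle with the 4.2 ms average read seek quoted in the thread.
print(round(spindle_iops(10_000, 4.2)))
```

With the thread's inputs this gives roughly 139 IOPS for the single 600GB spindle, which is the figure NB quotes. Real workloads (sequential reads, caching, writes) will move the number in either direction.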
    
    

Adam Slagell's Questions

Software/Firmware Update

  • S.1 What part of the software stack does ExoGENI take responsibility for maintaining updates? Is there anything they don't? (chaos, based on adam's comment)
    <AS> Sounds like VM/BM images and all the software that comes with
         the racks. I didn't see any gaps or buyer bewares.
    <IB> We will take care of software updates. The only buyer-beware concerns 
         the operation of FOAM - we don't want to be in the business of approving 
         user slices in FOAM and think this needs to be done by GPO or GPO delegate. 
    
  • S.2 Is there an automated update system? If so, how is integrity ensured?
    <AS> Sounds like no.
    <IB> Not at this time. The software is too diverse.
    <AS> Maybe for the system images some sort of integrity verification 
         using digital signatures is feasible.
    <IB> Currently the images for VMs go through such a verification -
         the user submits a URL and a SHA-1 hash of the image they want booted. 
         For bare-metal images if we add filesystem integrity verification, it
         can cover the images locally cached on the head node.
    <AS> 1. Auto update system: There are no plans for an autoupdate system for
         the GENI racks. With a large and complex software stack and many racks 
         at many institutions, this could become problematic to keep up-to-date. 
         The quickest way to a security incident is to have out of date software.
         BTW, isn't there a GENI project (by Justin Cappos I think) that is
         supposed to help make getting secure and reliable updates easy?
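
The URL-plus-SHA-1 check that IB describes for VM images could look something like the sketch below. It assumes the image has already been downloaded to a local path; the function names are hypothetical and this is not the actual ORCA/NEuca code.

```python
import hashlib
import hmac


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path: str, expected_sha1_hex: str) -> bool:
    """True if the file's SHA-1 equals the hash the user submitted with the URL."""
    expected = expected_sha1_hex.strip().lower()
    return hmac.compare_digest(sha1_of_file(path), expected)
```

The same helper could be pointed at the bare-metal images cached on the head node, as IB suggests, by storing a known-good hash for each cached file and re-checking it periodically.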
    
  • S.3 Is there a service guarantee for updates? Say a flowvisor vulnerability is found and a patch made. How quickly can you push out updates?
    <IB> Since none of the GENI software I know runs as root, I think we can be 
         relatively lax about this. I would say 72 hours if it is a straight-forward 
         update that does not require significant reconfiguration and repackaging. 
    <AS> 2. Vulnerability management: Any major system going out needs a plan for 
         monitoring and investigating vulnerability impacts. The more complex the 
         software stack and the more things that depart from a vanilla OS distribution,
         the harder this becomes. You need to (1) be aware of all potential vulnerabilities
         (challenging for a complex software stack), (2) test for exploitability, (3) determine
         impact, (4) test patch or mitigation, and (5) push out a solution all very rapidly.
         The previous comment in #1 really addresses just the last bit, and I see no
         vulnerability management plan into which you could insert it now.
    <JM> There's been talk of several strategies, and no single solution will get it
         all done.  We will all know what the state of OS patching looks like, since 
         I have a Nagios/Check_MK plugin that essentially runs a 'yum check-update'.  
         It does this with the security plugin enabled.  The result is, for each host, 
         we will know how many updates it needs, and how many of those updates are
         security-related.  Of course, this only helps us with the base OS; it cannot 
         address potential vulnerabilities in the GENI-ORCA-OpenStack-Neuca world.  
         Ilia will have to comment on the latter.
         In terms of stopping SSH brute force attacks, I think denyhosts is a good way 
         to go.  But our sshd is tcpwrappered by default anyway (set up by kickstart).  
         This kind of attack won't be an issue.
    <AS> There are also the VM and bare-metal images, right?
    <IB> Regarding the VM images - since users are allowed to boot their own, 
         the main weapon we have there is the ability to match resources to slices 
         and shut down misbehaving resources. 
         Bare-metal images will be restricted to a small selection (size 1 initially). 
         The problem with frequently changing/updating those is that it makes repeatable 
         experimentation more difficult, e.g. if an experimenter expects a certain image 
         with certain versions of kernel, drivers and software and we continuously move that mark. 
         The GPO will need to weigh in on what is more important - repeatability or the 
         potential impact on security, because this is an important tradeoff we're talking 
         about here.
    <AS> Good point.
    <BV> Also, I'd like to add, based on conversations yesterday, neither VMs
         nor bare-metal servers will have direct internet access.  Our plan is
         to proxy all public IP traffic through the headnode at each site, using
         iptables.  This gives us the opportunity to shut down a site very quickly
         if there is a report of a problem, but keep the problem system running 
         (VM or bare metal), so we can analyze what is going on and resolve the 
         issue with the experimenter.
    <AS> I'm not sure what you are saying exactly here. Are they private IPs that 
         are NATed, are they going through an application layer gateway? What do 
         you mean by not direct?
    <JS> Hmm, my impression is that if we wanted to create a new bare-metal image,
         we wouldn't necessarily delete old one(s), but rather that the list would
         grow over time.
         Ah, but that may not have been what you meant: Indeed, if we update an
         existing image to fix security problems, that would potentially have an
         impact on repeatability. I think we'd need to at least identify that the
         image had changed (e.g. by changing its name), so an experimenter would be
         aware of that, and could re-validate that their experiment still produced
         the same results after the change.
         We could also devise some way for the experimenter to capture the
         vulnerable image, so they could run it somewhere else if they felt the
         need. (Or just boot it up on an isolated system of their own so that they
         could look at it, or whatever.)
    <AS> I was assuming you would add new images that you support over time, but 
         existing ones would get security patches as time goes by. Of course you'd
         want to enumerate them and specify how they differ, perhaps in /CHANGELOG.txt 
         or something.
    <IB> I don't know that we can guarantee that a particular 'security' patch will 
         not affect the performance of one or other of the kernel subsystems thus 
         affecting repeatability.
    <JS> Ja; I think "track, notify, and archive" is the right approach here.
    <SS> I'd turn this around -- we know that many changes will affect performance, 
         sometimes in only minor ways, but sometimes in major ways.
         If space is not a problem, I'd plan to keep every old version of standard 
         images around. The naming convention is just a detail, but
         OS-version-exogeni-current might be the name that gets you the latest 
         patched/supported image, but the logs would show you the precise version 
         you got (OS-version-exogeni-x.y or -yyyy-mm-dd).
         If an experimenter is running a slice that is on a closed (virtual) network, 
         e.g. configured so that only a fixed set of well-known machines can reach it, 
         then it is possible to bring up even old images with security vulnerabilities 
         and repeat earlier test runs or collect new data using those older images.
         If that same experimenter wants to run on a slice that provides "service" 
         to some larger, open set of users (on campuses or wherever), then they are 
         going to appreciate having automatic support for getting the latest OS patches 
         into the base images.
         I'm going to guess that we will see both sorts of use cases, but more 
         "closed networks" first.
    <AS> Sounds like a reasonable balance.
    
    <CG> > The GPO will need to weigh in on what is more important - repeatability
         > or the potential impact on security, because this is an important
         > tradeoff we're talking about here.
         So, my two cents: in our lab, we do try to apply OS updates to our
         experimental images, the same way we would to any other nodes we run.
         I think having an update schedule which applies to experimental OS
         images for which standard patches are available, as well as for servers,
         is a good idea.  If you can flag your images with metadata saying when
         they were last updated, so that experimenters know, so much the better.
         And if it's possible to keep old images around in case someone has a
         special-case need for one, again, that's a feature.
         I agree that it's a tradeoff, but i think doing periodic updates of
         images is the better bet.
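
The naming convention SS proposes above (a "-current" alias that resolves to the newest dated image, with old versions kept and the precise version logged) can be sketched as follows. The image names, the ISO date suffix, and the function name are all hypothetical; no such catalog format has been agreed in this thread.

```python
from datetime import date
from typing import Iterable


def resolve_image(requested: str, catalog: Iterable[str]) -> str:
    """Resolve '<os>-<version>-exogeni-current' to the newest dated image name.

    Explicit names (e.g. '...-exogeni-2012-01-15') are returned unchanged so
    that an experimenter can pin an old image for repeatability.
    """
    names = list(catalog)
    suffix = "-current"
    if not requested.endswith(suffix):
        if requested not in names:
            raise KeyError(requested)
        return requested

    prefix = requested[: -len(suffix)] + "-"
    dated = []
    for name in names:
        if name.startswith(prefix):
            try:
                stamp = date.fromisoformat(name[len(prefix):])
            except ValueError:
                continue  # not a dated variant of this image
            dated.append((stamp, name))
    if not dated:
        raise KeyError(requested)
    # The resolved name is what should be written to the logs.
    return max(dated)[1]
```

Logging the resolved name rather than the alias is what lets an experimenter later discover that the image changed and re-validate results, in line with the "track, notify, and archive" approach.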
    
  • S.4 Will there be someone actively monitoring for vulnerabilities on the entire software stack, or is it best effort (e.g., we update all the problems we are told about by someone else).
    <IB> At this point there is no dedicated person. However our ACIS group
         (members of which are part of the operations staff) are usually aware
         of latest vulnerabilities as part of their data center responsibilities.
    <AS> It may be worth doing google alerts on Bugtraq for all the software.
    <IB> I'll ask our ACIS folks what they do today. 
    

Log Collection & Management

  • S.5 What do you log and how?
    <IB> ORCA actor state transitions and handler execution outputs. We will log 
         entire manifests to make them available to GMOC. The manifest will be the 
         main vehicle for correlating substrate to slivers.
         There are syslogs on individual hosts as well. Other elements (FlowVisor, 
         FOAM) have their own logs.
    <AS> 4. Logging: I think remote logging is a must for integrity and availability. 
         This should be for syslogs and AM transactions that are needed to maintain 
         accountability of actions. Some additional integrity checking on the hosts
         is nice, but icing on the cake.
    <JM> The remote logging infrastructure is mostly complete.  There is a central 
         server, in a protected VLAN deep in the heart of RENCI, running rsyslog 
         on CentOS 6.2.  It only accepts connections on a high-numbered port, using
         RELP, from control.exogeni.net.  The latter is a forwarder for all logs.  
         We have a simple LogAnalyzer web interface to the central rsyslog box (which 
         is syslog.exogeni.renci.org).  This is protected with SSL, Apache basic auth, 
         using LDAPS to authenticate to ldap.exogeni.net.  What remains to be done here 
         involves making all the nodes in each rack forward their messages appropriately. 
         And lastly, if there are any non-standard logs we need capturing (for instance, 
         OpenStack, Neuca, or ORCA logs), I'll need to create a template for handling them. 
    
  • S.6 Are remote copies logged?
    <IB> Not at this time
    
  • S.7 Do you do anything special on the racks to maintain the integrity of the logs?
    <IB> Not at this time
    
  • S.7.B What about other file integrity checking for config files and critical system files?
    <IB> Has not been considered so far, but I think can be added. 
    <AS> Also useful is minimizing setuid programs or watching for changes 
         to setuid bits.
    <IB> Noted
    <AS> 3. SetUIDs and configuration management: I think that it is good 
         that most things don't need to run as root on the racks, but the 
         number of setuid programs should be minimized too. Once you have the 
         list, I think xCat has decent configuration management utilities to 
         make sure security hardening policies like that persist across upgrades 
         and changes. If not, you should have a plan on how to make sure that 
         updates don't move you to a less secure state by modifying configuration 
         unintentionally.
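
One way to watch for changes to setuid bits, as AS suggests, is to compare the current set of setuid files against a recorded baseline after each upgrade. The sketch below is an assumption about how that might be done with plain Python; it is not an xCAT feature, and the function names are hypothetical.

```python
import os
import stat
from typing import Dict, List, Set


def find_setuid(root: str) -> Set[str]:
    """Return paths of regular files under root that have the setuid bit set."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(mode) and mode & stat.S_ISUID:
                found.add(path)
    return found


def diff_against_baseline(current: Set[str], baseline: Set[str]) -> Dict[str, List[str]]:
    """Report setuid files that appeared or disappeared relative to the baseline."""
    return {
        "added": sorted(current - baseline),
        "removed": sorted(baseline - current),
    }
```

A non-empty "added" list after an update is the signal that the update moved the node to a less hardened state and needs review.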
    
  • S.8 Do you log enough to map timestamp/IP/port tuple to a particular slice?
    <AS> Sounds like the information is there, though it may take
         some manual investigation, especially if NAT was involved.
    <AH> 8) They haven't really worked out logging, but mostly hope to just send
         everything to GMOC and be done.
         This is probably just fine. This is essentially 'racks will log to a
         remote Logging API' which is consistent with recent architecture group
         discussions. We just need to (a) ensure we are asking for all the right
         bits of information, and (b) have them at least outline the algorithm
         for going through all those logs to get the information we really need
         (eg, what slice ID used IP X Port Y at time Z?)
    
         We should check more specifically on what is stored on the racks in
         terms of logs, if anything.
    <IB> Manifests have the information. 
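
AH asks above for an outline of the algorithm that answers "what slice ID used IP X Port Y at time Z?". A minimal sketch follows, assuming the manifests have already been flattened into records with an address, a port range, and a lifetime. The record layout and names are hypothetical; the real manifest format is not described in this thread, and NAT would add a translation step before this lookup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class SliverRecord:
    """Hypothetical flattened view of one manifest entry."""
    slice_id: str
    ip: str
    port_low: int
    port_high: int
    start: datetime
    end: datetime


def slices_using(records: Iterable[SliverRecord], ip: str, port: int,
                 when: datetime) -> List[str]:
    """Return the slice IDs whose slivers held ip:port at time `when`."""
    return sorted({
        r.slice_id
        for r in records
        if r.ip == ip
        and r.port_low <= port <= r.port_high
        and r.start <= when < r.end
    })
```

More than one slice ID in the result would indicate overlapping allocations in the logs, which is itself worth flagging.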
    
  • S.8.B What if you bridge some other device into GENI through your AM but hide it behind your NAT? For example, could there be some campus device causing a problem, but show up as one of the IPs on your rack, but not actually be under your control? And in that case, could you determine from your logs the device and what slice it was a part of?
    <IB> In the current architecture this is not really possible. The IP 
         addresses given to the rack are used by rack resources only.
    
  • S.9 Can you easily tell what slices are running on a given rack? How about each node on a rack? (chaos, based on adam's comment)
    <AS> Sounds like that is not a problem
    <IB> Yes, although we need to do better with respect to making 
         this information available in an easier form.
    
  • S.10 How long do you keep local copies of the logs?
    <IB> Depends on the verbosity. Once manifests start getting published
         onto the XMPP bus, this will no longer be an issue, as a separate log
         repository can slurp them up and keep them in one place.
         The syslog logs probably should be configured to go to a central 
         syslog server in addition to having a local copy.
    
  • S.11 Is there a mechanism that could be used to send allocation log information back to the clearinghouse for global policy verification for slices?
    <IB> XMPP bus - we want to use it as the means to make this data 
         available to multiple consumers.
    

Administrative Interfaces

  • S.12 What is the authentication mechanism for the VPN?
    <IB> LDAP + possibly RADIUS slaved to LDAP (for switches)
    <AS> LDAP would be for authorization, but what kind of credential 
         would be used for authentication? Maybe I am missing something.
    <IB> LDAP stores usernames and passwords (as well as groups, which
         would be used to partition rights). RADIUS can read LDAP.
    
  • S.13 Does being on the VPN on one rack get you to the admin interfaces of all the others, or is this one way from RENCI?
    <IB> One way from RENCI 
    
  • S.13.B How does one authenticate to the admin interface (separate from the VPN)? Is it root login?
    <IB> Depends on the device (e.g. a switch vs. a compute node). 
         We opt for sudo whenever possible.
    
  • S.14 Are the credentials used to authenticate to the admin interface different for each rack?
    <IB> This has not been discussed or codified.
    <AS> When architecting this, it would be good to strive for containment. So
         if one unscrupulous person with a GENI rack reverse engineers something, 
         it doesn't give them the credentials they would need to do bad things to 
         other racks. It can complicate initial setups but probably pays off in 
         the long run.
    <IB> Noted
    
  • S.14.B What about within a rack, is the root or admin password the same for each node/device?
    <IB> We tend to use the same password for all worker nodes currently. 
    <AS> I think within a rack, all nodes of the same type could be considered 
         at the same level of trust and treated this way.
    <IB> Noted
    
  • S.15 Is authentication for admins the same whether or not they login through the VPN or SSH into the head node?
    <IB> LDAP will be the back end, so yes.
    <AS> So again, I am confused. LDAP as I have seen it used is just for 
         authorization. There are still SSH keys or passwords or OTP tokens 
         for different accounts. 
    <IB> LDAP stores usernames and passwords. SSH uses PAM on the end hosts 
         to talk to LDAP over an SSL channel. Switches use RADIUS that is slaved
         to LDAP directly.
    <AS> OK, makes sense. Though I presume it is actually salted hashes stored.
    <IB> Yes, the passwords stored in LDAP are not plain-text. Typically an
         MD5 hash is used.
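
For readers unfamiliar with what a salted hash in a directory entry looks like, here is a sketch of the OpenLDAP-style {SSHA} and {SMD5} userPassword encodings (base64 of the digest followed by the salt). That the ExoGENI LDAP server uses one of these schemes is an assumption; the thread only says that an MD5 hash is typical. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
import os
from typing import Optional

# Scheme prefix -> hashlib algorithm name.
_SCHEMES = {"{SSHA}": "sha1", "{SMD5}": "md5"}


def make_hash(password: str, scheme: str = "{SSHA}", salt: Optional[bytes] = None) -> str:
    """Build a salted userPassword value: scheme + base64(digest(password + salt) + salt)."""
    salt = os.urandom(8) if salt is None else salt
    digest = hashlib.new(_SCHEMES[scheme], password.encode("utf-8") + salt).digest()
    return scheme + base64.b64encode(digest + salt).decode("ascii")


def check_password(password: str, stored: str) -> bool:
    """Verify a password against a stored {SSHA} or {SMD5} value."""
    for scheme, algo in _SCHEMES.items():
        if stored.startswith(scheme):
            blob = base64.b64decode(stored[len(scheme):])
            size = hashlib.new(algo).digest_size
            digest, salt = blob[:size], blob[size:]
            candidate = hashlib.new(algo, password.encode("utf-8") + salt).digest()
            return hmac.compare_digest(candidate, digest)
    raise ValueError("unsupported password scheme")
```

The salt means two accounts with the same password get different stored values, which is the property AS is asking about.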
    
  • S.15.B Are the SSH credentials to the head node different for each rack or shared?
    <IB> Same as S.14. I don't know that these two questions are different. 
    <AS> So here I am talking about two separate racks installed at different 
         institutions. Would a password or key that a local admin used to SSH 
         into the head node at University X also let them do the same at 
         University Y?
    <IB> It is likely we will use LDAP groups to partition users such that users 
         are limited to specific racks. Root logins will likely be disabled 
         (and we may disallow 'sudo su -' for most users).
    
  • S.16 How is accountability of actions recorded if there is more than one admin, or is it just a shared root login?
    <IB> We tend to use sudo, so some of the commands and privilege
         escalations are logged.
    
  • S.17 Does the KVM for console access have a network interface that gives remote console access?
    <AS> Sounds like NO.
    <IB> No
    
  • S.18 What devices and interfaces can you see from the VPN interface?
    <IB> All of them. 
    
  • S.18.B Does this differ for those logging in through the head node?
    <IB> No. Head node access is a redundant means to do the same.
    
  • S.19 Would the hosting organization have a different admin interface?
    <IB> No, just a different set of logins with different credentials. 
         Hosting organizations probably will not have VPN access.
    
  • S.20 Is the only authentication mechanism password based, or are two-factor auth or SSH keys used?
    <IB> Right now based on LDAP passwords only. 
    <AS> Oh, so you are using LDAP to distribute something like the /etc/shadow 
         file? So here, we use LDAP just to essentially distribute /etc/passwd,
         but authentication is done through PAM with Kerberos or OTP. Am I 
         understanding this right, that LDAP does both for you, sort of like NIS?
    <IB> Sort of. Except we don't distribute /etc/passwd - PAM talks to LDAP live
         and there usually is a caching daemon that caches the getpasswd entries 
         temporarily.
    <AS> 5. Remote root access: It was not clear whether remote root login was 
         allowed anywhere. I read that sudo was used when possible, but I would 
         hope no sshd_config files allow remote root login.
    <JM> root SSH is disabled by default in our kickstarts 
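
AS's concern about remote root login can be checked mechanically on each node. The sketch below reads the text of an sshd_config and reports the global PermitRootLogin value. It assumes the usual OpenSSH rule that the first value obtained for a keyword wins, and it ignores anything inside Match blocks, so treat it as a first-pass audit rather than a full parser. The function name is hypothetical.

```python
from typing import Optional


def permit_root_login(config_text: str) -> Optional[str]:
    """Return the global PermitRootLogin value, or None if it is not set explicitly."""
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        keyword = parts[0].lower()  # sshd_config keywords are case-insensitive
        if keyword == "match":
            break  # global section ends at the first Match block
        if keyword == "permitrootlogin" and len(parts) > 1:
            return parts[1].lower()
    return None
```

A result of None means the daemon falls back to its built-in default, which varies by OpenSSH version, so an audit would likely want an explicit "no" on every node.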
    
  • S.21 If SSH keys are used anywhere, are they stored unencrypted on any of these racks?
    <AS> I suspect yes with xCat.
    <IB> Yes.
    
  • S.22 If SSH keys are used, are they different for different racks?
    <IB> We will probably generate different keys.
    
  • S.23 If passwordless SSH keys are used, can they be used multi-directionally? For example, if an xCat process needs to use them to do something on a less trusted part of the system, that other piece should not be able to use the same key to ssh back into the xCat manager.
    <IB> xCAT uses only explicitly registered keys, so this can be avoided. 
         However we will disallow node-to-node logins as per:
         http://sourceforge.net/apps/mediawiki/xcat/index.php?title=Disable_node_to_node_root_passwordless_access
    
  • S.24 Do the admin interfaces need to connect back to anywhere initiating outbound connections?
    <IB> Not that I know of. 
    
  • S.25 What is meant by "Since ExoGENI slices have management network access via the commodity Internet, this is the default behavior." on pg 13? (Perhaps you will have explained this by now and can ignore)
    <IB> This simply says that if you don't care about isolated connectivity 
         between slivers, you always have the commodity Internet connecting them.
    

Isolation

  • S.26 Are you tired yet? I am. :-) (chaos, per adam's comment)
  • S.27 What is the vetting process for bare metal nodes?
    <AS> Sounds like no process yet, but there is recognition that we
         don't want bare metal hosts to be able to sniff in promiscuous
         mode and break the nice isolation properties
    <CG>  26. Dataplane reachability testing:
          We think it would be a good idea to have two types of tests to go
          with the two types of VLANs:
          * Where the ExoGENI AM is used to provision a VLAN, we'd like to see
            a test which stands up a VLAN, verifies that it can be used,
            and reports to monitoring on whether that entire system (which
            includes the AM, of course) is healthy.  I believe you discussed
            doing something like this already: does what i just said sound
            similar to what you have in mind?
          * Where an ExoGENI rack is going to be connected to a static
            (long-standing) VLAN outside of the rack, e.g. to the shared
            mesoscale VLANs or to a longstanding L2 connection to non-rack
            resources at a particular site, we'd like to see a static test
            interface on each VLAN which could be used to verify connectivity.
            It would be ideal if the test interface were non-OpenFlow-controlled
            on the rack, so that it could be used entirely to test "is this
            link up?".  Does this seem reasonable?
    <IB> No process yet. 
    <AS> 7. Image vetting: I think a process, or maybe a set of criteria, is 
         needed for vetting bare metal images. What are the requirements? Things
         such as an "inability to sniff traffic in promiscuous mode on the NICs" 
         would fit into such a list.
    <MB> Is the example you propose below an actual proposed requirement or just 
         a for instance? I ask because the capabilities that come immediately to 
         my mind as wanting bare metal seem likely to want to do exactly this.
    <AS> It was being proposed and an example. I think it is desirable to prevent 
         from a security perspective because it provides better isolation of slices. 
    <SS> Most of these are non-controversial (at least to security folks!) but I 
         didn't quite understand a couple points, maybe because I joined the review
         late and will admit to not reading all the messages.
         7. Image vetting: ... are GENI researchers going to be able to sudo root 
         on bare metal images? (I would have presumed yes, but maybe that isn't the model.)
    <AS> I did not presume so because there was talk about state being preserved 
         between jobs/users. If they aren't wiping images between experiments and 
         users have root access, then there is a whole other security issue. 
    <NB> > It was being proposed and an example. I think it is desirable to prevent
         > from a security perspective because it provides better isolation of slices.
         The ability to capture traffic from a promiscuous NIC on a bare-metal image 
         has no impact on slice isolation.  This is very much something that we should allow.
    <AS> It depends. If it allows me to watch traffic on other slices and there is any 
         expectation of privacy, then it does impact a form of isolation. If there is 
         neither an expectation nor a promise of privacy, or switching would prevent one 
         from seeing such traffic even if in promiscuous mode, then it isn't an 
         issue. I don't know the answer to either of those questions, though.
    <NB> The privacy question is a good one, and should be discussed, but isn't
         a factor here.  If you have bare metal, you have exclusive access to
         the switch port and can't capture traffic that belongs to another slice.
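
The disagreement above turns on what a promiscuous NIC can actually receive. The toy model below illustrates NB's argument under a stated assumption: every node sits on its own access port, and each port belongs to exactly one VLAN (one slice). The class, function, and port names are hypothetical; trunk ports, misconfiguration, and switch bugs are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToySwitch:
    """Toy model: every port is an access port in exactly one VLAN."""
    port_vlan: dict = field(default_factory=dict)

    def attach(self, port: str, vlan: int) -> None:
        self.port_vlan[port] = vlan

    def flood(self, ingress_port: str) -> list:
        """Ports that receive a frame flooded from ingress_port (same VLAN only)."""
        vlan = self.port_vlan[ingress_port]
        return sorted(p for p, v in self.port_vlan.items()
                      if v == vlan and p != ingress_port)


def senders_visible_to(switch: ToySwitch, sniffer_port: str) -> list:
    """Ports whose flooded frames reach sniffer_port, i.e. the most a
    promiscuous NIC on that port could capture in this model."""
    return sorted(p for p in switch.port_vlan
                  if p != sniffer_port and sniffer_port in switch.flood(p))


switch = ToySwitch()
switch.attach("slice-a-node1", 100)
switch.attach("slice-a-node2", 100)
switch.attach("slice-b-baremetal", 200)

if __name__ == "__main__":
    print(senders_visible_to(switch, "slice-b-baremetal"))
    print(senders_visible_to(switch, "slice-a-node1"))
```

In this model the bare-metal node in slice B receives nothing from slice A regardless of NIC mode. AS's concern would apply only if the exclusive-port assumption does not hold.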
    
    
  • S.28 Are the bare metal hosts diskless?
    <AS> No, they have 146GB FS for OS and 600GB FS for data. However,
         they are wiped clean and reinstalled from a fresh vetted image
         between allocations. State is gone.
    <IB> We're still debating whether we want stateful or stateless bare-metal
         nodes. Both options are open.
    <AS> The nice thing about stateless is that it is more like a whitelist. If there 
         is state left behind, you have to always wonder if you thought of 
         everything that you need to clean up in between users.
    <IB> This is still TBD and there are advantages to both. 
    
  • S.29 What are the main isolation mechanisms between slices?
    <AS> VM hypervisors or wiped bare-metal systems isolate experiments
         at system level. At the network level, this is done with VLANs.
         The same VLAN won't have slivers from multiple slices.
    <IB> Yes. VLANs have QoS associated with them wherever possible 
         (rate and buffer size limits).
    <AS> 6. Isolation between racks: Isolation between racks is important, 
         especially since these are distributed across the country. Reverse 
         engineering something at one rack should not result in some class-wide 
         vulnerability that affects all racks. Companies like IBM often like 
         to install things with default keys and passwords, and you really 
         need to make sure those are changed and individualized for different 
         racks. Any password hash on a rack off-site is accessible and potentially 
         crackable.
    <SS> Most of these are non-controversial (at least to security folks!) but 
         I didn't quite understand a couple points, maybe because I joined the
         review late and will admit to not reading all the messages.
         6. ... "Any password hash on a rack off-site is accessible..." 
         ... I thought all these racks were getting installed in "well-known" 
         facilities. So while remote, they aren't exactly in physically unprotected
         locations, right?
    <AS> I don't know how much we trust the administrators at the dozens 
         and eventually hundreds of sites. Might students be admins of some 
         of these racks? If it really is a small set of trusted admins with 
         racks on data center floors, then it is less of an issue. 
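
The network-level rule stated in this item ("The same VLAN won't have slivers from multiple slices") can be expressed as a simple invariant check. This is only a sketch: the input format of `(slice_name, vlan_id)` pairs and the function name are assumptions, not something ORCA or the rack software is known to export.

```python
from collections import defaultdict


def shared_vlans(slivers):
    """Given (slice_name, vlan_id) pairs, return {vlan_id: [slice names]}
    for every VLAN that carries slivers from more than one slice."""
    by_vlan = defaultdict(set)
    for slice_name, vlan_id in slivers:
        by_vlan[vlan_id].add(slice_name)
    return {vlan: sorted(names) for vlan, names in by_vlan.items()
            if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical allocations; an empty result means the invariant holds.
    print(shared_vlans([("slice-a", 100), ("slice-a", 100), ("slice-b", 200)]))
```

Any non-empty result would indicate a VLAN shared across slices and therefore a break in the isolation property described above.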
    

Miscellaneous

  • S.30 For each rack, could the aggregate operator give a concrete block of IP addresses unique to it?
    <AS> Sounds like this is a policy issue and could be made a part
         of the configuration guidelines for each rack. It is helpful
         for the LLR to be able to tell quickly from an IP address whether
         something is from a GENI rack, and at which organization.
    <IB> A block or a list of addresses is fine
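
A sketch of the lookup AS describes, covering both forms IB mentions (a block, or a list of individual addresses written as /32 entries). The rack names and address ranges are placeholders drawn from the reserved documentation ranges, not real ExoGENI allocations.

```python
import ipaddress

# Hypothetical per-rack allocations (documentation address ranges only).
RACK_BLOCKS = {
    "rack-a.example.org": ["192.0.2.0/25"],
    "rack-b.example.org": ["198.51.100.16/28", "203.0.113.7/32"],
}


def rack_for_ip(ip, blocks=RACK_BLOCKS):
    """Return the rack whose published block or address list contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for rack, cidrs in blocks.items():
        if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
            return rack
    return None


if __name__ == "__main__":
    print(rack_for_ip("198.51.100.20"))
    print(rack_for_ip("192.0.2.200"))
```

With a table like this, mapping an address seen in a log to a rack and its organization is a single call; addresses outside every published block return `None`.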
    
  • S.31 Are any user credentials stored anywhere, even temporarily? If so how are they protected and how long do they live?
    <AS> I think you argue in the white paper that this is not applicable?
    <IB> If ABAC is adopted in GENI, user certs may be cached on the 
         head node as part of the authorization process. They, however, 
         constitute public information and do not require confidentiality 
         protection.
    
