Opened 10 years ago

Closed 10 years ago

#51 closed (fixed)

shared vlan requests fail to complete and result in geni_status failed

Reported by: lnevers@bbn.com Owned by: ibaldin@renci.org
Priority: major Milestone: EG-EXP-5
Component: Experiment Version: SPIRAL4
Keywords: vlan Cc:
Dependencies:

Description

Using several RSpecs that included a shared VLAN request

<s:link_shared_vlan name="1750"/>

all resulted in geni_status failed.
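For context, the element above is from the ProtoGENI shared-vlan RSpec extension; a minimal request link using it, with the namespace prefix and VLAN name as they appear in the RSpecs later in this ticket, looks roughly like this (the client_id values are illustrative):

```xml
<!-- Minimal sketch of a shared-VLAN link request. The "s" prefix is
     bound in the RSpecs below as:
     xmlns:s="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1" -->
<link client_id="lan0">
  <interface_ref client_id="VM:if0" />
  <s:link_shared_vlan name="1750" />
</link>
```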

Attachments (1)

exo-3vm-shared-vlan.rspec (1.9 KB) - added by lnevers@bbn.com 10 years ago.


Change History (21)

comment:1 Changed 10 years ago by jbs@bbn.com

I was able to create a sliver with a VM on VLAN 1750, in slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15. However, I couldn't create a second one, in jbs16; I got a couple of different errors, including

Asked https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc to reserve resources. No manifest Rspec returned. Request id: e1b9e4bd-5e7e-47cc-a535-4a5ec8323dff
Embedding workflow ERROR: 1:Insufficient resources or Unknown domain: http://geni-orca.renci.org/owl/UnboundDomain/vlan:vlan:0! Requested: vlan:1 

and

Asked https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc to reserve resources. No manifest Rspec returned. Request id: c6cd2477-6dc1-4d6b-b10d-df8204444acc
Embedding workflow ERROR: 100:More than two-way splitting is not suppoerted: Requested type:vm:1 

Here's the rspec for the one that failed:

<?xml version="1.0" encoding="UTF-8"?>
<!--
This rspec will reserve the ORCA resources at BBN
used by the jbs16 slice.

It requests one VM, running Debian 5, named "bbn-orca-jbs16", and with a
dataplane interface on VLAN 1750, and a jbs16 IP address.

AM: https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc
-->

<rspec xmlns="http://www.protogeni.net/resources/rspec/2"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:sharedvlan="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1"
       xsi:schemaLocation="http://www.protogeni.net/resources/rspec/2
                           http://www.protogeni.net/resources/rspec/2/request.xsd"
       type="request" >

  <node client_id="bbn-orca-jbs16">

    <sliver_type name="m1.small">
      <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" />
    </sliver_type>

    <services>
      <execute shell="sh" command="hostname bbn-orca-jbs16"/>
    </services>

    <interface client_id="bbn-orca-jbs16:1">
      <ip address="10.42.16.31" netmask="255.255.255.0" />
    </interface>

  </node>

  <link client_id="mesoscale">
    <interface_ref client_id="bbn-orca-jbs16:1" />
    <sharedvlan:link_shared_vlan name="1750" />
  </link>

</rspec>

The one that succeeded is essentially identical:

<?xml version="1.0" encoding="UTF-8"?>
<!--
This rspec will reserve the ORCA resources at BBN
used by the jbs15 slice.

It requests one VM, running Debian 5, named "bbn-orca-jbs15", and with a
dataplane interface on VLAN 1750, and a jbs15 IP address.

AM: https://bbn-hn.exogeni.gpolab.bbn.com:11443/orca/xmlrpc
-->

<rspec xmlns="http://www.protogeni.net/resources/rspec/2"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:sharedvlan="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1"
       xsi:schemaLocation="http://www.protogeni.net/resources/rspec/2
                           http://www.protogeni.net/resources/rspec/2/request.xsd"
       type="request" >

  <node client_id="bbn-orca-jbs15">

    <sliver_type name="m1.small">
      <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" />
    </sliver_type>

    <services>
      <execute shell="sh" command="hostname bbn-orca-jbs15"/>
    </services>

    <interface client_id="bbn-orca-jbs15:1">
      <ip address="10.42.15.31" netmask="255.255.255.0" />
    </interface>

  </node>

  <link client_id="mesoscale">
    <interface_ref client_id="bbn-orca-jbs15:1" />
    <sharedvlan:link_shared_vlan name="1750" />
  </link>

</rspec>

comment:2 Changed 10 years ago by lnevers@bbn.com

I was able to bypass one of the errors Josh reported by adding:

 component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"

to the node definition.
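Concretely, the workaround looks like the sketch below: the component_manager_id attribute goes directly on the node element of the request RSpec (node name and image URL are copied from the RSpecs quoted in this ticket):

```xml
<!-- Sketch of the workaround: component_manager_id added to the
     <node> element. Values are taken from the RSpecs in this ticket. -->
<node client_id="VM" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm">
  <sliver_type name="m1.small">
    <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" />
  </sliver_type>
</node>
```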

comment:3 Changed 10 years ago by lnevers@bbn.com

Just cleared the insufficient-resources failure: I had 6 failed slivers that were taking up resources.

I was just able to create a sliver successfully, which resulted in geni_status "ready", with the following RSpec:

<?xml version="1.0" encoding="UTF-8"?>
<rspec type="request"
xsi:schemaLocation="http://www.protogeni.net/resources/rspec/2
                    http://www.protogeni.net/resources/rspec/ext/shared-vlan/1
                    http://www.protogeni.net/resources/rspec/2/request.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:s="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1"
    xmlns="http://www.protogeni.net/resources/rspec/2">

  <node client_id="VM" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm" >
    <sliver_type name="m1.small">
      <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" />
    </sliver_type>
    <interface client_id="VM:if0">
      <ip address="172.16.1.1" netmask="255.255.0.0" />
    </interface>
  </node>
  <node client_id="VM-0" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm" >
    <sliver_type name="m1.small">
      <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" />
    </sliver_type>
    <interface client_id="VM-0:if0">
      <ip address="172.16.1.2" netmask="255.255.0.0" />
    </interface>
  </node>
  <link client_id="lan0">
    <interface_ref client_id="VM:if0"/>
    <interface_ref client_id="VM-0:if0"/>
    <s:link_shared_vlan name="1750"/>
  </link>
</rspec>

comment:4 Changed 10 years ago by lnevers@bbn.com

Using the RSpec shown in the previous comment, with modified IP addresses for my OpenFlow experiment, I was able to request two hosts on VLAN 1750 at BBN ExoGENI and connect them to a Clemson MyPLC host.

I am not sure why the failures from last night are not occurring this morning. Last night I had 3 consecutive slivers that resulted in geni_status failed, but just now I was able to create two consecutive slivers using OF VLAN 1750 and both worked.

I am going to run a few more experiments and capture results.

comment:5 Changed 10 years ago by jbs@bbn.com

I tried again, and got the "Embedding workflow ERROR: 1:Insufficient resources or Unknown domain: http://geni-orca.renci.org/owl/UnboundDomain/vlan:vlan:0! Requested: vlan:1" error again. I tried adding

component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm" 

to my jbs16 rspec, and that seemed to work.

Is that necessary in general, or only sometimes? It didn't seem to be necessary when I created my jbs15 sliver.

comment:6 Changed 10 years ago by lnevers@bbn.com

I believe the component_manager_id is required when you request resources from the local rack SM.

comment:7 in reply to:  6 Changed 10 years ago by jbs@bbn.com

Replying to lnevers@bbn.com:

I believe the component_manager_id is required when you request resources from the local rack SM.

It wasn't needed when I created my jbs15 sliver, or when I created either of them previously.

But, if it's supposed to be there in general, we can certainly do that (and advise people accordingly).

comment:8 Changed 10 years ago by lnevers@bbn.com

According to Ilia:

On 6/18/12 6:33 PM, Ilia Baldine wrote:
Your request should be referencing bbnvmsite and be directed at bbn SM (bbn-hn.exogeni.net:11443/orca/xmlrpc)

Without the component_manager_id, my RSpecs do not work.

I am not sure why the jbs15 rspec seems to work without the component_manager_id, since the jbs16 and jbs15 RSpecs are similar (other than names and addresses). Are you sure you didn't submit the jbs15 request to the ExoSM?

comment:9 Changed 10 years ago by lnevers@bbn.com

Created a sliver named "mn123" with 3 nodes on VLAN 1750; the sliver was created, but the geni_status was "failed". The RSpec used is attached.

Changed 10 years ago by lnevers@bbn.com

Attachment: exo-3vm-shared-vlan.rspec added

comment:10 Changed 10 years ago by lnevers@bbn.com

It has been 1 hour and 50 minutes since the sliver creation command and the geni_status for the "mn123" sliver is still "configuring". It seems ORCA did not kill the VM and try again after the initial 30 minutes timeout.

This behavior is similar to one of the failed scenarios listed in ticket #50 (18 VM linear), where the sliver status remained in the "configuring" state.

comment:11 in reply to:  10 Changed 10 years ago by lnevers@bbn.com

Replying to lnevers@bbn.com:

It has been 1 hour and 50 minutes since the sliver creation command and the geni_status for the "mn123" sliver is still "configuring".

Correction:

  • The sliver that is still "configuring" ~2 hours after creation is named "5vm".
  • The sliver "mn123" has a "geni_status failed", which occurred immediately after sliver creation.

comment:12 Changed 10 years ago by lnevers@bbn.com

The sliver "5vm" became ready after ~2 hours and 20 minutes. I was able to log in and communicate among the nodes.

comment:13 in reply to:  8 Changed 10 years ago by jbs@bbn.com

Replying to lnevers@bbn.com:

I am not sure, why the jbs15 rspec seems to work without the component_manager_id, since both are jbs16 and jbs15 RSpecs are similar (other than names and addresses). Are you sure you didn't submit the jbs15 request to the ExoSM?

I'm sure; and I created both of them before the outage yesterday, without a component_manager_id, and both worked.

Come to think of it, I also created slivers on the RENCI rack (via https://rci-hn.exogeni.net:11443/orca/xmlrpc), and those VMs came up (and are still up) fine. Should I need a component_manager_id there? If so, what should it be? (I see some component_manager_id strings in the rack AMs' ad rspecs, but they don't look like the one that Luisa suggested for the request rspec.)

comment:14 Changed 10 years ago by lnevers@bbn.com

I think the need to specify the component_manager_id is new as part of the deployment last night, Ilia will have to confirm.

Josh, you are right, the advertisement RSpec shows the following syntax for the component_manager_id:

component_manager_id="urn:publicid:IDN+geni-orca.renci.org+authority+bbnvmsite.rdf#bbnvmsite/Domain/vlan+orca-sm"     

Examples in the past have used the following component_manager_id for the request RSpec:

component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"

I have been using the latter, based on earlier examples. I will try the syntax from the advertisement RSpec after the current experiment completes.

comment:15 Changed 10 years ago by lnevers@bbn.com

Tested two scenarios with multiple concurrent slivers on VLAN 1750, and they all worked.

=> Scenario 1:

Included 3 concurrent BBN ExoGENI slivers, which respectively included:

  • 1 VM on VLAN 1750
  • 2 VMs on VLAN 1750
  • 2 VMs on VLAN 1750

Results: All 5 ExoGENI nodes exchanged traffic with Meso-scale OpenFlow remote at Clemson.

=> Scenario 2:

Included 3 concurrent BBN ExoGENI slivers, which respectively included:

  • 1 VM on VLAN 1750
  • 2 VMs on VLAN 1750
  • 3 VMs on VLAN 1750 (*)see Note 1

Results: All 6 ExoGENI nodes exchanged traffic with the Meso-scale OpenFlow remote at Clemson.

(*)Note 1:

The sliver named "june2012", which included 3 VMs on VLAN 1750, reported geni_status "configuring" for over 2 hours and 30 minutes before becoming "ready". This is the second occurrence today of resource allocation taking hours; the earlier sliver named "5vm" also took two and a half hours before its resources were ready for use.

comment:16 Changed 10 years ago by ibaldin@renci.org

For now it is better to specify a component manager when using shared VLANs. We will remove this restriction with the new code we plan to put into production on 06/23 or thereabouts.

Right now there is a small bug in the topology embedding code that makes specifying a cm a requirement.

comment:17 Changed 10 years ago by ibaldin@renci.org

Other problems experienced in this ticket appear to be OpenStack-related. To deal with those, we just added code to properly pass error messages about slivers to the user. I still need to check whether RSpec conversion for this feature works properly, but at least in Flukes and in the ORCA portal there will be detailed messages about the reason for failure.

Unfortunately, from your perspective it is now difficult to distinguish an embedding failure in the SM from a slivering failure (typically related to the image or OpenStack vagaries).

comment:18 Changed 10 years ago by ibaldin@renci.org

Owner: changed from somebody to ibaldin@renci.org
Status: new → assigned

comment:19 Changed 10 years ago by ibaldin@renci.org

This should work more reliably now and when it doesn't the error messages should indicate the problem more precisely.

comment:20 Changed 10 years ago by lnevers@bbn.com

Resolution: fixed
Status: assigned → closed

Set up 12 shared VLAN slivers and each came up without a problem. Closing ticket.
