Opened 12 years ago
Closed 12 years ago
#51 closed (fixed)
shared vlan requests fail to complete and result in geni_status failed
Reported by: | lnevers@bbn.com | Owned by: | ibaldin@renci.org |
---|---|---|---|
Priority: | major | Milestone: | EG-EXP-5 |
Component: | Experiment | Version: | SPIRAL4 |
Keywords: | vlan | Cc: | |
Dependencies: |
Description
Several RSpecs that included a shared VLAN request
<s:link_shared_vlan name="1750"/>
all resulted in geni_status failed.
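(For reference, a minimal sketch of a request link carrying this extension, modeled on the full RSpec in comment:3 below; the client_id values are placeholders, and the s: prefix must be bound to the ProtoGENI shared-vlan extension namespace.)

  <link client_id="lan0" xmlns:s="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1">
    <interface_ref client_id="VM:if0"/>
    <interface_ref client_id="VM-0:if0"/>
    <s:link_shared_vlan name="1750"/>
  </link>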
Attachments (1)
Change History (21)
comment:1 Changed 12 years ago by
I was able to create a sliver with a VM on VLAN 1750, in slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+jbs15. However, I couldn't create a second one, in jbs16; I got a couple of different errors, including
and
Here's the rspec for the one that failed:
The one that succeeded is essentially identical:
comment:2 Changed 12 years ago by
I was able to bypass one of the errors reported by Josh when I added
component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"
to the node definition.
comment:3 Changed 12 years ago by
Just cleared the insufficient-resources failure; I had 6 failed slivers which were taking up resources.
I was just able to create a sliver successfully, which resulted in geni_status "ready", with the following RSpec:
<?xml version="1.0" encoding="UTF-8"?> <rspec type="request" xsi:schemaLocation="http://www.protogeni.net/resources/rspec/2 http://www.protogeni.net/resources/rspec/ext/shared-vlan/1 http://www.protogeni.net/resources/rspec/2/request.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:s="http://www.protogeni.net/resources/rspec/ext/shared-vlan/1" xmlns="http://www.protogeni.net/resources/rspec/2"> <node client_id="VM" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm" > <sliver_type name="m1.small"> <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" /> </sliver_type> <interface client_id="VM:if0"> <ip address="172.16.1.1" netmask="255.255.0.0" /> </interface> </node> <node client_id="VM-0" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm" > <sliver_type name="m1.small"> <disk_image name="http://geni-images.renci.org/images/standard/debian/debian-squeeze-amd64-neuca-2g.zfilesystem.sparse.v0.2.xml" version="397c431cb9249e1f361484b08674bc3381455bb9" /> </sliver_type> <interface client_id="VM-0:if0"> <ip address="172.16.1.2" netmask="255.255.0.0" /> </interface> </node> <link client_id="lan0"> <interface_ref client_id="VM:if0"/> <interface_ref client_id="VM-0:if0"/> <s:link_shared_vlan name="1750"/> </link> </rspec>
comment:4 Changed 12 years ago by
Using the RSpec shown in the previous comment, with modified IP addresses for my OpenFlow experiment, I was able to request two hosts on VLAN 1750 at BBN ExoGENI and connect them to a Clemson MyPLC host.
I am not sure why the failures from last night are not occurring this morning; last night I had 3 consecutive slivers that resulted in geni_status failed, but just now I was able to create two consecutive slivers using OF VLAN 1750 and both worked.
I am going to run a few more experiments and capture results.
comment:5 Changed 12 years ago by
I tried again, and got the "Embedding workflow ERROR: 1:Insufficient resources or Unknown domain: http://geni-orca.renci.org/owl/UnboundDomain/vlan:vlan:0! Requested: vlan:1" error again. I tried adding
component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"
to my jbs16 rspec, and that seemed to work.
Is that necessary in general, or only sometimes? It didn't seem to be necessary when I created my jbs15 sliver.
comment:6 follow-up: 7 Changed 12 years ago by
I believe the component_manager_id is required when you request resources from the local rack SM.
comment:7 Changed 12 years ago by
Replying to lnevers@bbn.com:
I believe the component_manager_id is required when you request resources from the local rack SM.
It wasn't when I created my jbs15 sliver; or when I created either of them previously.
But, if it's supposed to be there in general, we can certainly do that (and advise people accordingly).
comment:8 follow-up: 13 Changed 12 years ago by
According to Ilia:
On 6/18/12 6:33 PM, Ilia Baldine wrote: Your request should be referencing bbnvmsite and be directed at bbn SM (bbn-hn.exogeni.net:11443/orca/xmlrpc)
Without the component_manager_id, my RSpecs do not work.
I am not sure why the jbs15 RSpec seems to work without the component_manager_id, since the jbs16 and jbs15 RSpecs are similar (other than names and addresses). Are you sure you didn't submit the jbs15 request to the ExoSM?
comment:9 Changed 12 years ago by
Created a sliver named "mn123" with 3 nodes on VLAN 1750; the sliver was created, but the geni_status was "failed". The RSpec used is attached.
Changed 12 years ago by
Attachment: | exo-3vm-shared-vlan.rspec added |
---|
comment:10 follow-up: 11 Changed 12 years ago by
It has been 1 hour and 50 minutes since the sliver creation command, and the geni_status for the "mn123" sliver is still "configuring". It seems ORCA did not kill the VM and try again after the initial 30-minute timeout.
This behavior is similar to one of the failed scenarios listed in ticket #50 (18 VM linear), where the sliver status remained in the "configuring" state.
comment:11 Changed 12 years ago by
Replying to lnevers@bbn.com:
It has been 1 hour and 50 minutes since the sliver creation command and the geni_status for the "mn123" sliver is still "configuring".
Correction:
- The sliver that is still "configuring" ~2 hours after creation is named "5vm".
- The sliver "mn123" has a "geni_status failed", which occurred immediately after sliver creation.
comment:12 Changed 12 years ago by
The sliver "5vm" became ready after ~2 hours and 20 minutes. Able to login and communicate among the nodes.
comment:13 Changed 12 years ago by
Replying to lnevers@bbn.com:
I am not sure why the jbs15 RSpec seems to work without the component_manager_id, since the jbs16 and jbs15 RSpecs are similar (other than names and addresses). Are you sure you didn't submit the jbs15 request to the ExoSM?
I'm sure; and I created both of them before the outage yesterday, without a component_manager_id, and both worked.
Come to think of it, I also created slivers on the RENCI rack (via https://rci-hn.exogeni.net:11443/orca/xmlrpc), and those VMs came up (and are still up) fine. Should I need a component_manager_id there? If so, what should it be? (I see some component_manager_id strings in the rack AMs' ad RSpecs, but they don't look like the one that Luisa suggested for the request RSpec.)
comment:14 Changed 12 years ago by
I think the need to specify the component_manager_id is new as of the deployment last night; Ilia will have to confirm.
Josh, you are right, the advertisement RSpec shows the following syntax for the component_manager_id:
component_manager_id="urn:publicid:IDN+geni-orca.renci.org+authority+bbnvmsite.rdf#bbnvmsite/Domain/vlan+orca-sm"
Examples in the past have used the following component_manager_id for the request RSpec:
component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"
I have been using the latter, based on earlier examples. I will try the syntax from the advertisement RSpec after the current experiment completes.
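(For illustration, a sketch of the node element under each syntax; the child elements are omitted, and the advertisement-style form is untested in a request at this point.)

  <!-- request style, used in examples so far -->
  <node client_id="VM" component_manager_id="urn:publicid:IDN+bbnvmsite+authority+cm"/>

  <!-- advertisement style, to be tried next -->
  <node client_id="VM" component_manager_id="urn:publicid:IDN+geni-orca.renci.org+authority+bbnvmsite.rdf#bbnvmsite/Domain/vlan+orca-sm"/>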
comment:15 Changed 12 years ago by
Tested two scenarios with multiple concurrent slivers on VLAN 1750, and they all worked.
=> Scenario 1:
Included 3 concurrent BBN ExoGENI slivers, containing respectively:
- 1 VM on VLAN 1750
- 2 VMs on VLAN 1750
- 2 VMs on VLAN 1750
Results: All 5 ExoGENI nodes exchanged traffic with the remote Meso-scale OpenFlow host at Clemson.
=> Scenario 2:
Included 3 concurrent BBN ExoGENI slivers, containing respectively:
- 1 VM on VLAN 1750
- 2 VMs on VLAN 1750
- 3 VMs on VLAN 1750 (*)see Note 1
Results: All 6 ExoGENI nodes exchanged traffic with the remote Meso-scale OpenFlow host at Clemson.
(*)Note 1:
The sliver named "june2012" which included 3 VMs on VLAN 1750 reported geni_status "configuring" for over 2 hour and 30 minutes before becoming "ready". This is the second occurrence today of the resource allocation taking hours, earlier sliver named "5vm" also took 2 and 1/2 hours to have resources ready for use.
comment:16 Changed 12 years ago by
For now it is better to specify a component manager when using shared VLANs. We will remove this restriction with the new code we plan to put in production on 06/23 or thereabouts.
Right now there is a small bug in the topology embedding code that makes specifying the cm a requirement.
comment:17 Changed 12 years ago by
Other problems experienced in this ticket appear to be OpenStack-related. To deal with those we just added code to properly pass error messages about slivers to the user. I still need to check whether RSpec conversion for this feature works properly, but at least in Flukes and in the ORCA portal there will be detailed messages about the reason for failure.
Unfortunately, from your perspective it is currently difficult to distinguish an embedding failure in the SM from a slivering failure (typically related to the image or OpenStack vagaries).
comment:18 Changed 12 years ago by
Owner: | changed from somebody to ibaldin@renci.org |
---|---|
Status: | new → assigned |
comment:19 Changed 12 years ago by
This should work more reliably now and when it doesn't the error messages should indicate the problem more precisely.
comment:20 Changed 12 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Set up 12 shared VLAN slivers and each came up without a problem. Closing ticket.