Opened 10 years ago
Closed 10 years ago
#188 closed (fixed)
U of Houston bare metal node in "configuring" state for 30 min and then fail
Reported by: | lnevers@bbn.com | Owned by: | vjo@duke.edu |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Experiment | Version: | SPIRAL5 |
Keywords: | confirmation tests | Cc: | |
Dependencies: |
Description
Created a sliver EG-CT-2a at the University of Houston with 1 VM and 1 Bare Metal node. Thirty minutes after createsliver completed, the bare metal node is still in "configuring" state:
INFO:omni:Sliver status for Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2a at AM URL https://geni.renci.org:11443/orca/xmlrpc
INFO:omni:{
  "geni_status": "configuring",
  "geni_urn": "urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2a",
  "geni_resources": [
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013",
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:lan0",
      "geni_error": "",
      "geni_status": "ready"
    },
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013",
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:BM-1",
      "geni_error": "",
      "geni_status": "configuring"
    },
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013",
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:VM-1",
      "geni_error": "",
      "geni_status": "ready"
    }
  ]
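For anyone repeating this check, the per-resource states in a sliverstatus result like the one above can be picked out with a few lines of Python. This is only a sketch: the function name `unready_resources` is made up for illustration, and the sample data is a trimmed copy of the status shown in this ticket, not output from any omni helper.

```python
import json

def unready_resources(status):
    """Return (name, geni_status) for every resource that is not 'ready'."""
    pending = []
    for res in status.get("geni_resources", []):
        if res.get("geni_status") != "ready":
            # The sliver URN ends in ':<name>', e.g. '...:BM-1'.
            name = res["geni_urn"].rsplit(":", 1)[-1]
            pending.append((name, res["geni_status"]))
    return pending

# Trimmed copy of the status reported in this ticket.
sample = json.loads("""
{
  "geni_status": "configuring",
  "geni_urn": "urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2a",
  "geni_resources": [
    {"geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:lan0",
     "geni_error": "", "geni_status": "ready"},
    {"geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:BM-1",
     "geni_error": "", "geni_status": "configuring"},
    {"geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:VM-1",
     "geni_error": "", "geni_status": "ready"}
  ]
}
""")

print(unready_resources(sample))
```

With the sample above, only BM-1 is reported as not ready, which matches what the status output shows.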
Change History (22)
comment:1 Changed 10 years ago by
comment:2 Changed 10 years ago by
Even though the bare metal node became ready, it was not possible to ping the VM in the same sliver in the same rack.
comment:3 Changed 10 years ago by
Created another sliver at the Houston rack with 1 VM and 1 bare metal node. Again the bare metal node has had geni_status "configuring" for the past 30 minutes.
comment:4 Changed 10 years ago by
For the second sliver that showed this problem it took 33 minutes for the bare metal to become ready, and once again the two nodes in the sliver cannot exchange traffic over their dataplane connection.
comment:5 Changed 10 years ago by
Luisa,
The connectivity issue appears limited to UH; I have verified that a slice between a VM and a bare metal node pings successfully at RCI and FIU.
Could you please release one of the bare metal nodes so that I can debug this?
Best, Victor
comment:6 Changed 10 years ago by
Sorry Victor for keeping both slivers up. Both slivers are gone, so I assume they were deleted.
comment:7 Changed 10 years ago by
Summary: | Bare metal node stuck in "configuring" state for new site confirmation test → U of Houston bare metal node in "configuring" state for 30 min and then fail |
---|
Rechecked this ticket with the EG-CT-2 scenario (1 VM and 1 Bare Metal node). The bare metal node still takes over 30 minutes to go from "configuring" to "ready" state. Also, 15 minutes after the bare metal node became ready, it still does not have an interface configured.
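To put a number on the "configuring" to "ready" delay rather than checking by hand, a polling loop along these lines could be used. It is a rough sketch under assumptions: `wait_until_ready` and its parameters are hypothetical, and `get_status` stands for whatever callable returns the node's current geni_status (for example, a wrapper around a sliverstatus call). The clock and sleep functions are injectable so the demo below runs without actually waiting.

```python
import time

def wait_until_ready(get_status, timeout_s=2400, interval_s=60,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until 'ready' or 'failed', or until timeout_s elapses.

    Returns (final_state, elapsed_seconds); final_state is 'timeout'
    if the node never left the intermediate state in time.
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status in ("ready", "failed"):
            return status, elapsed
        if elapsed >= timeout_s:
            return "timeout", elapsed
        sleep(interval_s)

# Demo with a simulated clock: two 'configuring' polls, then 'ready'.
fake_now = [0]
states = iter(["configuring", "configuring", "ready"])
result = wait_until_ready(
    lambda: next(states),
    timeout_s=600,
    interval_s=60,
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(result)
```

The default timeout of 2400 seconds is an arbitrary choice made only to cover the 30+ minute waits reported here.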
comment:8 Changed 10 years ago by
Owner: | changed from somebody to vjo@duke.edu |
---|---|
Status: | new → assigned |
comment:9 Changed 10 years ago by
Luisa,
The 30 minute issue should be resolved; it was a change that occurred between versions of xCAT. We're working on resolving the connectivity issue; there's some oddity regarding how we're inserting flow rules...
Best, Victor
comment:11 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Was able to verify resolution. Bare metal node is now ready in 14 minutes and it can exchange traffic with other nodes in the experiment. Closing ticket. Thank you Victor!
comment:12 Changed 10 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I was able to exchange traffic when I set up the experiment EG-CT-2 last Friday, but when I re-created the same sliver (1 Bare Metal and one VM) this morning I have found again that traffic cannot be exchanged between the two nodes.
Reopening ticket.
comment:13 Changed 10 years ago by
Luisa,
I'm unable to replicate this, and I have just tested with both bare metal nodes at Houston. Could you please retry?
Best, Victor
comment:14 Changed 10 years ago by
Hi Victor, I was actually just trying this test a few times to see if it is reproducible.
First try was successful and was able to ping between the VM and BM. Deleted sliver.
Second try, the createsliver was successful and I got a manifest (below), but when I did a sliverstatus 900 seconds later, I got an "ERROR: There are no reservations" failure.
+ omni.py -a eg-sm createsliver EG-CT-2 EG-CT-2-uh.rspec
INFO:omni:Loading config file /home/lnevers/.gcf/omni_config
INFO:omni:Using control framework portal
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day on 2013-07-03 19:50:44 UTC
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Creating sliver(s) from rspec file EG-CT-2-uh.rspec for slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2
INFO:omni:Got return from CreateSliver for slice EG-CT-2 at https://geni.renci.org:11443/orca/xmlrpc:
INFO:omni:<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
INFO:omni: <!-- Reserved resources for:
  Slice: EG-CT-2
  at AM:
  URN: unspecified_AM_URN
  URL: https://geni.renci.org:11443/orca/xmlrpc
 -->
INFO:omni: <rspec type="manifest"
    xsi:schemaLocation="http://www.geni.net/resources/rspec/3 http://www.geni.net/resources/rspec/3/manifest.xsd http://hpn.east.isi.edu/rspec/ext/stitch/0.1/ http://hpn.east.isi.edu/rspec/ext/stitch/0.1/stitch-schema.xsd http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1 http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1/slice_info.xsd?format=raw http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1 http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1/sliver_info.xsd?format=raw http://www.geni.net/resources/rspec/ext/postBootScript/1 http://www.geni.net/resources/rspec/ext/postBootScript/1/request.xsd"
    xmlns:ns2="http://hpn.east.isi.edu/rspec/ext/stitch/0.1/"
    xmlns="http://www.geni.net/resources/rspec/3"
    xmlns:ns4="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1"
    xmlns:ns3="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1"
    xmlns:ns5="http://www.geni.net/resources/rspec/ext/postBootScript/1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <node sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:BM-1" exclusive="true" component_name="orca-xcat-cloud" component_manager_id="urn:publicid:IDN+exogeni.net:uhvmsite+authority+am" component_id="urn:publicid:IDN+exogeni.net:uhvmsite+node+orca-xcat-cloud" client_id="BM-1">
    <location latitude="29.72327" longitude="-95.34269" country="Unspecified"/>
    <sliver_type name="ExoGENI-M4">
      <disk_image version="d1044d9162bd7851e3fc2c57a8251ad6b3641c0c" name="http://geni-images.renci.org/images/standard/debian/deb6-neuca-v1.0.8.xml"/>
    </sliver_type>
    <services/>
    <interface mac_address="fe:16:3e:00:5a:a8" client_id="BM-1:if0">
      <ip type="ipv4" netmask="255.255.255.0" address="172.16.1.1"/>
    </interface>
    <ns4:geni_sliver_info error="Reservation 31b86490-0462-418b-9161-90e97c832d64 (Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2) is in state [Failed,None] Last ticket update: Insufficient resources for specified start time, Failing reservation:31b86490-0462-418b-9161-90e97c832d64 Ticket events Insufficient resources for specified start time, Failing reservation:31b86490-0462-418b-9161-90e97c832d64" destroyed="true" state="failed" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
  </node>
  <node sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:VM-1" exclusive="false" component_name="orca-vm-cloud" component_manager_id="urn:publicid:IDN+exogeni.net:uhvmsite+authority+am" component_id="urn:publicid:IDN+exogeni.net:uhvmsite+node+orca-vm-cloud" client_id="VM-1">
    <location latitude="29.72327" longitude="-95.34269" country="Unspecified"/>
    <sliver_type name="m1.small">
      <disk_image version="64ad567ce3b1c0dbaa15bad673bbf556a9593e1c" name="http://geni-images.renci.org/images/standard/debian/deb6-neuca-v1.0.6.xml"/>
    </sliver_type>
    <services>
      <ns5:services_post_boot_script type="velocity">#!/bin/bash
# Automatically generated boot script
execString=&quot;/bin/sh -c \&quot;sudo yum install iperf -y\&quot;&quot;
eval $execString
      </ns5:services_post_boot_script>
    </services>
    <interface mac_address="fe:16:3e:00:67:66" client_id="VM-1:if0">
      <ip type="ipv4" netmask="255.255.255.0" address="172.16.1.2"/>
    </interface>
    <ns4:geni_sliver_info state="configuring" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
  </node>
  <link vlantag="unknown" sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:lan0" client_id="lan0">
    <interface_ref client_id="BM-1:if0"/>
    <interface_ref client_id="VM-1:if0"/>
    <ns4:geni_sliver_info state="configuring" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
  </link>
  <ns3:geni_slice_info state="configuring" uuid="250428c3-7266-4111-a29a-f4682a90ec22" urn="urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2"/>
</rspec>
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed createsliver:
  Options as run:
    aggregate: ['eg-sm']
    framework: portal
    project: ln-prj
  Args: createsliver EG-CT-2 EG-CT-2-uh.rspec
  Result Summary: Got Reserved resources RSpec from geni-renci-org-11443-orca
INFO:omni: ============================================================
+ sleep 900
+ omni.py -a eg-sm sliverstatus EG-CT-2
INFO:omni:Loading config file /home/lnevers/.gcf/omni_config
INFO:omni:Using control framework portal
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day on 2013-07-03 19:50:44 UTC
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Status of Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2:
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed sliverstatus:
  Options as run:
    aggregate: ['eg-sm']
    framework: portal
    project: ln-prj
  Args: sliverstatus EG-CT-2
  Result Summary: Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day(s) on 2013-07-03 19:50:44 UTC
Failed to get SliverStatus on EG-CT-2 at AM https://geni.renci.org:11443/orca/xmlrpc: Error from Aggregate: code 2: ERROR: There are no reservations in the slice with sliceId = urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2.
Returned status of slivers on 0 of 1 possible aggregates.
comment:15 follow-up: 16 Changed 10 years ago by
Luisa,
Yes, this sounds right. On the first run, omni returned the information from the SM that a resource could not be reserved from the AM; I suspect that the problem was my reservation of both bare metal nodes.
On the subsequent run, 900 seconds later, your slice was reported as "empty of reservations" (in truth, does not exist) - because the SM was unable to reserve one of the slivers (the bare metal node), and therefore did not proceed any further with the slice.
Best, Victor
comment:16 Changed 10 years ago by
Replying to vjo@duke.edu:
On the first run, omni returned the information from the SM that a resource could not be reserved from the AM; I suspect that the problem was my reservation of both bare metal nodes.
Normally I check the overall createsliver result, which was successful because the ExoSM returned a sliver manifest rather than an error result:
Result Summary: Got Reserved resources RSpec from geni-renci-org-11443-orca
The content of the manifest includes a "Last ticket update: Insufficient resources" message, but shouldn't there be an overall failure indication from the Aggregate?
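Until the aggregate reports an overall failure, a client-side check of the manifest is one possible workaround. The sketch below scans a manifest for geni_sliver_info elements with state="failed"; the namespace URI is taken from the manifest in comment:14, while the function name `failed_slivers` and the trimmed sample manifest are illustrative assumptions, not part of omni or ORCA.

```python
import xml.etree.ElementTree as ET

# Namespace of the ExoGENI sliver-info extension, as seen in the manifest.
SLIVER_INFO_NS = ("http://groups.geni.net/exogeni/attachment/wiki/"
                  "RspecExtensions/sliver-info/1")

def failed_slivers(manifest_xml):
    """Return (client_id, error) for each element whose sliver info is 'failed'."""
    root = ET.fromstring(manifest_xml)
    failures = []
    for elem in root.iter():
        # Look for a geni_sliver_info child directly under this element.
        info = elem.find(f"{{{SLIVER_INFO_NS}}}geni_sliver_info")
        if info is not None and info.get("state") == "failed":
            failures.append((elem.get("client_id"), info.get("error", "")))
    return failures

# Heavily trimmed stand-in for the manifest returned in comment:14.
sample_manifest = """
<rspec type="manifest"
       xmlns="http://www.geni.net/resources/rspec/3"
       xmlns:ns4="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1">
  <node client_id="BM-1">
    <ns4:geni_sliver_info state="failed"
        error="Insufficient resources for specified start time"/>
  </node>
  <node client_id="VM-1">
    <ns4:geni_sliver_info state="configuring"/>
  </node>
</rspec>
"""

for client_id, error in failed_slivers(sample_manifest):
    print(f"{client_id}: {error}")
```

Run against the full manifest from comment:14, a check like this would flag BM-1 even though the createsliver Result Summary reads as a success.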
comment:17 follow-up: 18 Changed 10 years ago by
Luisa,
That's probably a question better addressed by Ilya.
Best, Victor
comment:18 follow-up: 19 Changed 10 years ago by
Replying to vjo@duke.edu:
That's probably a question better addressed by Ilya.
Probably!
Ilya, should the ExoSM confirm that the Bare metal node is available before reporting a manifest Rspec that shows it being configured?
comment:19 Changed 10 years ago by
Replying to lnevers@bbn.com:
Ilya, should the ExoSM confirm that the Bare metal node is available before reporting a manifest Rspec that shows it being configured?
The issue with ticket update failure not being handled is being tracked in a separate ticket (http://groups.geni.net/exogeni/ticket/190).
comment:20 Changed 10 years ago by
Reproduced the "nodes ready but no connectivity" scenario. The slice is EG-CT-2b; I will be available on chat to continue.
comment:21 Changed 10 years ago by
After Victor applied a fix for a race condition, I created the sliver 4 times with 100% success. This issue is resolved. Closing ticket.
comment:22 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Oops, forgot to close this one.
Checked 30 minutes later and found the Bare Metal node is finally ready. Hmmm, over 30 minutes to become ready is slow!