Opened 6 years ago

Closed 6 years ago

#188 closed (fixed)

U of Houston bare metal node in "configuring" state for 30 min and then fail

Reported by: lnevers@bbn.com
Owned by: vjo@duke.edu
Priority: major
Milestone:
Component: Experiment
Version: SPIRAL5
Keywords: confirmation tests
Cc:
Dependencies:

Description

Created a sliver EG-CT-2a at the University of Houston with 1 VM and 1 bare metal node. Thirty minutes after createsliver completed, the bare metal node is still in the "configuring" state:

INFO:omni:Sliver status for Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2a at AM URL https://geni.renci.org:11443/orca/xmlrpc
INFO:omni:{
  "geni_status": "configuring", 
  "geni_urn": "urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2a", 
  "geni_resources": [
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013", 
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:lan0", 
      "geni_error": "", 
      "geni_status": "ready"
    }, 
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013", 
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:BM-1", 
      "geni_error": "", 
      "geni_status": "configuring"
    }, 
    {
      "orca_expires": "Thu Jun 20 21:33:28 EDT 2013", 
      "geni_urn": "urn:publicid:IDN+exogeni.net+sliver+69eaa65e-1e1d-408b-8c32-f894f0181555:VM-1", 
      "geni_error": "", 
      "geni_status": "ready"
    }
  ]
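
When re-checking a status like the one above, it can help to list only the resources that are not yet ready. The following is a minimal sketch, assuming the sliverstatus output has already been parsed into a Python dict; the function name `pending_resources` and the shortened URNs in the sample are hypothetical and not part of omni.

```python
import json


def pending_resources(status):
    """Return (urn, state) pairs for resources whose geni_status is not 'ready'."""
    return [
        (res["geni_urn"], res["geni_status"])
        for res in status.get("geni_resources", [])
        if res.get("geni_status") != "ready"
    ]


# Trimmed copy of the status above; the URNs are shortened placeholders.
sample = json.loads("""
{
  "geni_status": "configuring",
  "geni_resources": [
    {"geni_urn": "sliver:lan0", "geni_status": "ready"},
    {"geni_urn": "sliver:BM-1", "geni_status": "configuring"},
    {"geni_urn": "sliver:VM-1", "geni_status": "ready"}
  ]
}
""")

print(pending_resources(sample))  # [('sliver:BM-1', 'configuring')]
```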

Change History (22)

comment:1 Changed 6 years ago by lnevers@bbn.com

Checked 30 minutes later and found the Bare Metal node is finally ready. Hmmm, over 30 minutes to become ready is slow!

comment:2 Changed 6 years ago by lnevers@bbn.com

Even though the bare metal node became ready, it was not possible to ping the VM in the same sliver in the same rack.

comment:3 Changed 6 years ago by lnevers@bbn.com

Created another sliver at the Houston rack with 1 VM and 1 bare metal node. Again the bare metal node has had geni_status "configuring" for the past 30 minutes.

comment:4 Changed 6 years ago by lnevers@bbn.com

For the second sliver that showed this problem it took 33 minutes for the bare metal to become ready, and once again the two nodes in the sliver cannot exchange traffic over their dataplane connection.

comment:5 Changed 6 years ago by vjo@duke.edu

Luisa,

The connectivity issue appears limited to UH; I have verified that a slice between a VM and a bare metal node pings successfully at RCI and FIU.

Could you please release one of the bare metal nodes so that I can debug this?

Best, Victor

comment:6 Changed 6 years ago by lnevers@bbn.com

Sorry Victor for keeping both slivers up. Both slivers are gone, so I assume they were deleted.

comment:7 Changed 6 years ago by lnevers@bbn.com

Summary: changed from Bare metal node stuck in "configuring" state for new site confirmation test to U of Houston bare metal node in "configuring" state for 30 min and then fail

Re-checked this issue with the EG-CT-2 scenario (1 VM and 1 bare metal node). The bare metal node still takes over 30 minutes to go from "configuring" to "ready" state. Also, 15 minutes after the bare metal node became ready, it still does not have an interface configured.

comment:8 Changed 6 years ago by vjo@duke.edu

Owner: changed from somebody to vjo@duke.edu
Status: changed from new to assigned

comment:9 Changed 6 years ago by vjo@duke.edu

Luisa,

The 30 minute issue should be resolved; it was a change that occurred between versions of xCAT. We're working on resolving the connectivity issue; there's some oddity regarding how we're inserting flow rules...

Best, Victor

comment:10 Changed 6 years ago by vjo@duke.edu

Luisa,

This should be resolved; try it now.

Best, Victor

comment:11 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: changed from assigned to closed

Was able to verify resolution. Bare metal node is now ready in 14 minutes and it can exchange traffic with other nodes in experiment. Closing ticket. Thank you Victor!

comment:12 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: changed from closed to reopened

I was able to exchange traffic when I set up the experiment EG-CT-2 last Friday, but when I re-created the same sliver (1 Bare Metal and one VM) this morning I have found again that traffic cannot be exchanged between the two nodes.

Reopening ticket.

comment:13 Changed 6 years ago by vjo@duke.edu

Luisa,

I'm unable to replicate this, and I have just tested with both bare metal nodes at Houston. Could you please retry?

Best, Victor

comment:14 Changed 6 years ago by lnevers@bbn.com

Hi Victor, I was actually just trying this test a few times to see if reproducible.

First try was successful and was able to ping between the VM and BM. Deleted sliver.

Second try, the createsliver was successful and I got a manifest (below), but when I ran sliverstatus 900 seconds later, I got an "ERROR: There are no reservations" failure.

+ omni.py -a eg-sm createsliver EG-CT-2 EG-CT-2-uh.rspec
INFO:omni:Loading config file /home/lnevers/.gcf/omni_config
INFO:omni:Using control framework portal
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day on 2013-07-03 19:50:44 UTC
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Creating sliver(s) from rspec file EG-CT-2-uh.rspec for slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2
INFO:omni:Got return from CreateSliver for slice EG-CT-2 at https://geni.renci.org:11443/orca/xmlrpc:
INFO:omni:<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
INFO:omni:  <!-- Reserved resources for:
	Slice: EG-CT-2
	at AM:
	URN: unspecified_AM_URN
	URL: https://geni.renci.org:11443/orca/xmlrpc
 -->
INFO:omni:  
<rspec type="manifest" xsi:schemaLocation="http://www.geni.net/resources/rspec/3 http://www.geni.net/resources/rspec/3/manifest.xsd http://hpn.east.isi.edu/rspec/ext/stitch/0.1/ http://hpn.east.isi.edu/rspec/ext/stitch/0.1/stitch-schema.xsd http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1 http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1/slice_info.xsd?format=raw http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1 http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1/sliver_info.xsd?format=raw http://www.geni.net/resources/rspec/ext/postBootScript/1 http://www.geni.net/resources/rspec/ext/postBootScript/1/request.xsd" xmlns:ns2="http://hpn.east.isi.edu/rspec/ext/stitch/0.1/" xmlns="http://www.geni.net/resources/rspec/3" xmlns:ns4="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1" xmlns:ns3="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/slice-info/1" xmlns:ns5="http://www.geni.net/resources/rspec/ext/postBootScript/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <node sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:BM-1" exclusive="true" component_name="orca-xcat-cloud" component_manager_id="urn:publicid:IDN+exogeni.net:uhvmsite+authority+am" component_id="urn:publicid:IDN+exogeni.net:uhvmsite+node+orca-xcat-cloud" client_id="BM-1">
        <location latitude="29.72327" longitude="-95.34269" country="Unspecified"/>
        <sliver_type name="ExoGENI-M4">
            <disk_image version="d1044d9162bd7851e3fc2c57a8251ad6b3641c0c" name="http://geni-images.renci.org/images/standard/debian/deb6-neuca-v1.0.8.xml"/>
        </sliver_type>
        <services/>
        <interface mac_address="fe:16:3e:00:5a:a8" client_id="BM-1:if0">
            <ip type="ipv4" netmask="255.255.255.0" address="172.16.1.1"/>
        </interface>
        <ns4:geni_sliver_info error="Reservation 31b86490-0462-418b-9161-90e97c832d64 (Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2) is in state [Failed,None]

Last ticket update: Insufficient resources for specified start time, Failing reservation:31b86490-0462-418b-9161-90e97c832d64

Ticket events
Insufficient resources for specified start time, Failing reservation:31b86490-0462-418b-9161-90e97c832d64" destroyed="true" state="failed" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
    </node>
    <node sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:VM-1" exclusive="false" component_name="orca-vm-cloud" component_manager_id="urn:publicid:IDN+exogeni.net:uhvmsite+authority+am" component_id="urn:publicid:IDN+exogeni.net:uhvmsite+node+orca-vm-cloud" client_id="VM-1">
        <location latitude="29.72327" longitude="-95.34269" country="Unspecified"/>
        <sliver_type name="m1.small">
            <disk_image version="64ad567ce3b1c0dbaa15bad673bbf556a9593e1c" name="http://geni-images.renci.org/images/standard/debian/deb6-neuca-v1.0.6.xml"/>
        </sliver_type>
        <services>
            <ns5:services_post_boot_script type="velocity">#!/bin/bash
# Automatically generated boot script
execString=&amp;quot;/bin/sh -c \&amp;quot;sudo yum install iperf -y\&amp;quot;&amp;quot;
eval $execString

</ns5:services_post_boot_script>
        </services>
        <interface mac_address="fe:16:3e:00:67:66" client_id="VM-1:if0">
            <ip type="ipv4" netmask="255.255.255.0" address="172.16.1.2"/>
        </interface>
        <ns4:geni_sliver_info state="configuring" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
    </node>
    <link vlantag="unknown" sliver_id="urn:publicid:IDN+exogeni.net:uhvmsite+sliver+f2fd2254-f2b5-4279-9e24-e68190505a68:lan0" client_id="lan0">
        <interface_ref client_id="BM-1:if0"/>
        <interface_ref client_id="VM-1:if0"/>
        <ns4:geni_sliver_info state="configuring" start_time="2013-07-03T14:06:54.000Z" expiration_time="2013-07-03T19:50:44.000Z" creation_time="2013-07-03T14:06:54.000Z" creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>
    </link>
    <ns3:geni_slice_info state="configuring" uuid="250428c3-7266-4111-a29a-f4682a90ec22" urn="urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2"/>
</rspec>

INFO:omni: ------------------------------------------------------------
INFO:omni: Completed createsliver:

  Options as run:
		aggregate: ['eg-sm']
		framework: portal
		project: ln-prj

  Args: createsliver EG-CT-2 EG-CT-2-uh.rspec

  Result Summary: Got Reserved resources RSpec from geni-renci-org-11443-orca 
INFO:omni: ============================================================
+ sleep 900
+ omni.py -a eg-sm sliverstatus EG-CT-2
INFO:omni:Loading config file /home/lnevers/.gcf/omni_config
INFO:omni:Using control framework portal
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day on 2013-07-03 19:50:44 UTC
INFO:omni:Substituting AM nickname eg-sm with URL https://geni.renci.org:11443/orca/xmlrpc, URN unspecified_AM_URN
INFO:omni:Status of Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2:
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed sliverstatus:

  Options as run:
		aggregate: ['eg-sm']
		framework: portal
		project: ln-prj

  Args: sliverstatus EG-CT-2

  Result Summary: Slice urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2 expires within 1 day(s) on 2013-07-03 19:50:44 UTC

Failed to get SliverStatus on EG-CT-2 at AM https://geni.renci.org:11443/orca/xmlrpc: 
Error from Aggregate: code 2: ERROR: There are no reservations in the slice with 
sliceId = urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-2.
Returned status of slivers on 0 of 1 possible aggregates. 
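
The fixed `sleep 900` in the run above could be replaced by polling until the sliver leaves the "configuring" state. This is only a sketch under assumptions: `fetch_status` is a hypothetical callable that returns the parsed sliverstatus dict (how it talks to the aggregate is left out), and the timeout and interval values are arbitrary. If the aggregate answers with an error such as "There are no reservations", the callable would be expected to raise, which ends the loop.

```python
import time


def wait_until_ready(fetch_status, timeout_s=1800, interval_s=60,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until geni_status is 'ready' or 'failed', or time runs out.

    Returns 'ready', 'failed', or 'timeout'.
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_status().get("geni_status")
        if state in ("ready", "failed"):
            return state
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)


# Usage with a fake status source standing in for the real aggregate call.
fake_states = iter(["configuring", "configuring", "ready"])
result = wait_until_ready(lambda: {"geni_status": next(fake_states)},
                          sleep=lambda seconds: None)
print(result)  # ready
```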

comment:15 Changed 6 years ago by vjo@duke.edu

Luisa,

Yes, this sounds right. On the first run, omni returned the information from the SM that a resource could not be reserved from the AM; I suspect that the problem was my reservation of both bare metal nodes.

On the subsequent run, 900 seconds later, your slice was reported as "empty of reservations" (in truth, does not exist) - because the SM was unable to reserve one of the slivers (the bare metal node), and therefore did not proceed any further with the slice.

Best, Victor

comment:16 in reply to:  15 Changed 6 years ago by lnevers@bbn.com

Replying to vjo@duke.edu:

On the first run, omni returned the information from the SM that a resource could not be reserved from the AM; I suspect that the problem was my reservation of both bare metal nodes.

Normally I check the overall createsliver result which was successful because the ExoSM returned a sliver manifest rather than an error result:

  Result Summary: Got Reserved resources RSpec from geni-renci-org-11443-orca 

The content of the manifest includes a "Last ticket update: Insufficient resources" message, but shouldn't there be an overall failure indication from the Aggregate?
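
Until the aggregate reports an overall failure, one client-side workaround is to scan the manifest for per-sliver failures instead of relying on the Result Summary line. The sketch below assumes the raw manifest XML is available without the "INFO:omni:" log prefixes; the namespace URIs are the ones in the manifest above, while the function name `failed_slivers` and the trimmed sample are hypothetical.

```python
import xml.etree.ElementTree as ET

RSPEC_NS = "http://www.geni.net/resources/rspec/3"
SLIVER_INFO_NS = ("http://groups.geni.net/exogeni/attachment/wiki/"
                  "RspecExtensions/sliver-info/1")


def failed_slivers(manifest_xml):
    """Return (client_id, error) pairs for nodes and links whose sliver state is 'failed'."""
    root = ET.fromstring(manifest_xml)
    wanted = {f"{{{RSPEC_NS}}}node", f"{{{RSPEC_NS}}}link"}
    failed = []
    for elem in root.iter():
        if elem.tag not in wanted:
            continue
        info = elem.find(f"{{{SLIVER_INFO_NS}}}geni_sliver_info")
        if info is not None and info.get("state") == "failed":
            failed.append((elem.get("client_id"), info.get("error", "")))
    return failed


# Trimmed, hypothetical manifest with the same shape as the one in comment 14.
sample_manifest = """
<rspec type="manifest" xmlns="http://www.geni.net/resources/rspec/3"
       xmlns:ns4="http://groups.geni.net/exogeni/attachment/wiki/RspecExtensions/sliver-info/1">
  <node client_id="BM-1">
    <ns4:geni_sliver_info state="failed" error="Insufficient resources for specified start time"/>
  </node>
  <node client_id="VM-1">
    <ns4:geni_sliver_info state="configuring"/>
  </node>
  <link client_id="lan0">
    <ns4:geni_sliver_info state="configuring"/>
  </link>
</rspec>
""".strip()

print(failed_slivers(sample_manifest))
```

A non-empty result would mean the createsliver should be treated as failed even though a manifest was returned.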

comment:17 Changed 6 years ago by vjo@duke.edu

Luisa,

That's probably a question better addressed by Ilya.

Best, Victor

comment:18 in reply to:  17 ; Changed 6 years ago by lnevers@bbn.com

Replying to vjo@duke.edu:

That's probably a question better addressed by Ilya.

Probably!

Ilya, should the ExoSM confirm that the Bare metal node is available before reporting a manifest Rspec that shows it being configured?

comment:19 in reply to:  18 Changed 6 years ago by lnevers@bbn.com

Replying to lnevers@bbn.com:

Ilya, should the ExoSM confirm that the Bare metal node is available before reporting a manifest Rspec that shows it being configured?

The issue with ticket update failure not being handled is being tracked in a separate ticket (http://groups.geni.net/exogeni/ticket/190).

comment:20 Changed 6 years ago by lnevers@bbn.com

Reproduced the "nodes ready but no connectivity" scenario. The slice is EG-CT-2b; I will be available on chat to continue.

comment:21 Changed 6 years ago by lnevers@bbn.com

After Victor applied a fix for a race condition, I created the sliver 4 times with 100% success. This issue is resolved. Closing ticket.

comment:22 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: changed from reopened to closed

Oops, forgot to close this one.
