Opened 10 years ago

Closed 9 years ago

#89 closed (fixed)

Click router sliver fails at IG Utah 5 out of 6 attempts

Reported by: lnevers@bbn.com
Owned by: somebody
Priority: major
Milestone:
Component: Experiment
Version: SPIRAL5
Keywords:
Cc:
Dependencies:

Description

While trying to set up a click router scenario with 3 nodes in the GPO rack and 3 nodes in the Utah rack, 5 out of 6 create sliver attempts left all nodes in the Utah rack with a geni_status of "notready".

The last attempt was at 8:13 this morning (Eastern time); the sliver name is IG-EXP-7. These are the nodes assigned in the Utah sliver:

  • pc5.utah.geniracks.net 31548
  • pc5.utah.geniracks.net 31546
  • pc5.utah.geniracks.net 31547

Attempts at the GPO rack have all been successful.

Attaching the RSpec for the topology as well as a diagram that shows the topology of the sliver.

Attachments (2)

IG-EXP-7.jpg (81.9 KB) - added by lnevers@bbn.com 10 years ago.
IG-EXP-7.rspec (6.0 KB) - added by lnevers@bbn.com 10 years ago.


Change History (5)

Changed 10 years ago by lnevers@bbn.com

Attachment: IG-EXP-7.jpg added

Changed 10 years ago by lnevers@bbn.com

Attachment: IG-EXP-7.rspec added

comment:1 Changed 9 years ago by lnevers@bbn.com

Capturing resolution information which was exchanged outside the ticket:

On 2/7/13 10:58 AM, Jonathon Duerig wrote:

At last! I think I've finally figured this out.

This has several components:

(1) Some Cisco device on the network is using IP addresses in the 172.17.1.* space. It is responding to arppings sent out by the vnodes. We are using these addresses as unroutable control node IPs. So when we happen to pick one that is in use, the network fails to set up properly because it detects that the address is already in use.

(2) The rc scripts running in the container and the network setup outside of the container are run at the same time. So sometimes one will finish first and sometimes the other one will.

(3) If the outside finishes first, a bridge is established by the time the network sends an arpping to make sure it isn't using a duplicate IP. This will cause network setup to fail.

(4) If the inside finishes first, the IP is established because arppings will not make it outside until the bridge is set up and therefore things work ok.
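
The ordering dependence in (2) through (4) can be illustrated with a small model. This is only a sketch of the reasoning above, not the code running on the rack: the function names, the 50% chance that the outside setup finishes first, and the fixed random seed are all hypothetical choices made for illustration.

```python
import random


def network_setup_succeeds(outside_finishes_first: bool, address_in_use: bool) -> bool:
    """Model one vnode boot as described in points (1)-(4).

    The duplicate-address probe (arpping) only reaches the upstream device
    if the bridge already exists, i.e. if the outside setup finished first.
    """
    probe_reaches_upstream = outside_finishes_first
    duplicate_detected = probe_reaches_upstream and address_in_use
    return not duplicate_detected


def simulate(attempts: int, address_in_use: bool,
             p_outside_first: float = 0.5, seed: int = 0) -> int:
    """Return how many of `attempts` boots fail under the model.

    p_outside_first is an assumed probability, not a measured one.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(attempts):
        outside_first = rng.random() < p_outside_first
        if not network_setup_succeeds(outside_first, address_in_use):
            failures += 1
    return failures


if __name__ == "__main__":
    # Failures appear only when the chosen address is already claimed upstream.
    print("address free:  ", simulate(6, address_in_use=False), "failures")
    print("address in use:", simulate(6, address_in_use=True), "failures")
```

In this model a boot can fail only when both conditions hold at once: the chosen control IP is already claimed on the network, and the bridge is up before the probe is sent. That matches the intermittent nature of the failures reported in this ticket.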

I've installed a hotfix for this so I can test this out on pc1. I'll be running a sliver many times today to make sure that this is in fact a fix.

In the meantime, perhaps you could investigate why the upstream device is responding to arppings and claiming to have our unroutable IPs.
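
One way to start that investigation is to look for neighbor-table entries that fall inside the control address space. The sketch below rests on several assumptions: that "172.17.1.*" means the 172.17.1.0/24 network, that the input is text in the usual layout of Linux `ip neigh show` output (`ADDRESS dev IFACE lladdr MAC STATE`), and that the sample addresses shown are made up. The function name is hypothetical.

```python
import ipaddress

# Assumption: "172.17.1.*" in the ticket corresponds to this /24.
CONTROL_NET = ipaddress.ip_network("172.17.1.0/24")


def responders_in_control_net(neigh_output: str) -> list:
    """Return (ip, mac) pairs for neighbor entries inside CONTROL_NET.

    Lines without a resolved link-layer address (no "lladdr" field)
    are skipped, as are lines that do not start with an IP address.
    """
    found = []
    for line in neigh_output.splitlines():
        fields = line.split()
        if "lladdr" not in fields:
            continue
        mac_index = fields.index("lladdr") + 1
        if mac_index >= len(fields):
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        if addr in CONTROL_NET:
            found.append((str(addr), fields[mac_index]))
    return found


if __name__ == "__main__":
    # Made-up sample text for illustration only.
    sample = (
        "172.17.1.5 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE\n"
        "10.0.0.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff STALE\n"
        "172.17.1.9 dev eth0 FAILED\n"
    )
    for ip, mac in responders_in_control_net(sample):
        print(ip, mac)
```

The MAC addresses reported this way can then be compared against known vendor prefixes to identify which device is answering.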

comment:2 Changed 9 years ago by lnevers@bbn.com

I have another sliver which has a node stuck in geni_status "changing". The sliver was created yesterday on the GPO rack and it is named IG-CT-1. The host is:

pc1.instageni.gpolab.bbn.com port 30779

Not sure if this is related to this ticket, but symptoms are similar.

comment:3 Changed 9 years ago by lnevers@bbn.com

Resolution: fixed
Status: new → closed

The experiment described in the original ticket description has been created several times without any of the nodes failing with a "notready" status. This issue is deemed solved; closing ticket.
