Opened 4 years ago

Last modified 4 years ago

#1357 accepted

SCS does not suggest the same VLAN tag for cross-connect hops that connect AL2S to ION

Reported by: lnevers@bbn.com
Owned by: xyang@maxgigapop.net
Priority: major
Milestone:
Component: STITCHING
Version: SPIRAL7
Keywords: GENI Network Stitching
Cc:
Dependencies:

Description

The SCS does not suggest the same VLAN tag for hops that connect AL2S to ION. This problem was first found by Tim, and is reproducible (2 out of 5 requests show this problem).

The following email from Aaron describes the problem.

On 12/10/14 12:58 PM, Aaron Helsinger wrote:

I believe there is a bug in the SCS, in the way it calculates dependencies when a circuit has an AL2S/ION connection.

First, I'm assuming that if 2 routers both support VLAN translation and are connected to each other, they must still agree on the VLAN tag used for the connection between them.

More concretely:

When an AL2S hop connects to an ION hop both must use the same VLAN tag.

Therefore, one of those hops should import VLAN tags from the other. One should depend on the other.

But that is not what the SCS does. The SCS makes each of those hops independent of the other. As a result, the SCS sometimes picks different VLAN tags for those 2 hops. Even if it happens to pick matching VLAN tags, when stitcher has to retry a reservation at one of those AMs because that VLAN tag failed, stitcher does not know that the other AM needs to change too. As a result, the 2 hops end up with different VLAN tags, and traffic cannot flow on the circuit.
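To make the dependency argument concrete, here is a minimal sketch built on a hypothetical, simplified hop model (the `Hop` class, its `imports_from` field, and `retry_with_new_tag` are invented for illustration and are not the SCS or stitcher data structures). The starting tag 3995 and the retry tag 3960 are also hypothetical. With independent hops, a retry at one AM leaves the facing hop on its old tag; when the ION hop imports from the AL2S hop, the retry carries over.

```python
from dataclasses import dataclass
from typing import Optional

# Full hop URNs taken from the log in this ticket.
AL2S_WASH = "urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:*"
ION_WASH = "urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s"


@dataclass
class Hop:
    """Hypothetical, simplified stand-in for a stitching hop (not the SCS model)."""
    urn: str
    vlan: int
    imports_from: Optional["Hop"] = None  # hop whose tag this hop must reuse

    def effective_vlan(self) -> int:
        # A dependent hop always uses the tag of the hop it imports from.
        if self.imports_from is not None:
            return self.imports_from.effective_vlan()
        return self.vlan


def retry_with_new_tag(hop: Hop, new_vlan: int) -> None:
    """Simulate stitcher retrying one AM with a different tag (hypothetical helper)."""
    hop.vlan = new_vlan


# Case 1: independent hops, as in the reported workflow. Both happen to start
# on the same (hypothetical) tag 3995; a retry at AL2S moves only AL2S.
al2s = Hop(AL2S_WASH, 3995)
ion = Hop(ION_WASH, 3995)
retry_with_new_tag(al2s, 3960)
print(al2s.effective_vlan(), ion.effective_vlan())  # 3960 3995 -> mismatch

# Case 2: the ION hop imports from the AL2S hop, so the retry propagates.
al2s = Hop(AL2S_WASH, 3995)
ion = Hop(ION_WASH, 3995, imports_from=al2s)
retry_with_new_tag(al2s, 3960)
print(al2s.effective_vlan(), ion.effective_vlan())  # 3960 3960 -> match
```

Which side should be the importer is the open question raised at the end of the email; the sketch simply picks ION.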

For example:

Luisa requested a single circuit from IG GPO via AL2S to IG Missouri.

12/10 11:07:30 DEBUG    stitchhandler.py:1405 Calling SCS with options 
{'geni_workflow_paths_merged': True, 'geni_routing_profile': {'Olink':
{'hop_exclusion_list': ['urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:xe-0/3/0:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.chic:et-10/0/0:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.clev:et-5/0/0:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.hous:xe-0/1/3:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.kans:xe-0/0/3:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.losa:et-10/0/0:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.salt:xe-0/1/1:al2s', 
'urn:publicid:IDN+ion.internet2.edu+interface+rtr.seat:et-5/0/0:al2s']}}}

The SCS reply shows the circuit in 2 separate segments: ION to GPO IG, and AL2S to Missouri IG.

12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.gpo-ig suggested VLAN 3721, avail: '3706-3732,3746-3749'
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:gpo-ig suggested VLAN 3721, avail: '3706-3732,3746-3749'
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s suggested VLAN 3930, avail: '3900-4000'
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:* suggested VLAN 3995, avail: '3950-4000'

The last 2 lines above (the ION rtr.wash hop and the AL2S sdn-sw.wash hop) indicate the bug. These 2 hops face each other, but the SCS picked a different VLAN tag for each.
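Spotting this by eye in long logs is error-prone. Below is a minimal sketch that pulls the suggested VLAN and available range for each hop out of the stitchhandler.py debug lines quoted above and compares the two facing hops. The regular expression is written against the log format shown in this ticket only and is an assumption about how those lines are always formatted.

```python
import re

# Matches the "SCS gave hop ... suggested VLAN N, avail: '...'" debug lines above.
HOP_LINE = re.compile(
    r"SCS gave hop (?P<urn>\S+) suggested VLAN (?P<vlan>\d+), avail: '(?P<avail>[^']*)'"
)

LOG = """\
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s suggested VLAN 3930, avail: '3900-4000'
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:* suggested VLAN 3995, avail: '3950-4000'
"""


def suggested_vlans(log_text):
    """Map each hop URN to (suggested VLAN, available range string)."""
    return {
        m["urn"]: (int(m["vlan"]), m["avail"])
        for m in HOP_LINE.finditer(log_text)
    }


hops = suggested_vlans(LOG)
ion_hop = "urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s"
al2s_hop = "urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:*"
if hops[ion_hop][0] != hops[al2s_hop][0]:
    print("facing hops disagree:", hops[ion_hop][0], "vs", hops[al2s_hop][0])
```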

Tim: I also see that the available ranges do not agree ('3900-4000' for ION vs '3950-4000' for AL2S). This is a little surprising.
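The mismatch in ranges matters because a tag that both facing hops can use has to come from the intersection of their available ranges. A minimal sketch of that intersection, using the range notation from the log (the helper names `parse_ranges` and `format_ranges` are hypothetical, not SCS or stitcher functions):

```python
def parse_ranges(spec):
    """Turn a range string such as '3706-3732,3746-3749' into a set of tags."""
    tags = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like '3721' has no '-', so hi is '' and lo is reused.
        tags.update(range(int(lo), int(hi or lo) + 1))
    return tags


def format_ranges(tags):
    """Collapse a set of tags back into the same 'a-b,c-d' notation."""
    runs = []
    for tag in sorted(tags):
        if runs and tag == runs[-1][1] + 1:
            runs[-1][1] = tag  # extend the current contiguous run
        else:
            runs.append([tag, tag])  # start a new run
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in runs)


ion_avail = parse_ranges("3900-4000")
al2s_avail = parse_ranges("3950-4000")
print(format_ranges(ion_avail & al2s_avail))  # 3950-4000
```

Here the AL2S range is a subset of the ION range, so any tag the SCS suggests for the pair would need to fall in 3950-4000; the 3930 suggested for the ION hop does not.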

12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.colu4.net.internet2.edu:eth1/2:missouri-ig suggested VLAN 1165, avail: '1161-1165'
12/10 11:07:31 DEBUG    stitchhandler.py:1649 SCS gave hop urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19.al2s suggested VLAN 1165, avail: '1161-1165'

The bug is also visible in the workflow/dependencies returned by the SCS:

12/10 11:07:31 DEBUG    stitchhandler.py:1664 SCS workflow:
{ '##all_paths_merged##': { 'dependencies': [ { 'aggregate_url': 'http://geni-am.net.internet2.edu:12346',
'aggregate_urn': 'urn:publicid:IDN+ion.internet2.edu+authority+am',
'dependencies': [ { 'aggregate_url': 'https://boss.instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am',
'aggregate_urn': 'urn:publicid:IDN+instageni.gpolab.bbn.com+authority+cm',
'hop_urn': 'urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.gpo-ig',
'import_vlans': False}],
'hop_urn': 'urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:gpo-ig',
'import_vlans': True},
{ 'aggregate_url': 'http://geni-am.net.internet2.edu:12346',
'aggregate_urn': 'urn:publicid:IDN+ion.internet2.edu+authority+am',
 'hop_urn': 'urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s',
'import_vlans': False},
{ 'aggregate_url': 'http://foam-oess-stage.grnoc.iu.edu:3626/foam/gapi/2',
'aggregate_urn': 'urn:publicid:IDN+al2s.internet2.edu+authority+am',
'hop_urn': 'urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:*',
'import_vlans': False},

Above we see that the ION hop facing AL2S does not import VLANs from anyone. This tells stitcher that it is free to change the VLAN tag used here independently of the tag used by any other hop whatsoever. Similarly, the AL2S hop facing ION has no dependencies and picks its own VLAN tag. This seems wrong.

{ 'aggregate_url': 'http://foam-oess-stage.grnoc.iu.edu:3626/foam/gapi/2',
  'aggregate_urn': 'urn:publicid:IDN+al2s.internet2.edu+authority+am',
  'dependencies': [ { 'aggregate_url': 'https://www.instageni.rnet.missouri.edu:12369/protogeni/xmlrpc/am',
'aggregate_urn': 'urn:publicid:IDN+instageni.rnet.missouri.edu+authority+cm',
'hop_urn': 'urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19.al2s',                                                                   
'import_vlans': False}],
'hop_urn': 'urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.colu4.net.internet2.edu:eth1/2:missouri-ig',
'import_vlans': True}]},
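The pattern Aaron points out can be checked mechanically: a top-level hop that neither imports VLANs nor is imported by any other hop is unconstrained. Below is a minimal sketch over a trimmed copy of the workflow entries shown above (only `hop_urn`, `import_vlans`, and `dependencies` are kept, and the dump in the ticket is itself a fragment). The helper names are hypothetical, and the "unconstrained" rule is an illustrative reading of the workflow format, not a documented SCS check.

```python
WORKFLOW = {
    "##all_paths_merged##": {
        "dependencies": [
            {
                "hop_urn": "urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:gpo-ig",
                "import_vlans": True,
                "dependencies": [
                    {"hop_urn": "urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.gpo-ig",
                     "import_vlans": False},
                ],
            },
            {"hop_urn": "urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s",
             "import_vlans": False},
            {"hop_urn": "urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:*",
             "import_vlans": False},
            {
                "hop_urn": "urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.colu4.net.internet2.edu:eth1/2:missouri-ig",
                "import_vlans": True,
                "dependencies": [
                    {"hop_urn": "urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19.al2s",
                     "import_vlans": False},
                ],
            },
        ]
    }
}


def depended_on(entries):
    """Collect every hop URN that some other hop depends on (recursively)."""
    urns = set()
    for entry in entries:
        deps = entry.get("dependencies", [])
        urns.update(dep["hop_urn"] for dep in deps)
        urns |= depended_on(deps)
    return urns


def unconstrained_hops(workflow):
    """Top-level hops that neither import VLANs nor are imported by any hop."""
    top = workflow["##all_paths_merged##"]["dependencies"]
    used = depended_on(top)
    return [
        e["hop_urn"]
        for e in top
        if not e.get("import_vlans")
        and not e.get("dependencies")
        and e["hop_urn"] not in used
    ]


for urn in unconstrained_hops(WORKFLOW):
    print(urn)
```

On this trimmed data it reports exactly the two facing wash hops, the ION rtr.wash hop and the AL2S sdn-sw.wash hop.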

Now, there remains the question of which heuristics we should use to decide which AM's hop imports VLANs from which other. Presumably we would use some rule based on the size of the available VLAN tag range, the number of hops being requested at that AM, or whether other AMs depend on that AM?

I leave that to you.
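As one possible answer, here is a minimal sketch of only the first criterion Aaron lists: the hop with the narrower available VLAN range picks the tag, and the facing hop imports it, with URN order as a deterministic tie-break. This rule and the helpers `range_size` and `choose_importer` are hypothetical, chosen on the reasoning that the more constrained side has fewer acceptable tags; it is not what the SCS does.

```python
def range_size(spec):
    """Count the tags in a range string such as '3706-3732,3746-3749'."""
    total = 0
    for part in filter(None, (p.strip() for p in spec.split(","))):
        lo, _, hi = part.partition("-")
        total += int(hi or lo) - int(lo) + 1
    return total


def choose_importer(hop_a, hop_b):
    """Return (producer, importer) for two facing hops.

    Hypothetical rule: narrower available range produces the tag; ties are
    broken by URN so the result does not depend on argument order.
    """
    producer, importer = sorted(
        [hop_a, hop_b], key=lambda hop: (range_size(hop["avail"]), hop["urn"])
    )
    return producer, importer


ion = {"urn": "urn:publicid:IDN+ion.internet2.edu+interface+rtr.wash:et-9/0/0:al2s",
       "avail": "3900-4000"}
al2s = {"urn": "urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.wash.net.internet2.edu:eth5/2:*",
        "avail": "3950-4000"}
producer, importer = choose_importer(ion, al2s)
print("import_vlans on", importer["urn"], "from", producer["urn"])
```

With the ranges from the log (101 tags at ION, 51 at AL2S), the ION hop would import from the AL2S hop. The other two criteria (number of hops at the AM, whether other AMs depend on it) could be added as further sort keys.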

Attachments (2)

IG-ST-2.rspec (2.0 KB) - added by lnevers@bbn.com 4 years ago.
IG-ST-2-SCSReply.rspec (10.4 KB) - added by xyang@maxgigapop.net 4 years ago.


Change History (9)

comment:1 Changed 4 years ago by xyang@maxgigapop.net

Status: changed from new to accepted

comment:2 Changed 4 years ago by lnevers@bbn.com

This problem can be seen on every sliver attempt when using a 2-node topology with 2 links between the two endpoints, such as the topology in the attached RSpec. Note that the excludehop option for ION is used for the Missouri site.

On every createsliver attempt, one of the two links gets mismatched suggested VLANs.

Changed 4 years ago by lnevers@bbn.com

Attachment: IG-ST-2.rspec added

Changed 4 years ago by xyang@maxgigapop.net

Attachment: IG-ST-2-SCSReply.rspec added

comment:3 Changed 4 years ago by xyang@maxgigapop.net

Tried a few times with the SCS on nutshell.maxgigapop.net but failed to reproduce the problem.

Attached (IG-ST-2-SCSReply.rspec) is a typical result RSpec returned by the SCS.

comment:4 Changed 4 years ago by lnevers@bbn.com

Tim and I cleaned up several orphaned VLANs over the cross-connects. That may have played a role?

Re-running tests today with the 2-link topology that failed consistently before the cleanup. Will update with results when completed.

comment:5 in reply to: 4; Changed 4 years ago by xyang@maxgigapop.net

Replying to lnevers@…:

Tim and I cleaned up several orphaned VLANs over the cross-connects. That may have played a role?

This is SCS computation and should have nothing to do with the actual provisioned VLANs.

comment:6 in reply to: 5; Changed 4 years ago by lnevers@bbn.com

Replying to xyang@…:

Replying to lnevers@…:

Tim and I cleaned up several orphaned VLANs over the cross-connects. That may have played a role?

This is SCS computation and should have nothing to do with the actual provisioned VLANs.

I was able to see the problem on 5 attempts before we cleaned all VLANs. I have tried about 20 times to recreate this topology since last night, but most attempts (15) have failed with ION circuit failures and only 5 have been successfully created. For the 5 successful slivers, both links could exchange traffic.

Just noting what is taking place. Sorry, I did not make clear that this has nothing to do with how the SCS allocates VLANs.

comment:7 Changed 4 years ago by lnevers@bbn.com

I have run into another instance of mismatched VLANs on a cross-connect.

Slice urn:publicid:IDN+ch.geni.net:ln-test+slice+IG-ST-3c includes:

GPO IG <-ION-AL2S-> Missouri IG<-AL2S-ION-> IG Utah

The GPO to Missouri link has the mismatched VLANs:

urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.gpo-ig                     -> VLAN 3749 Hop  "1"
urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:gpo-ig                         -> VLAN 3749 Hop  "2"
urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:al2s                           -> VLAN 3991 Hop  "3"
urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth3/2:*     -> VLAN 3964 Hop  "4"
urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.colu4.net.internet2.edu:eth1/2:missouri-ig -> VLAN 1165 Hop  "5"
urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19.al2s                    -> VLAN 1165 Hop  "6"
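The hop list above can be checked with a simple rule: where two consecutive hops belong to different aggregates (a cross-connect), both must carry the same VLAN tag. Consecutive hops inside one aggregate are assumed here to be allowed to differ, since ION and AL2S can translate tags internally. The sketch below takes the aggregate from the second `+`-separated field of each hop URN; `authority` and `cross_connect_mismatches` are hypothetical helpers, and hop numbers are 1-based to match the list above.

```python
PATH = [
    ("urn:publicid:IDN+instageni.gpolab.bbn.com+interface+procurve2:5.24.gpo-ig", 3749),
    ("urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:gpo-ig", 3749),
    ("urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:et-5/0/0:al2s", 3991),
    ("urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.newy32aoa.net.internet2.edu:eth3/2:*", 3964),
    ("urn:publicid:IDN+al2s.internet2.edu+interface+sdn-sw.colu4.net.internet2.edu:eth1/2:missouri-ig", 1165),
    ("urn:publicid:IDN+instageni.rnet.missouri.edu+interface+procurve2:1.19.al2s", 1165),
]


def authority(urn):
    """'urn:publicid:IDN+ion.internet2.edu+interface+...' -> 'ion.internet2.edu'."""
    return urn.split("+")[1]


def cross_connect_mismatches(path):
    """Return (hop, next hop, tag, next tag) for facing hops whose tags differ."""
    mismatches = []
    for i in range(len(path) - 1):
        (urn_a, vlan_a), (urn_b, vlan_b) = path[i], path[i + 1]
        # Assumption: only a hand-off between two different aggregates must
        # use the same tag on both sides.
        if authority(urn_a) != authority(urn_b) and vlan_a != vlan_b:
            mismatches.append((i + 1, i + 2, vlan_a, vlan_b))
    return mismatches


print(cross_connect_mismatches(PATH))  # [(3, 4, 3991, 3964)]
```

This flags only the ION-to-AL2S cross-connect (hops 3 and 4); the GPO IG/ION and AL2S/Missouri IG hand-offs agree.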

Working on reproducing the problem.
