Opened 9 years ago

Closed 9 years ago

#1353 closed (fixed)

Unknown path reported for complex 9 site topology

Reported by: lnevers@bbn.com
Owned by: xyang@maxgigapop.net
Priority: major
Milestone:
Component: STITCHING
Version: SPIRAL7
Keywords: GENI Network Stitching
Cc:
Dependencies:

Description

A 9 node topology that uses both AL2S and ION sites reports a "Cannot find the set of paths" error even though all sites are known.

A diagram is attached for the topology along with the RSpec used.

According to Xi, the link between Illinois and Chicago (ill-chic-4) caused the failure. The problem is a known issue that is fixed in the 2.0 branch, which is running in the production and development SCS.

After switching to an SCS with the fix, there is still a problem with the link between Chicago and Rutgers (chic-rut-7). Xi is investigating this problem.

Attachments (5)

topology-9sites.jpg (51.3 KB) - added by lnevers@bbn.com 9 years ago.
stitch-9sites.rspec (5.1 KB) - added by lnevers@bbn.com 9 years ago.
9sites-createsliver-request-11-geni-uchicago-edu.xml (36.7 KB) - added by lnevers@bbn.com 9 years ago.
stitch-6sites.rspec (3.5 KB) - added by lnevers@bbn.com 9 years ago.
stitcher-log-6star.txt (443.8 KB) - added by lnevers@bbn.com 9 years ago.


Change History (26)

Changed 9 years ago by lnevers@bbn.com

Attachment: topology-9sites.jpg added

Changed 9 years ago by lnevers@bbn.com

Attachment: stitch-9sites.rspec added

comment:1 Changed 9 years ago by lnevers@bbn.com

After removing the Illinois and Chicago (ill-chic-4) link and the Chicago and Rutgers (chic-rut-7) link, there is still an error reported by Chicago IG:

11:25:06 INFO    : Stitcher doing createsliver at <Aggregate chicago-ig>...
 11:25:23 ERROR   :  {'output': 'pks-chic-3: no edge hop', 'code':
{'protogeni_error_log':
 'urn:publicid:IDN+geni.uchicago.edu+log+4af3e531e479beaffa79e5c753787c07',
'am_type': 'protogeni',
 'geni_code': 2, 'am_code': 2, 'protogeni_error_url':
 'https://www.geni.uchicago.edu/spewlogfile.php3?logfile=4af3e531e479beaffa79e5c753787c07'},
'value': 0}

The Manifest RSpec for this stitched link (attached) shows only the AL2S part of the path and is missing ION.
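For anyone triaging similar failures, the error struct printed by stitcher above can be unpacked with a few lines of Python. This is only a sketch: `summarize_am_error` is a hypothetical helper (not part of omni or stitcher), and it assumes the `<link>: <reason>` layout of `output` seen in this one message.

```python
import ast

# Error struct copied from the stitcher output above.
raw = """{'output': 'pks-chic-3: no edge hop', 'code':
{'protogeni_error_log':
 'urn:publicid:IDN+geni.uchicago.edu+log+4af3e531e479beaffa79e5c753787c07',
'am_type': 'protogeni',
 'geni_code': 2, 'am_code': 2, 'protogeni_error_url':
 'https://www.geni.uchicago.edu/spewlogfile.php3?logfile=4af3e531e479beaffa79e5c753787c07'},
'value': 0}"""


def summarize_am_error(text):
    """Hypothetical helper: reduce the printed return struct to a short summary."""
    result = ast.literal_eval(text)  # safe parse of the Python-literal dict
    code = result["code"]
    # Assumption: the failing link name precedes the first colon in 'output'.
    link, _, reason = result["output"].partition(":")
    return {
        "link": link.strip(),
        "reason": reason.strip(),
        "geni_code": code["geni_code"],
        "am_type": code["am_type"],
        "log_url": code.get("protogeni_error_url"),
    }


summary = summarize_am_error(raw)
print(summary["link"], "->", summary["reason"])
print("geni_code:", summary["geni_code"], "am_type:", summary["am_type"])
```

The `log_url` field is the quickest pointer to hand to the aggregate operators when reporting the failure.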

Changed 9 years ago by lnevers@bbn.com

Attachment: 9sites-createsliver-request-11-geni-uchicago-edu.xml added

comment:2 Changed 9 years ago by lnevers@bbn.com

I did some further testing and simplified the topology:

  1. Removed all endpoints not directly connected to Chicago.
  2. Kept removing failing links (stan-chic-6, pks-chic-3, ill-chic-4, gpo-chic-5) until I reached a working topology.
  3. I was eventually able to create:
       <gpo-eg><-ION-AL2S-><chicago-ig> <-AL2S-ION-><ukypks2-ig>
    

I then ran tests with some of the links that had failed, using these topologies:

  1. UKYPKS2<-ION-AL2S->Chicago
  2. Illinois<-ION-AL2S->Chicago
  3. UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois

Topologies 1 and 2 both worked.

Topology 3 shows the "no edge hop" failure. The link order also matters: the second link in the request RSpec always gets the "no edge hop" failure.

comment:3 Changed 9 years ago by lnevers@bbn.com

A fix was applied to the Test SCS; re-running the initial 9 site test topology.

comment:4 in reply to:  3 Changed 9 years ago by lnevers@bbn.com

Replying to lnevers@…:

A fix was applied to the Test SCS; re-running the initial 9 site test topology.

According to Xi:

Removed the logic that guarantees mutual exclusion of resources (bandwidth and VLANs) between multiple paths.

So there could be a small risk that you double allocate VLAN on some link(s). However the chance is small due to randomization.
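To put a rough number on that risk, here is a birthday-style estimate. It is a sketch under a loud assumption: each path sharing a link picks its VLAN tag uniformly and independently from the same pool. The ticket does not describe how the SCS actually randomizes, so the figures are illustrative only, and the pool sizes are made-up examples.

```python
from fractions import Fraction


def collision_probability(n_tags, n_paths):
    """Chance that at least two of n_paths pick the same tag out of n_tags.

    Assumes uniform, independent choices (an assumption, not SCS behavior).
    """
    if n_paths > n_tags:
        return Fraction(1)  # pigeonhole: a repeat is unavoidable
    p_all_distinct = Fraction(1)
    for i in range(n_paths):
        p_all_distinct *= Fraction(n_tags - i, n_tags)
    return 1 - p_all_distinct


# Illustrative pool sizes for two paths sharing one link.
for n_tags in (20, 100, 1000):
    p = collision_probability(n_tags, 2)
    print(f"{n_tags} tags, 2 paths: {float(p):.3%}")
```

Under this model the risk for two paths is simply 1/n, so it shrinks quickly as the available VLAN range grows, but it is not negligible on links with only a few tags.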

comment:5 Changed 9 years ago by lnevers@bbn.com

Have been trying to verify the fix, but ION keeps failing with a "requested VLAN was unavailable" error.

Also trying to run the simpler three node topology that also showed the problem (UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois), but again ION has been failing.

Will keep trying both topologies, but not at the same time.

comment:6 Changed 9 years ago by lnevers@bbn.com

Still trying to verify the fix. The 9 node topology failed to allocate 2 circuits at ION (to tamu and to illinois-ig).

Fell back to the 3 node linear topology to focus on the fix and bypass the ION circuit issue, but the stitcher has tried 8 times and the sliver fails each time at Illinois. After looking at the stitcher log, it seems that all 20 Illinois VLANs are in use. Asked Xi to verify, since I cannot check with "listresources --available".

comment:7 Changed 9 years ago by lnevers@bbn.com

Still trying to verify fix by running the simpler 3 nodes linear sliver. All attempts yesterday and this morning are failing with:

VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.chic:et-10/0/0:illinois-ig

Even though only 3 VLANs are in use as verified with the ION router proxy.

comment:8 Changed 9 years ago by lnevers@bbn.com

While waiting for the issue with the Illinois VLANs availability to be resolved, modified original RSpec to remove Illinois so I could verify the SCS fix.

All attempts to re-run the topology, even with fewer nodes (8, 7, 6), failed with a known issue:

http://groups.geni.net/geni/ticket/1346 (MAX and ION aggregates do not deletesliver even though success is returned for deletesliver)

Xi will be looking at this tonight.

comment:9 Changed 9 years ago by lnevers@bbn.com

I re-ran a simpler version of the 9 node topology where only 6 nodes were left in the RSpec (star topology with Chicago as the center node). The problem happened on the first try:

  08:40:18 INFO    : DCN AM <Aggregate ion>: must wait for status ready....
  08:40:18 INFO    : Pausing 30 seconds to let circuit become ready...
  08:40:56 INFO    : Pausing 30 seconds to let circuit become ready...
  08:41:34 WARNING : sliverstatus: 135951 is (still) changing at <Aggregate ion>. Delete and retry.
  08:41:34 WARNING : sliverstatus: 135971 is (still) changing at <Aggregate ion>. Delete and retry.
  08:41:34 WARNING : sliverstatus: 135941 is (still) failed at <Aggregate ion>. Delete and retry.
  08:41:34 WARNING : sliverstatus: 135961 is (still) changing at <Aggregate ion>. Delete and retry.
  08:41:34 WARNING : sliverstatus: 135931 is (still) failed at <Aggregate ion>. Delete and retry.
  08:41:40 WARNING : <Aggregate ion> says requested VLAN was unavailable at
  <Hop u'urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:xe-0/1/3:ukypks2-ig' on path u'pks-ucd-1'>
  08:41:40 INFO    : Deleting some reservations to retry, avoiding failed VLAN...
  08:41:40 INFO    : Doing deletesliver at <Aggregate ukypks2-ig>...
  08:42:27 INFO    : Will put <Aggregate ion> back in the pool to allocate. Got: Retrying reservations
  at earlier AMs to avoid unavailable VLAN tag at <Aggregate ion>....
  08:42:27 INFO    : Pausing for 30 seconds for Aggregates to free up resources...
 
  08:42:57 INFO    : Stitcher doing createsliver at <Aggregate ukypks2-ig>...
  08:43:14 INFO    : ... Allocation at <Aggregate ukypks2-ig> complete.
  08:43:14 INFO    : Stitcher doing createsliver at <Aggregate ion>...
  08:43:22 ERROR   :  {'output': ': CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:ln-test+slice+6sites, ',
  'geni_api': 2, 'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''}
  08:43:22 WARNING : Stitching failed but will retry: Reservation request impossible at <Aggregate ion>. You already
  have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver:
  Existing record: urn:publicid:IDN+ch.geni.net:...

I am attaching the stitcher log plus the modified 6 node RSpec. The stitcher log shows at line 2063 (08:41:39) that the sliver was successfully deleted at ION and a failure is returned on the next attempt on line 2354 (08:43:22).
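Since the attached log is several thousand lines long, a small filter helps find the relevant entries. The sketch below is an illustration only: the regular expression is inferred from the `HH:MM:SS LEVEL : message` layout of the excerpt above, `parse_stitcher_log` is a hypothetical helper, and the sample lines are copied from that excerpt rather than read from the attachment.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above: "HH:MM:SS LEVEL : message".
LINE_RE = re.compile(r"^\s*(\d{2}:\d{2}:\d{2})\s+(INFO|WARNING|ERROR)\s*:\s*(.*)$")


def parse_stitcher_log(lines):
    """Return (time, level, message) tuples; wrapped lines join the previous entry."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            entries.append([m.group(1), m.group(2), m.group(3)])
        elif entries and line.strip():
            entries[-1][2] += " " + line.strip()
    return [tuple(e) for e in entries]


sample = """\
  08:41:34 WARNING : sliverstatus: 135941 is (still) failed at <Aggregate ion>. Delete and retry.
  08:41:40 INFO    : Doing deletesliver at <Aggregate ukypks2-ig>...
  08:43:14 INFO    : Stitcher doing createsliver at <Aggregate ion>...
  08:43:22 ERROR   :  {'output': ': CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:ln-test+slice+6sites, ',
  'geni_api': 2, 'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''}
""".splitlines()

entries = parse_stitcher_log(sample)
levels = Counter(level for _, level, _ in entries)
# An "Existing record" error on createsliver suggests the earlier delete did not take effect.
stale = [(t, msg) for t, level, msg in entries if level == "ERROR" and "Existing record" in msg]

print(dict(levels))
print("stale-record errors at:", [t for t, _ in stale])
```

To run it against the real attachment, replace `sample` with the lines of stitcher-log-6star.txt.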

Changed 9 years ago by lnevers@bbn.com

Attachment: stitch-6sites.rspec added

Changed 9 years ago by lnevers@bbn.com

Attachment: stitcher-log-6star.txt added

comment:10 in reply to:  2 Changed 9 years ago by lnevers@bbn.com

Capturing update to comment:2:

  1. UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois

The above topology continuously reported that Illinois had no VLANs available, when only 3 out of 20 were in use. Reported it to Chad, who had the following update:

On 11/6/14 10:23 AM, Chad Kotil wrote:

I think we're hitting a bug in ION Oscars. I'm able to reproduce the
issue locally.

It seems that if you specify vlan endpoints for a circuit with the same
interface (port), the Path Calculation Engine (PCE) throws an error 'no
vlans available', which is wrong.

I've even gone so far as letting oscars choose vlans, create the circuit
which goes ACTIVE. Then cancel the circuit, only to clone it again so
I'm using the same vlans. But it still fails with the exact same error.

I am contact Oscars support now to try and get resolution for this
issue. In the meantime I think a workaround would be to try and create a
circuit with different ports.
--Chad
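Until the OSCARS patch is in place, it may help to check up front whether a circuit's two endpoints land on the same port. The sketch below reads "same interface (port)" as "same router and same port name", which is one interpretation of Chad's note and not something the ticket confirms. `port_of` and `shares_port` are hypothetical helpers. The first URN is taken from the stitcher output in this ticket; the second is assembled by hand from the link name in the PCE error and may not match the real URN.

```python
def port_of(interface_urn):
    """Return (router, port) from an interface URN of the form seen in this ticket."""
    # The last '+'-separated field looks like 'rtr.atla:xe-0/1/3:ukypks2-ig'.
    router, port, _link = interface_urn.rsplit("+", 1)[-1].split(":")
    return router, port


def shares_port(src_urn, dst_urn):
    """True if both endpoints sit on the same router port (the suspected PCE trigger)."""
    return port_of(src_urn) == port_of(dst_urn)


# From the stitcher log in this ticket.
atla = "urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:xe-0/1/3:ukypks2-ig"
# Assumption: built from the link name in the PCE error message.
chic = "urn:publicid:IDN+ion.internet2.edu+interface+rtr.chic:et-10/0/0:illinois-ig"

print(port_of(atla))
print("same port:", shares_port(atla, chic))
```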

comment:11 Changed 9 years ago by lnevers@bbn.com

Update on the OSCARS problem with the Path Calculation Engine:

On 11/11/14 10:06 AM, Chad Kotil wrote:

The es.net Oscars team has patched the bug. There is a maintenance
window Thursday morning to upgrade ION Oscars at 7am EST.

The MAX/ION problem with deleting slivers (http://groups.geni.net/geni/ticket/1346) is solved at MAX; the fix still needs to be addressed at ION.

Need to try the original topology once the fix is installed at ION.

comment:12 in reply to:  11 Changed 9 years ago by lnevers@bbn.com

In addition to the updates described in comment:11, the production SCS needs to be updated with the two fixes required to get the initial 9 node topology in this ticket to work:

  • Initial fix addressed the "Cannot find the set of paths" error.
  • Second fix addressed incomplete paths. For links that included both AL2S and ION as part of one path, the ION part of the path was always missing.

comment:13 Changed 9 years ago by lnevers@bbn.com

Chad installed a fix for the OSCARS Path Calculation Engine problem that results in a 'no vlans available' error when a request specifies VLAN endpoints for a circuit with the same interface (port).

The fix does not seem to have solved the problem; reported the failure.

comment:14 Changed 9 years ago by lnevers@bbn.com

Cleaned up the following active OSCARS circuits, which were associated with slivers that did not exist:

  1. Slice "9sites", according to sliverstatus there is no sliver at ION:
    • ion.internet2.edu-138261
    • ion.internet2.edu-138061
    • ion.internet2.edu-137901
    • ion.internet2.edu-137961
    • ion.internet2.edu-137831
  2. Slice "iozcelik3", according to experimenter he has no slice:
    • ion.internet2.edu-138141
    • ion.internet2.edu-137991
  3. Slice "iozcelik", according to experimenter he has no slice:
    • ion.internet2.edu-137301
  4. Slice "IOTest", according to experimenter he has no slice:
    • ion.internet2.edu-137711

Once cleanup was completed, was able to create the initial topology. The OSCARS PCE fix is deemed validated.

Still need the SCS fixes to be installed at Production SCS, will close ticket when this occurs.

comment:15 Changed 9 years ago by lnevers@bbn.com

The fix for the ION deletesliver problem was installed on November 25th. Extensive testing has taken place and I have not been able to reproduce the error scenario. The fix is deemed successful; closing ticket.

comment:16 Changed 9 years ago by lnevers@bbn.com

Resolution: fixed
Status: new → closed

comment:17 Changed 9 years ago by lnevers@bbn.com

Resolution: fixed
Status: closed → reopened

The fix installed November 25 did not work. The one experimenter (İlker Özçelik) who also saw the problem with ION not deleting the sliver ran into it several times yesterday. I have been cleaning up his circuits to help him set up a disruptive experiment, which is scheduled for today.

I used his RSpec when I tested the fix, but had not seen the problem. I will talk to İlker to understand how he is getting to the failure. Re-opening the ticket.

comment:18 Changed 9 years ago by lnevers@bbn.com

Was able to reproduce the problem where ION does not delete a sliver even though stitcher submits a deletesliver.

The sequence, using the RSpec from İlker Özçelik (6 node star with UKYPKS2 as the center node; PKS2 also uses an OVS image):

  1. In the GENI portal, create stitched sliver "iozcelik1" and leave it running.
  2. With omni, create a stitched sliver "lnilker" using the same RSpec as the portal.
  3. The omni sliver fails due to lack of resources at GPO; stitcher requests deletion at all sites.
  4. Delete portal sliver "iozcelik1".
  5. Create the omni sliver "lnilker" again. At this point I can see that ION still has the old sliver record for the sliver that should have been deleted.
  6. Checked sliver status for the ION "lnilker" sliver:
          "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143451",
          "geni_error": "",
          "geni_status": "ready"
    
          "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143461",
          "geni_error": "VLAN cancelled by rollback from contingent failure",
          "geni_status": "failed"
    
          "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143471",
          "geni_error": "VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.newy:et-5/0/0:gpo-ig  on reservation ion.internet2.edu-143471 in VLAN PCE'",
          "geni_status": "failed"
    
          "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143481",
          "geni_error": "VLAN cancelled by rollback from contingent failure",
          "geni_status": "failed"
    
          "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143491",
          "geni_error": "VLAN cancelled by rollback from contingent failure",
          "geni_status": "failed"
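The status entries above can be grouped by state to show at a glance which OSCARS circuits would need manual cleanup. This is a sketch only: `circuit_id` is a hypothetical helper that assumes the `<slice>_vlan_<circuit id>` URN suffix seen in this output, and the data is copied from the entries above rather than fetched with sliverstatus.

```python
from collections import defaultdict

PREFIX = "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-"
ROLLBACK = "VLAN cancelled by rollback from contingent failure"
NO_VLANS = (
    "VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link "
    "ion.internet2.edu:rtr.newy:et-5/0/0:gpo-ig  on reservation "
    "ion.internet2.edu-143471 in VLAN PCE'"
)

# Entries copied from the sliverstatus output above.
slivers = [
    {"geni_urn": PREFIX + "143451", "geni_error": "", "geni_status": "ready"},
    {"geni_urn": PREFIX + "143461", "geni_error": ROLLBACK, "geni_status": "failed"},
    {"geni_urn": PREFIX + "143471", "geni_error": NO_VLANS, "geni_status": "failed"},
    {"geni_urn": PREFIX + "143481", "geni_error": ROLLBACK, "geni_status": "failed"},
    {"geni_urn": PREFIX + "143491", "geni_error": ROLLBACK, "geni_status": "failed"},
]


def circuit_id(sliver_urn):
    """Assumes the '<slice>_vlan_<circuit id>' suffix seen in the URNs above."""
    return sliver_urn.rsplit("_vlan_", 1)[-1]


by_status = defaultdict(list)
for s in slivers:
    by_status[s["geni_status"]].append(circuit_id(s["geni_urn"]))

# Entries whose error is not a rollback are the likely root cause of the failure.
root_causes = [
    circuit_id(s["geni_urn"])
    for s in slivers
    if s["geni_status"] == "failed" and s["geni_error"] != ROLLBACK
]

for status, circuits in sorted(by_status.items()):
    print(status, circuits)
print("root cause candidates:", root_causes)
```

Here the one "ready" circuit is the leftover record, and the single non-rollback failure points at the gpo-ig link as the origin of the other three cancellations.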
    

comment:19 Changed 9 years ago by xyang@maxgigapop.net

Circuit ion.internet2.edu-143451 was cancelled when I checked. Was that done manually or automatically? Is the sliver 'lnilker' still there and still showing "ready" for that circuit?

comment:20 Changed 9 years ago by lnevers@bbn.com

The circuit was active, and I cancelled it to clean up at the end of the test (or step 10) as described in the ticket http://groups.geni.net/geni/ticket/1356.

comment:21 Changed 9 years ago by lnevers@bbn.com

Resolution: fixed
Status: reopened → closed

I have been able to set up the 9 node complex topology without any problems a few times in the past week. Closing ticket.
