Opened 9 years ago
Closed 9 years ago
#1353 closed (fixed)
Unknown path reported for complex 9 site topology
Reported by: | lnevers@bbn.com | Owned by: | xyang@maxgigapop.net |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | STITCHING | Version: | SPIRAL7 |
Keywords: | GENI Network Stitching | Cc: | |
Dependencies: |
Description
A 9 node topology that uses both AL2S and ION sites reports a "Cannot find the set of paths" error even though all sites are known.
A diagram is attached for the topology along with the RSpec used.
According to Xi, the link between Illinois and Chicago (ill-chic-4) caused the failure. This is a known issue that is fixed in the 2.0 branch, which is running on the production and development SCS.
After switching to an SCS with the fix, there is still a problem with the link between Chicago and Rutgers (chic-rut-7). Xi is investigating this problem.
Attachments (5)
Change History (26)
Changed 9 years ago by
Attachment: | topology-9sites.jpg added |
---|
Changed 9 years ago by
Attachment: | stitch-9sites.rspec added |
---|
comment:1 Changed 9 years ago by
Changed 9 years ago by
Attachment: | 9sites-createsliver-request-11-geni-uchicago-edu.xml added |
---|
comment:2 follow-up: 10 Changed 9 years ago by
I did some further testing and simplified the topology:
- removed all endpoints not directly connected to Chicago
- then kept removing failing links (stan-chic-6, pks-chic-3, ill-chic-4, gpo-chic-5) until the topology worked.
- I was eventually able to create:
<gpo-eg><-ION-AL2S-><chicago-ig> <-AL2S-ION-><ukypks2-ig>
I then ran some tests with some of the links that had failed, using these topologies:
a. UKYPKS2<-ION-AL2S->Chicago
b. Illinois<-ION-AL2S->Chicago
c. UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois
Both "a" and "b" worked.
Topology "c" shows the "no edge hop" failure. Link order also matters: the second link in the request RSpec always gets the "no edge hop" failure.
comment:3 follow-up: 4 Changed 9 years ago by
A fix was applied to the Test SCS; re-running the initial 9 site test topology.
comment:4 Changed 9 years ago by
Replying to lnevers@…:
A fix was applied to the Test SCS; re-running the initial 9 site test topology.
According to Xi:
Removed the logic that guarantees mutual exclusion of resources (bandwidth and VLANs) between multiple paths.
So there could be a small risk that you double allocate VLAN on some link(s). However the chance is small due to randomization.
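To get a rough feel for the "small risk" mentioned above, here is a minimal sketch. It is not the SCS logic: it assumes each of several paths sharing a link independently picks one VLAN tag uniformly at random from the free tags, and estimates the chance that at least two paths pick the same tag (a birthday-style calculation). The function name and the uniform-choice model are assumptions made for illustration.

```python
from fractions import Fraction


def double_allocation_probability(free_vlans: int, paths: int) -> float:
    """Chance that at least two of `paths` independent, uniform picks
    from `free_vlans` tags land on the same tag (assumed model)."""
    if free_vlans <= 0:
        raise ValueError("free_vlans must be positive")
    if paths > free_vlans:
        # More paths than tags: a collision is unavoidable.
        return 1.0
    no_collision = Fraction(1)
    for already_taken in range(paths):
        # Each further path must avoid the tags picked so far.
        no_collision *= Fraction(free_vlans - already_taken, free_vlans)
    return float(1 - no_collision)
```

Under this model, two paths choosing among 20 free tags would collide with probability 0.05, and the figure shrinks as the pool of free tags grows.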
comment:5 Changed 9 years ago by
Have been trying to verify the fix, but ION keeps failing with "requested VLAN was unavailable" error.
Also trying to run the simpler three node topology that also showed the problem (UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois), but again ION has been failing.
Will keep trying both topologies, but not at the same time.
comment:6 Changed 9 years ago by
Still trying to verify the fix. The 9 node topology failed to allocate 2 circuits at ION (to tamu and to illinois-ig).
Fell back to the 3 node linear topology to focus on the fix and bypass the ION circuit issue, but the stitcher has tried 8 times and the sliver fails each time at Illinois. Judging from the stitcher log, it seems that all 20 Illinois VLANs are in use. Asked Xi to verify since I cannot check with "listresources --available".
comment:7 Changed 9 years ago by
Still trying to verify the fix by running the simpler 3 node linear sliver. All attempts yesterday and this morning are failing with:
VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.chic:et-10/0/0:illinois-ig
This happens even though only 3 VLANs are in use, as verified with the ION router proxy.
comment:8 Changed 9 years ago by
While waiting for the issue with the Illinois VLANs availability to be resolved, modified original RSpec to remove Illinois so I could verify the SCS fix.
All attempts to re-run the topology, even with fewer nodes (8, 7, 6), failed with a known issue:
http://groups.geni.net/geni/ticket/1346 (MAX and ION aggregates do not deletesliver even though success is returned for deletesliver)
Xi will be looking at this tonight.
comment:9 Changed 9 years ago by
I re-ran a simpler version of the 9 node topology where only 6 nodes were left in the RSpec (star topology with Chicago as the center node). The problem happened on the first try:
08:40:18 INFO : DCN AM <Aggregate ion>: must wait for status ready....
08:40:18 INFO : Pausing 30 seconds to let circuit become ready...
08:40:56 INFO : Pausing 30 seconds to let circuit become ready...
08:41:34 WARNING : sliverstatus: 135951 is (still) changing at <Aggregate ion>. Delete and retry.
08:41:34 WARNING : sliverstatus: 135971 is (still) changing at <Aggregate ion>. Delete and retry.
08:41:34 WARNING : sliverstatus: 135941 is (still) failed at <Aggregate ion>. Delete and retry.
08:41:34 WARNING : sliverstatus: 135961 is (still) changing at <Aggregate ion>. Delete and retry.
08:41:34 WARNING : sliverstatus: 135931 is (still) failed at <Aggregate ion>. Delete and retry.
08:41:40 WARNING : <Aggregate ion> says requested VLAN was unavailable at <Hop u'urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:xe-0/1/3:ukypks2-ig' on path u'pks-ucd-1'>
08:41:40 INFO : Deleting some reservations to retry, avoiding failed VLAN...
08:41:40 INFO : Doing deletesliver at <Aggregate ukypks2-ig>...
08:42:27 INFO : Will put <Aggregate ion> back in the pool to allocate. Got: Retrying reservations at earlier AMs to avoid unavailable VLAN tag at <Aggregate ion>....
08:42:27 INFO : Pausing for 30 seconds for Aggregates to free up resources...
08:42:57 INFO : Stitcher doing createsliver at <Aggregate ukypks2-ig>...
08:43:14 INFO : ... Allocation at <Aggregate ukypks2-ig> complete.
08:43:14 INFO : Stitcher doing createsliver at <Aggregate ion>...
08:43:22 ERROR : {'output': ': CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:ln-test+slice+6sites, ', 'geni_api': 2, 'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''}
08:43:22 WARNING : Stitching failed but will retry: Reservation request impossible at <Aggregate ion>. You already have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:...
I am attaching the stitcher log plus the modified 6 node RSpec. The stitcher log shows at line 2063 (08:41:39) that the sliver was successfully deleted at ION and a failure is returned on the next attempt on line 2354 (08:43:22).
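For anyone scanning a long stitcher log for this symptom, a small script along these lines may help. It is only a sketch: it assumes log lines follow the `HH:MM:SS LEVEL : message` shape seen in the excerpt above, and the function name `find_existing_record_errors` is made up for illustration (it is not part of omni or stitcher).

```python
import re

# Assumed line shape, based on the excerpt in this comment:
#   "08:43:22 ERROR : <message>"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(INFO|WARNING|ERROR)\s*:\s*(.*)$")


def find_existing_record_errors(lines):
    """Return (line_number, timestamp, message) for every ERROR line
    whose message mentions 'Existing record'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like log entries
        timestamp, level, message = match.groups()
        if level == "ERROR" and "Existing record" in message:
            hits.append((number, timestamp, message))
    return hits
```

Passing the lines of a saved stitcher log to the function yields the line numbers to inspect, which can then be compared with the preceding deletesliver entries by hand.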
Changed 9 years ago by
Attachment: | stitch-6sites.rspec added |
---|
Changed 9 years ago by
Attachment: | stitcher-log-6star.txt added |
---|
comment:10 Changed 9 years ago by
Capturing update to comment:2:
- UKYPKS2<-ION-AL2S->Chicago<-AL2S-ION->Illinois
The above topology continuously reported that Illinois had no VLANs, when only 3 out of 20 were in use. Reported it to Chad, who had the following update:
On 11/6/14 10:23 AM, Chad Kotil wrote:
I think we're hitting a bug in ION Oscars. I'm able to reproduce the issue locally. It seems that if you specify vlan endpoints for a circuit with the same interface (port), the Path Calculation Engine (PCE) throws an error 'no vlans available', which is wrong. I've even gone so far as letting oscars choose vlans, create the circuit which goes ACTIVE. Then cancel the circuit, only to clone it again so I'm using the same vlans. But it still fails with the exact same error. I am contacting Oscars support now to try and get resolution for this issue. In the meantime I think a workaround would be to try and create a circuit with different ports. --Chad
comment:11 follow-up: 12 Changed 9 years ago by
Update on the OSCARS problem with the Path Calculation Engine:
On 11/11/14 10:06 AM, Chad Kotil wrote:
The es.net Oscars team has patched the bug. There is a maintenance window Thursday morning to upgrade ION Oscars at 7am EST.
The MAX/ION problem with deleting slivers (http://groups.geni.net/geni/ticket/1346) is solved at MAX; the fix still needs to be applied at ION.
Need to try original topology when fix is installed at ION.
comment:12 Changed 9 years ago by
In addition to the updates to:
- OSCARS to address problem with the Path Calculation Engine (Thursday morning 7am EST)
- ION to address the delete slivers bug (http://groups.geni.net/geni/ticket/1346)
The production SCS also needs an update to introduce the two fixes required to get the initial 9 node topology in this ticket to work:
- Initial fix addressed "Cannot find the set of paths" error.
- Second fix addressed incomplete paths. For links that included both AL2S and ION as part of one path, the ION part of the path was always missing.
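One way to spot the incomplete-path symptom is to list which aggregate authorities appear among the hops of a path and compare them with the ones expected. The sketch below assumes hop URNs have the form `urn:publicid:IDN+<authority>+interface+...`, as in the ION hop URN quoted in the stitcher log earlier in this ticket. The AL2S authority string `al2s.internet2.edu` and both function names are assumptions for illustration, and extracting the hop URNs from a manifest RSpec is left out.

```python
def authorities_in_path(hop_urns):
    """Collect the authority field (the part right after 'IDN+') of each hop URN."""
    found = set()
    for urn in hop_urns:
        parts = urn.split("+")
        if len(parts) >= 2 and parts[0] == "urn:publicid:IDN":
            found.add(parts[1])
    return found


def missing_authorities(hop_urns, expected):
    """Return the expected authorities that no hop in the path belongs to."""
    return sorted(set(expected) - authorities_in_path(hop_urns))
```

For a path whose hops all belong to the assumed AL2S authority, `missing_authorities(hops, ["al2s.internet2.edu", "ion.internet2.edu"])` would return `["ion.internet2.edu"]`, which matches the symptom described in the second fix.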
comment:13 Changed 9 years ago by
Chad installed a fix for the OSCARS problem with the Path Calculation Engine that results in a 'no vlans available' error when a request specifies VLAN endpoints for a circuit with the same interface (port).
The fix does not seem to have solved the problem; reported the failure.
comment:14 Changed 9 years ago by
OSCARS circuit cleanup for the following active circuits, which were associated with slivers that did not exist:
- Slice "9sites", according to sliverstatus there is no sliver at ION:
  - ion.internet2.edu-138261
  - ion.internet2.edu-138061
  - ion.internet2.edu-137901
  - ion.internet2.edu-137961
  - ion.internet2.edu-137831
- Slice "iozcelik3", according to experimenter he has no slice:
  - ion.internet2.edu-138141
  - ion.internet2.edu-137991
- Slice "iozcelik", according to experimenter he has no slice:
  - ion.internet2.edu-137301
- Slice "IOTest", according to experimenter he has no slice:
  - ion.internet2.edu-137711
Once cleanup was completed, was able to create the initial topology. The OSCARS PCE fix is deemed validated.
Still need the SCS fixes to be installed at the Production SCS; will close the ticket when this occurs.
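The cleanup in this comment amounts to a set difference: circuits that are active in OSCARS but are not referenced by any existing sliver. A minimal sketch, assuming both lists have already been gathered by hand (no OSCARS or aggregate call is made here, and the function name is hypothetical):

```python
def orphaned_circuits(active_circuits, circuits_with_slivers):
    """Return the active circuit IDs that no existing sliver refers to,
    preserving the order of `active_circuits`."""
    known = set(circuits_with_slivers)
    return [circuit for circuit in active_circuits if circuit not in known]
```

Anything the function returns is only a candidate for cancellation; as in the list above, it is worth confirming with sliverstatus or the experimenter before cancelling a circuit.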
comment:15 Changed 9 years ago by
The fix for the ION deletesliver problem was installed on November 25th. Extensive testing has taken place and I have not been able to reproduce the error scenario. The fix is deemed successful; closing ticket.
comment:16 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:17 Changed 9 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
The fix installed November 25 did not work: the one experimenter (İlker Özçelik) who also saw the problem with ION not deleting the sliver ran into it several times yesterday. I have been cleaning up his circuits to help him set up a disruptive experiment, which is scheduled for today.
I used his RSpec when I tested the fix, but had not seen the problem. I will talk to Ilker to understand how he is getting to the failure. Re-opening the ticket.
comment:18 Changed 9 years ago by
Was able to reproduce the problem where ION does not delete a sliver even though stitcher submits a deletesliver.
The sequence using the RSpec from Ilker Özçelik (6 node star with UKYPKS2 as the center node; PKS2 also uses an OVS image):
- In GENI portal, create stitched sliver "iozcelik1" and leave it running.
- With omni, create a stitched sliver "lnilker" using the same RSpec as the portal.
- The omni sliver fails due to lack of resources at GPO; stitcher requests delete at all sites.
- Delete portal sliver "iozcelik1".
- Create the omni sliver "lnilker" again. At this point I can see that ION has the old sliver record for the sliver that should have been deleted.
- Check sliver status for the ION "lnilker" sliver:
"geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143451", "geni_error": "", "geni_status": "ready"
"geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143461", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed"
"geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143471", "geni_error": "VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.newy:et-5/0/0:gpo-ig on reservation ion.internet2.edu-143471 in VLAN PCE'", "geni_status": "failed"
"geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143481", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed"
"geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+lnilker_vlan_ion.internet2.edu-143491", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed"
comment:19 Changed 9 years ago by
Circuit ion.internet2.edu-143451 was cancelled when I checked. Was that done manually or automatically? Is the sliver 'lnilker' still there and still showing "ready" for that circuit?
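A question like this can be checked quickly against sliverstatus entries such as those shown in comment:18. The sketch below assumes the entries have already been parsed into dictionaries carrying the `geni_urn`, `geni_status` and `geni_error` keys seen in that output; the function name is made up for illustration and no aggregate is contacted.

```python
from collections import Counter


def summarize_sliver_status(entries):
    """Tally entries by geni_status and list the URNs still reported as 'ready'."""
    counts = Counter(entry.get("geni_status", "unknown") for entry in entries)
    ready_urns = [
        entry["geni_urn"]
        for entry in entries
        if entry.get("geni_status") == "ready"
    ]
    return counts, ready_urns
```

Applied to the five entries from comment:18, the tally would be one "ready" and four "failed", with the 143451 sliver as the only URN in the ready list.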
comment:20 Changed 9 years ago by
The circuit was active, and I cancelled it to clean up at the end of the test (or step 10) as described in the ticket http://groups.geni.net/geni/ticket/1356.
comment:21 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I have been able to set up the 9 node complex topology without any problems a few times in the past week. Closing ticket.
After removing the Illinois and Chicago (ill-chic-4) link and the Chicago and Rutgers (chic-rut-7) link, there is still an error reported by Chicago IG:
The Manifest RSpec for this stitched link (attached) shows only the AL2S part of the path and is missing ION.