Opened 9 years ago
Last modified 9 years ago
#1346 accepted
MAX and ION aggregates do not deletesliver even though success is returned for deletesliver
Reported by: | lnevers@bbn.com | Owned by: | xyang@maxgigapop.net |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | STITCHING | Version: | SPIRAL7 |
Keywords: | GENI Network Stitching | Cc: | |
Dependencies: |
Description
In a 9 node grid scenario the MAX aggregate reports to have deleted a sliver, but on the next stitcher createsliver attempt the command fails due to a reservation already existing at MAX for the slice:
Problem is reproducible.
$ stitcher.py createsliver 9grid /home/lnevers/gcf-test/stitch-test/9grid/stitch-9grid.rspec ... 10:59:31 INFO : Using SCS at http://nutshell.maxgigapop.net:8081/geni/xmlrpc 10:59:31 INFO : Calling SCS... 10:59:41 INFO : Multi-AM reservation will include resources from these aggregates: 10:59:41 INFO : <Aggregate nysernet-ig> 10:59:41 INFO : <Aggregate oess> 10:59:41 INFO : <Aggregate max-ig> 10:59:41 INFO : <Aggregate utah-ig> 10:59:41 INFO : <Aggregate gpo-ig> 10:59:41 INFO : <Aggregate missouri-ig> 10:59:41 INFO : <Aggregate utahddc-ig> 10:59:41 INFO : <Aggregate ion> 10:59:41 INFO : <Aggregate stanford-ig> 10:59:41 INFO : <Aggregate max> 10:59:41 INFO : <Aggregate wisconsin-ig> 10:59:41 INFO : <Aggregate illinois-ig> 10:59:41 INFO : <Aggregate utah-pg> 10:59:41 INFO : Stitcher doing createsliver at <Aggregate nysernet-ig>... 11:00:03 INFO : ... Allocation at <Aggregate nysernet-ig> complete. 11:00:03 INFO : Stitcher doing createsliver at <Aggregate gpo-ig>... 11:00:34 INFO : ... Allocation at <Aggregate gpo-ig> complete. 11:00:34 INFO : Stitcher doing createsliver at <Aggregate missouri-ig>... 11:01:12 INFO : ... Allocation at <Aggregate missouri-ig> complete. 11:01:13 INFO : Stitcher doing createsliver at <Aggregate stanford-ig>... 11:01:49 INFO : ... Allocation at <Aggregate stanford-ig> complete. 11:01:49 INFO : Stitcher doing createsliver at <Aggregate max>... 11:02:41 INFO : DCN AM <Aggregate max>: must wait for status ready.... 11:02:41 INFO : Pausing 30 seconds to let circuit become ready... 11:03:19 INFO : Pausing 30 seconds to let circuit become ready... 11:03:59 WARNING : sliverstatus: 4084 is (still) failed at <Aggregate max>. Delete and retry. 11:03:59 WARNING : sliverstatus: 4085 is (still) changing at <Aggregate max>. Delete and retry. 11:03:59 WARNING : sliverstatus: 4086 is (still) changing at <Aggregate max>. Delete and retry. 11:04:07 WARNING : <Aggregate max> says requested VLAN was unavailable at <Hop u'urn:publicid:IDN+dragon.maxgigapop.net+interface+clpk:1-1-2:*' on path u'stitched11'> 11:04:07 INFO : Will put <Aggregate max> back in the pool to allocate. Got: VLAN was unavailable. Retry <Aggregate max> 2nd time with <Hop u'urn:publicid:IDN+dragon.maxgigapop.net+interface+clpk:1-1-2:*' on path u'stitched11'> new suggested 3108 (not 3214) 11:04:07 INFO : Pausing for 600 seconds for Aggregates to free up resources... 11:14:07 INFO : Stitcher doing createsliver at <Aggregate oess>... 11:14:13 INFO : ... Allocation at <Aggregate oess> complete. 11:14:13 INFO : Stitcher doing createsliver at <Aggregate max>... 11:14:24 ERROR : {'output': ': CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid, ', 'geni_api': 2, 'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''} 11:14:24 WARNING : Stitching failed but will retry: Reservation request impossible at <Aggregate max>. You already have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:... 11:14:24 INFO : Deleting existing reservations... 11:14:24 INFO : Doing deletesliver at <Aggregate nysernet-ig>... 11:15:20 INFO : Deleted reservation at <Aggregate nysernet-ig>. 11:15:20 INFO : Doing deletesliver at <Aggregate oess>... 11:15:42 INFO : Deleted reservation at <Aggregate oess>. 11:15:42 INFO : Doing deletesliver at <Aggregate gpo-ig>... 11:16:37 INFO : Deleted reservation at <Aggregate gpo-ig>. 11:16:37 INFO : Doing deletesliver at <Aggregate missouri-ig>... 11:17:32 INFO : Deleted reservation at <Aggregate missouri-ig>. 11:17:32 INFO : Doing deletesliver at <Aggregate stanford-ig>... 11:18:28 INFO : Deleted reservation at <Aggregate stanford-ig>. 11:18:28 INFO : Calling SCS for the 2nd time... 11:18:29 ERROR : Root cause error: Reservation request impossible at <Aggregate max>. You already have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:... Reservation request impossible at <Aggregate max>. You already have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:... which caused StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE ComputeWorker return error message ' Action_ProcessRequestTopology_MP2P::Finish() Cannot find the set of paths for the RequestTopology. '.
Attaching stitcher log and request rspec for MAX
Attachments (5)
Change History (12)
Changed 9 years ago by
Attachment: | 9grid-createsliver-request-12-dragon-maxgigapop-net.xml added |
---|
Changed 9 years ago by
Attachment: | stitcher.log added |
---|
Changed 9 years ago by
Attachment: | 9grid-createsliver-request-11-dragon-maxgigapop-net.xml added |
---|
comment:1 Changed 9 years ago by
Summary: | MAX aggregate does not deletesliver even though success is returned for deletesliver → MAX and ION aggregates do not deletesliver even though success is returned for deletesliver |
---|
Changed 9 years ago by
Attachment: | star-pks2-lot.txt added |
---|
Changed 9 years ago by
Attachment: | stitcher.log-ion.txt added |
---|
comment:2 Changed 9 years ago by
Owner: | changed from somebody to xyang@maxgigapop.net |
---|---|
Status: | new → assigned |
comment:3 Changed 9 years ago by
Status: | assigned → accepted |
---|
Tried the attached rspec manually with similar scenarios that one of the three circuits failed. However, I was not able to reproduce the problem. Every time the sliver is deleted cleanly.
comment:4 Changed 9 years ago by
Using the 9node grid RSpec attached to this ticket, I see the failure at ION.
Six minutes after the deletesliver, it is still possible to get status from ION for the sliver:
09:16:41 INFO : Sliver status for Slice urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid at AM ion 09:16:41 INFO : { "geni_resources": [ { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-135981", "geni_error": "VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.salt:xe-0/1/1:stanford-ig on reservation ion.internet2.edu-135981 in VLAN PCE'", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-135991", "geni_error": "", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136001", "geni_error": "", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136011", "geni_error": "", "geni_status": "ready" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136021", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136031", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136041", "geni_error": "", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136051", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136061", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136071", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" }, { "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136081", "geni_error": "VLAN cancelled by rollback from contingent failure", "geni_status": "failed" } ], "geni_expires": "2014-11-10T13:48:05.000Z", "geni_urn": "urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid", "geni_status": "failed" }
Could the deletesliver not be taking place because of the failure on the circuits? or because the aggregate is too busy and ignores the request?
comment:5 Changed 9 years ago by
Using the Development SCS version was able to verify a fix to this problem at the MAX aggregate.
This fix needs to be installed at the ION aggregate.
comment:6 Changed 9 years ago by
The fix was applied to ION to address this problem, and have run every topology that used to get this failure, but have been unsuccessful in duplicating the sequence of events today to verify the fix.
Last week, a create sliver operation resulted in the following sequence of events:
- Createsliver at rack aggregates
- Createsliver at ION
- Createsliver at AL2s (which failed which caused cleanup of sliver)
- Deletesliver at rack and ION aggregates, with ION showing the problem.
Today, a create sliver operation results in the following sequence of events:
- Createsliver at rack aggregates
- Createsliver at AL2s (which failed which causes cleanup of sliver)
- Deletesliver at rack aggregates.
The ION createsliver is now not taking place, something changed. Sent Xi email to about this change in behavior.
comment:7 Changed 9 years ago by
Should we close this ticket now that ION aggregate has been decommissioned?
This problem also occurs at ION, where after a deletesliver, the next attempt to re-create the sliver fails with "Existing Record". Excerpt from command line output:
Tried again 23 minute later and still got the existing record error. Attaching full command output and stitcher log for this run which was a star topology with UKYPKS2 as the center node.