Opened 5 years ago

Last modified 4 years ago

#1346 accepted

MAX and ION aggregates do not deletesliver even though success is returned for deletesliver

Reported by: lnevers@bbn.com Owned by: xyang@maxgigapop.net
Priority: major Milestone:
Component: STITCHING Version: SPIRAL7
Keywords: GENI Network Stitching Cc:
Dependencies:

Description

In a 9 node grid scenario the MAX aggregate reports to have deleted a sliver, but on the next stitcher createsliver attempt the command fails due to a reservation already existing at MAX for the slice:

Problem is reproducible.

$ stitcher.py createsliver 9grid
/home/lnevers/gcf-test/stitch-test/9grid/stitch-9grid.rspec
...

10:59:31 INFO    : Using SCS at
http://nutshell.maxgigapop.net:8081/geni/xmlrpc
10:59:31 INFO    : Calling SCS...
10:59:41 INFO    : Multi-AM reservation will include resources from
these aggregates:
10:59:41 INFO    :     <Aggregate nysernet-ig>
10:59:41 INFO    :     <Aggregate oess>
10:59:41 INFO    :     <Aggregate max-ig>
10:59:41 INFO    :     <Aggregate utah-ig>
10:59:41 INFO    :     <Aggregate gpo-ig>
10:59:41 INFO    :     <Aggregate missouri-ig>
10:59:41 INFO    :     <Aggregate utahddc-ig>
10:59:41 INFO    :     <Aggregate ion>
10:59:41 INFO    :     <Aggregate stanford-ig>
10:59:41 INFO    :     <Aggregate max>
10:59:41 INFO    :     <Aggregate wisconsin-ig>
10:59:41 INFO    :     <Aggregate illinois-ig>
10:59:41 INFO    :     <Aggregate utah-pg>
10:59:41 INFO    : Stitcher doing createsliver at <Aggregate nysernet-ig>...
11:00:03 INFO    : ... Allocation at <Aggregate nysernet-ig> complete.
11:00:03 INFO    : Stitcher doing createsliver at <Aggregate gpo-ig>...
11:00:34 INFO    : ... Allocation at <Aggregate gpo-ig> complete.
11:00:34 INFO    : Stitcher doing createsliver at <Aggregate missouri-ig>...
11:01:12 INFO    : ... Allocation at <Aggregate missouri-ig> complete.
11:01:13 INFO    : Stitcher doing createsliver at <Aggregate stanford-ig>...
11:01:49 INFO    : ... Allocation at <Aggregate stanford-ig> complete.
11:01:49 INFO    : Stitcher doing createsliver at <Aggregate max>...
11:02:41 INFO    : DCN AM <Aggregate max>: must wait for status ready....
11:02:41 INFO    : Pausing 30 seconds to let circuit become ready...
11:03:19 INFO    : Pausing 30 seconds to let circuit become ready...
11:03:59 WARNING : sliverstatus: 4084 is (still) failed at <Aggregate
max>. Delete and retry.
11:03:59 WARNING : sliverstatus: 4085 is (still) changing at <Aggregate
max>. Delete and retry.
11:03:59 WARNING : sliverstatus: 4086 is (still) changing at <Aggregate
max>. Delete and retry.
11:04:07 WARNING : <Aggregate max> says requested VLAN was unavailable
at <Hop u'urn:publicid:IDN+dragon.maxgigapop.net+interface+clpk:1-1-2:*'
on path u'stitched11'>
11:04:07 INFO    : Will put <Aggregate max> back in the pool to
allocate. Got: VLAN was unavailable. Retry <Aggregate max> 2nd time with
<Hop u'urn:publicid:IDN+dragon.maxgigapop.net+interface+clpk:1-1-2:*' on
path u'stitched11'> new suggested 3108 (not 3214)
11:04:07 INFO    : Pausing for 600 seconds for Aggregates to free up
resources...


11:14:07 INFO    : Stitcher doing createsliver at <Aggregate oess>...
11:14:13 INFO    : ... Allocation at <Aggregate oess> complete.
11:14:13 INFO    : Stitcher doing createsliver at <Aggregate max>...
11:14:24 ERROR   :  {'output': ': CreateSliver: Existing record:
urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid, ', 'geni_api': 2,
'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''}
11:14:24 WARNING : Stitching failed but will retry: Reservation request
impossible at <Aggregate max>. You already have a reservation here in
this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: :
CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:...
11:14:24 INFO    : Deleting existing reservations...
11:14:24 INFO    : Doing deletesliver at <Aggregate nysernet-ig>...
11:15:20 INFO    : Deleted reservation at <Aggregate nysernet-ig>.
11:15:20 INFO    : Doing deletesliver at <Aggregate oess>...
11:15:42 INFO    : Deleted reservation at <Aggregate oess>.
11:15:42 INFO    : Doing deletesliver at <Aggregate gpo-ig>...
11:16:37 INFO    : Deleted reservation at <Aggregate gpo-ig>.
11:16:37 INFO    : Doing deletesliver at <Aggregate missouri-ig>...
11:17:32 INFO    : Deleted reservation at <Aggregate missouri-ig>.
11:17:32 INFO    : Doing deletesliver at <Aggregate stanford-ig>...
11:18:28 INFO    : Deleted reservation at <Aggregate stanford-ig>.
11:18:28 INFO    : Calling SCS for the 2nd time...
11:18:29 ERROR   : Root cause error: Reservation request impossible at
<Aggregate max>. You already have a reservation here in this slice:
AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: :
CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:...
Reservation request impossible at <Aggregate max>. You already have a
reservation here in this slice: AMAPIError: Error from Aggregate: code
7. sfa AM code: 7: : CreateSliver: Existing record:
urn:publicid:IDN+ch.geni.net:... which caused
StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE
ComputeWorker return error message '
Action_ProcessRequestTopology_MP2P::Finish() Cannot find the set of
paths for the RequestTopology. '.

Attaching stitcher log and request rspec for MAX

Attachments (5)

9grid-createsliver-request-12-dragon-maxgigapop-net.xml (94.7 KB) - added by lnevers@bbn.com 5 years ago.
stitcher.log (542.3 KB) - added by lnevers@bbn.com 5 years ago.
9grid-createsliver-request-11-dragon-maxgigapop-net.xml (94.7 KB) - added by lnevers@bbn.com 5 years ago.
star-pks2-lot.txt (5.3 KB) - added by lnevers@bbn.com 5 years ago.
stitcher.log-ion.txt (307.5 KB) - added by lnevers@bbn.com 5 years ago.

Download all attachments as: .zip

Change History (12)

Changed 5 years ago by lnevers@bbn.com

Changed 5 years ago by lnevers@bbn.com

Attachment: stitcher.log added

Changed 5 years ago by lnevers@bbn.com

comment:1 Changed 5 years ago by lnevers@bbn.com

Summary: MAX aggregate does not deletesliver even though success is returned for deletesliverMAX and ION aggregates do not deletesliver even though success is returned for deletesliver

This problem also occurs at ION, where after a deletesliver, the next attempt to re-create the sliver fails with "Existing Record". Excerpt from command line output:

s08:46:28 INFO    : Stitcher doing createsliver at <Aggregate ion>...
08:47:15 INFO    : DCN AM <Aggregate ion>: must wait for status ready....
08:47:15 INFO    : Pausing 60 seconds to let circuit become ready...
08:48:22 WARNING : sliverstatus: 125451 is (still) failed at <Aggregate ion>. Delete and retry.
08:48:22 WARNING : sliverstatus: 125471 is (still) changing at <Aggregate ion>. Delete and retry.
08:48:22 WARNING : sliverstatus: 125441 is (still) ready at <Aggregate ion>. Delete and retry.
08:48:22 WARNING : sliverstatus: 125461 is (still) changing at <Aggregate ion>. Delete and retry.
08:48:28 WARNING : <Aggregate ion> says requested VLAN was unavailable at <Hop u'urn:publicid:IDN+ion.internet2.edu+interface+rtr.newy:ae0:bbn' on path u'link-2-iggpo'>
08:48:28 INFO    : Deleting some reservations to retry, avoiding failed VLAN...
08:48:28 INFO    : Doing deletesliver at <Aggregate gpo-ig>...
08:49:21 INFO    : Will put <Aggregate ion> back in the pool to allocate. Got: Retrying reservations at earlier AMs to avoid unavailable VLAN tag at <Aggregate ion>....
08:49:21 INFO    : Pausing for 30 seconds for Aggregates to free up resources...

08:49:51 INFO    : Stitcher doing createsliver at <Aggregate gpo-ig>...
08:50:06 INFO    : ... Allocation at <Aggregate gpo-ig> complete.
08:50:07 INFO    : Stitcher doing createsliver at <Aggregate ion>...
08:50:15 ERROR   :  {'output': ': CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:ln-test+slice+star-pks2, ', 'geni_api': 2, 'code': {'am_type': 'sfa', 'geni_code': 7, 'am_code': 7}, 'value': ''}
08:50:15 WARNING : Stitching failed but will retry: Reservation request impossible at <Aggregate ion>. You already have a reservation here in this slice: AMAPIError: Error from Aggregate: code 7. sfa AM code: 7: : CreateSliver: Existing record: urn:publicid:IDN+ch.geni.net:...
08:50:15 INFO    : Deleting existing reservations...

Tried again 23 minute later and still got the existing record error. Attaching full command output and stitcher log for this run which was a star topology with UKYPKS2 as the center node.

Changed 5 years ago by lnevers@bbn.com

Attachment: star-pks2-lot.txt added

Changed 5 years ago by lnevers@bbn.com

Attachment: stitcher.log-ion.txt added

comment:2 Changed 5 years ago by xyang@maxgigapop.net

Owner: changed from somebody to xyang@maxgigapop.net
Status: newassigned

comment:3 Changed 5 years ago by xyang@maxgigapop.net

Status: assignedaccepted

Tried the attached rspec manually with similar scenarios that one of the three circuits failed. However, I was not able to reproduce the problem. Every time the sliver is deleted cleanly.

comment:4 Changed 5 years ago by lnevers@bbn.com

Using the 9node grid RSpec attached to this ticket, I see the failure at ION.

Six minutes after the deletesliver, it is still possible to get status from ION for the sliver:

09:16:41 INFO    : Sliver status for Slice urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid at AM ion
09:16:41 INFO    : {
  "geni_resources": [
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-135981",
      "geni_error": "VLAN PCE(PCE_CREATE_FAILED): 'There are no VLANs available on link ion.internet2.edu:rtr.salt:xe-0/1/1:stanford-ig  on reservation ion.internet2.edu-135981 in VLAN PCE'",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-135991",
      "geni_error": "",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136001",
      "geni_error": "",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136011",
      "geni_error": "",
      "geni_status": "ready"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136021",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136031",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136041",
      "geni_error": "",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136051",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136061",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136071",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    },
    {
      "geni_urn": "urn:publicid:IDN+ion.internet2.edu+sliver+9grid_vlan_ion.internet2.edu-136081",
      "geni_error": "VLAN cancelled by rollback from contingent failure",
      "geni_status": "failed"
    }
  ],
  "geni_expires": "2014-11-10T13:48:05.000Z",
  "geni_urn": "urn:publicid:IDN+ch.geni.net:ln-test+slice+9grid",
  "geni_status": "failed"
}

Could the deletesliver not be taking place because of the failure on the circuits? or because the aggregate is too busy and ignores the request?

comment:5 Changed 5 years ago by lnevers@bbn.com

Using the Development SCS version was able to verify a fix to this problem at the MAX aggregate.

This fix needs to be installed at the ION aggregate.

comment:6 Changed 4 years ago by lnevers@bbn.com

The fix was applied to ION to address this problem, and have run every topology that used to get this failure, but have been unsuccessful in duplicating the sequence of events today to verify the fix.

Last week, a create sliver operation resulted in the following sequence of events:

  1. Createsliver at rack aggregates
  2. Createsliver at ION
  3. Createsliver at AL2s (which failed which caused cleanup of sliver)
  4. Deletesliver at rack and ION aggregates, with ION showing the problem.

Today, a create sliver operation results in the following sequence of events:

  1. Createsliver at rack aggregates
  2. Createsliver at AL2s (which failed which causes cleanup of sliver)
  3. Deletesliver at rack aggregates.

The ION createsliver is now not taking place, something changed. Sent Xi email to about this change in behavior.

comment:7 Changed 4 years ago by hdempsey@bbn.com

Should we close this ticket now that ION aggregate has been decommissioned?

Note: See TracTickets for help on using tickets.