Opened 7 years ago

Closed 6 years ago

#1045 closed (fixed)

ION/I2 Aggregate fails to create circuit 51 times for one create sliver

Reported by: lnevers@bbn.com Owned by: ckotil@grnoc.iu.edu
Priority: blocker Milestone:
Component: I2AM Version: SPIRAL5
Keywords: Network Stitching Cc: Aaron Helsinger, tlehman@maxgigapop.net, xyang@maxgigapop.net
Dependencies:

Description

A stitcher createsliver from IG GPO to IG Utah was started 7:45 which generated 51 attempts at getting resources from ION/Internet2 aggregate. All 51 attempts failed and the sliver eventually failed at 9:16.

Following are the warnings for the 51 attempts and the final error, full log is also available:

WARNING:stitch.Aggregate:sliverstatus 18711 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18721 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18731 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18741 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18751 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18761 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18771 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18781 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18791 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18801 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:Doing allocate on <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> for 10th time!
WARNING:stitch.Aggregate:sliverstatus 18811 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18821 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18831 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18841 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18851 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18861 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18871 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18881 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18891 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18901 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18911 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18921 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18931 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18941 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18951 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18961 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18971 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18981 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 18991 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19001 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19011 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19021 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19031 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19041 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19051 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19061 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19071 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19081 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19091 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19101 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19111 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19121 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19131 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19141 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19151 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19161 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19171 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19181 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19191 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19201 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitch.Aggregate:sliverstatus 19211 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
WARNING:stitcher:Stitching failed but will retry: Circuit reservation failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Try again from the SCS
WARNING:stitcher:Had reservation at https://boss.instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am
WARNING:stitcher:Deleted reservation at https://boss.instageni.gpolab.bbn.com:12369/protogeni/xmlrpc/am
WARNING:stitcher:Had reservation at https://www.emulab.net:12369/protogeni/xmlrpc/am
WARNING:stitcher:Deleted reservation at https://www.emulab.net:12369/protogeni/xmlrpc/am
WARNING:stitcher:Had reservation at https://boss.utah.geniracks.net:12369/protogeni/xmlrpc/am
WARNING:stitcher:Deleted reservation at https://boss.utah.geniracks.net:12369/protogeni/xmlrpc/am
StitchingServiceFailedError: Error from Stitching Service: code 3: MxTCE ComputeWorker return error message ' Action_ProcessRequestTopology_MP2P::Finish() Cannot find the set of paths for the RequestTopology. '.

The sliver status for the last failed request:

{
  "geni_resources": [
    {
      "geni_urn":
"urn:publicid:IDN+pgeni.gpolab.bbn.com+sliver+1sl-2aggr-1lnk_vlan_ion.internet2.edu-19211",

      "geni_error": "",
      "geni_status": "failed"
    }
  ],
  "geni_expires": "2013-06-08T12:45:56.000Z",
  "geni_urn":
"urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+1sl-2aggr-1lnk",
  "geni_status": "failed"
}

Attachments (4)

stitch-pg-utah-ig-gpo-20130611.txt (63.2 KB) - added by lnevers@bbn.com 6 years ago.
failed createsliver with 10 min delay
lnstitch1-createsliver-request-11-ion-internet2-edu.xml (4.3 KB) - added by lnevers@bbn.com 6 years ago.
stitch-ig-gpo-ig-utah.rspec (1.6 KB) - added by lnevers@bbn.com 6 years ago.
stitch-pg-utah-ig-gpo-20130612.txt (64.1 KB) - added by lnevers@bbn.com 6 years ago.

Download all attachments as: .zip

Change History (15)

comment:1 Changed 6 years ago by Aaron Helsinger

Cc: Aaron Helsinger added

Email thread suggests that ION was creating the circuits, and then the AM was deleting them. Why?

And then all the requests starting making the routers too busy. So they've asked us to back off.

But we still don't know why the circuits were failing.

comment:2 Changed 6 years ago by xyang@maxgigapop.net

Status: newassigned

This still needs investigation. One theory is that OSCARS creates the circuit. However, ION AM had failure after trying too many times of status query to OSCARS. This may not due to circuit failure but OSCARS API failure. Once a failed status is established ION AM will no longer query OSCARS.

I suggest increasing the query interval to 30 secs or 1 minute to see if that helps.

comment:3 Changed 6 years ago by lnevers@bbn.com

Priority: majorblocker

Stitcher has been modified to wait a full 10 minutes between ION AM calls, which slows testing down drastically.

Using this new version of stitcher, this problem still exists and no slivers can be created with the ION aggregate in the path.

Changed 6 years ago by lnevers@bbn.com

failed createsliver with 10 min delay

comment:4 Changed 6 years ago by xyang@maxgigapop.net

Cc: tlehman@maxgigapop.net added

How about manually submitting the rspec to ION AM and then manually check sliverstatus using omni.py ?

Please attach the request rspec. I will try it with my test AM.

comment:5 Changed 6 years ago by lnevers@bbn.com

Tried manual create sliver 3 times today and all 3 worked. Tried 3 stitcher createsliver attempts and they all failed. Attaching the 2 RSpecs:

  • stitch-ig-gpo-ig-utah.rspec - RSpec generated by me and used with stitcher to request the failure scenario being investigated by this ticket.
  • lnstitch1-createsliver-request-11-ion-internet2-edu.xml - RSpec generated by stitcher when it attempts requests for I2/ION aggregate and used for the manual request.

Changed 6 years ago by lnevers@bbn.com

Changed 6 years ago by lnevers@bbn.com

Attachment: stitch-ig-gpo-ig-utah.rspec added

comment:6 Changed 6 years ago by lnevers@bbn.com

Attaching log of latest test run which which uses:

  • 300 seconds sleep between reservation attempts at ION or another DCN based aggregate
  • 60 seconds sleep between sliver status calls at ION or another DCN based aggregate

See the attached file stitch-pg-utah-ig-gpo-20130612.txt for full details. This is a summary of the ION error and warning reported during the test run:

07:37:18 WARNING  stitch.Aggregate: sliverstatus 20241 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
07:37:23 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20241 was (still): failed
07:43:59 WARNING  stitch.Aggregate: sliverstatus 20251 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
07:44:05 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20251 was (still): failed
07:50:21 WARNING  stitch.Aggregate: sliverstatus 20271 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
07:50:27 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20271 was (still): failed
07:56:46 WARNING  stitch.Aggregate: sliverstatus 20281 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
07:56:52 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20281 was (still): failed
08:03:09 WARNING  stitch.Aggregate: sliverstatus 20291 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:03:15 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20291 was (still): failed
08:09:32 WARNING  stitch.Aggregate: sliverstatus 20301 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:09:38 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20301 was (still): failed
08:15:56 WARNING  stitch.Aggregate: sliverstatus 20311 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:16:02 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20311 was (still): failed
08:22:19 WARNING  stitch.Aggregate: sliverstatus 20321 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:22:25 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20321 was (still): failed
08:28:42 WARNING  stitch.Aggregate: sliverstatus 20331 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:28:48 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20331 was (still): failed
08:35:05 WARNING  stitch.Aggregate: sliverstatus 20341 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:35:11 INFO     stitch.launcher: Will put <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> back in the pool to allocate. Got Sliver status for circuit 20341 was (still): failed
08:40:11 WARNING  stitch.Aggregate: Doing allocate on <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm> for 10th time!
08:41:29 WARNING  stitch.Aggregate: sliverstatus 20351 is (still) failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Delete and retry.
08:41:35 WARNING  stitcher: Stitching failed but will retry: Circuit reservation failed at <Aggregate urn:publicid:IDN+ion.internet2.edu+authority+cm>. Try again from the SCS

Changed 6 years ago by lnevers@bbn.com

comment:7 Changed 6 years ago by xyang@maxgigapop.net

Cc: xyang@maxgigapop.net added; ckotil@grnoc.iu.edu removed
Owner: changed from xyang@maxgigapop.net to ckotil@grnoc.iu.edu,
Status: assignednew

Fixed a bug in AM Java core code that could report false "FAILED" VLAN status.

This only happens by special timing of sliverstatus query.

BTW, the stitcher does not seem to respect the option "--ionStatusIntervalSecs 60". It still tried SliverStatus every couple of seconds.

Reassign to Chad to update the Java code via SVN and restart tomcat on ION AM.

comment:8 Changed 6 years ago by Aaron Helsinger

Huh - you're right, that was a bug in stitcher, that I've now fixed.

comment:9 Changed 6 years ago by ckotil@grnoc.iu.edu

Owner: changed from ckotil@grnoc.iu.edu, to ckotil@grnoc.iu.edu
Status: newassigned

I've updated the AM code and restarted tomcat6.

comment:10 Changed 6 years ago by lnevers@bbn.com

I have re-ran the test with the stitched sliver from IG GPO to IG Utah using using the original sliver status timing that used to cause the failure. The sliver was created successfully and the problem seems resolved. Is there anything else that needs to be done? If not, this ticket can be closed.

comment:11 Changed 6 years ago by Aaron Helsinger

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.