Opened 8 years ago
Last modified 8 years ago
#1355 accepted
SCS hang with timed out threads when a request is submitted with large number of nodes
Reported by: | lnevers@bbn.com | Owned by: | xyang@maxgigapop.net |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | STITCHING | Version: | SPIRAL7 |
Keywords: | GENI Network Stitching | Cc: | |
Dependencies: |
Description
When an 11-node AL2S star topology is submitted, the SCS becomes unresponsive and subsequent requests are not served.
Here is an example where the initial request is submitted:
10:11:32 INFO : Using SCS at http://nutshell.maxgigapop.net:8081/geni/xmlrpc
10:11:32 INFO : Calling SCS...
StitchingServiceFailedError: Error from Stitching Service: code 2: Timeout: no response received from computing core!
and subsequent requests fail:
10:16:30 INFO : Using SCS at http://nutshell.maxgigapop.net:8081/geni/xmlrpc
10:16:30 INFO : Calling SCS...
ERROR <Fault -500: "Unexpected error executing code for particular method, detected by Xmlrpc-c method registry code. Method did not fail; rather, it did not complete at all. Xmlrpc-c user's xmlrpc_c::method object's 'execute method' failed to set the RPC result value.">
According to Xi:
On 11/17/14 12:46 PM, Xi Yang wrote:
When this happens, SCS has become unstable. This is usually because of a number of timed out compute threads stuck internally. I need to fix this as a bug. Or monitor this and try restart SCS when this occurs.
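The "monitor and restart" option Xi mentions could look roughly like the sketch below. It is only an illustration of the idea, not the script actually used: `probe` and `restart` are hypothetical callables supplied by the operator (for example, a time-bounded request to the SCS and a service restart command), and the threshold of three consecutive failures is an assumed value.

```python
from typing import Callable


def watch_once(probe: Callable[[], bool],
               restart: Callable[[], None],
               failures: int,
               max_failures: int = 3) -> int:
    """Run one watchdog tick and return the new consecutive-failure count.

    probe        -- hypothetical health check; returns True when the SCS answers
    restart      -- hypothetical action that restarts the SCS
    failures     -- consecutive failures seen before this tick
    max_failures -- assumed threshold before a restart is triggered
    """
    try:
        healthy = probe()
    except Exception:
        # A probe that raises (timeout, connection error) counts as a failure.
        healthy = False

    if healthy:
        return 0

    failures += 1
    if failures >= max_failures:
        restart()
        return 0  # start counting again after the restart
    return failures
```

Calling `watch_once` from a periodic job and carrying the returned count between runs avoids restarting the SCS on a single slow response.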
Change History (5)
comment:1 follow-up: 2 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:2 Changed 8 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Replying to xyang@…:
Fixed in trunk code. Will be passed into branch r2.5 that NOC will be using.
Is this version running at the Internet2 SCS?
Just ran into the problem with a 13 node topology:
11:15:35 INFO : Using SCS at http://geni-scs.net.internet2.edu:8081/geni/xmlrpc
11:15:35 INFO : Calling SCS...
ERROR <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
11:15:38 ERROR : Exception from slice computation service: <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
SCS gave error: <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
According to Chad, a segfault took place. Excerpt from chat:
11:17 Dec 11 16:15:34 The segfault
11:17 Dec 11 16:15:35 geni-scs.net.internet2.edu kernel: mxtce[19870]: segfault at 13 ip 000000000048d8a9 sp 00007f0ecfffdc10 error 4 in mxtce[400000+118000]
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrt[19871]: Saved core dump of pid 17552 (/usr/local/geni-scs-r2.5/src/main/mxtce) to /var/spool/abrt/ccpp-2014-12-11-16:15:35-17552 (399491072 bytes)
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Directory 'ccpp-2014-12-11-16:15:35-17552' creation detected
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Executable '/usr/local/geni-scs-r2.5/src/main/mxtce' doesn't belong to any package and ProcessUnpackaged is set to 'no'
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: 'post-create' on '/var/spool/abrt/ccpp-2014-12-11-16:15:35-17552' exited with 1
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2014-12-11-16:15:35-17552'
11:17 but ive got Xi's script to start it up again (runs every minute via cron). if the process has died it gets started up again
11:17 not ideal, but it works
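For triage, the kernel segfault line in the excerpt above can be decoded mechanically. The sketch below is not part of the SCS; the function name and the returned field names are made up for illustration. It assumes the x86 page-fault error-code layout (bit 0: page present, bit 1: write access, bit 2: user mode). A very low fault address such as 0x13 often points to a null-pointer dereference at a small field offset, but that is an inference, not something the log states.

```python
import re

# Matches the kernel's segfault line in the form shown in this ticket.
SEGV_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<image>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped image.
        "ip_offset": ip - base,
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }


line = ("Dec 11 16:15:35 geni-scs.net.internet2.edu kernel: mxtce[19870]: "
        "segfault at 13 ip 000000000048d8a9 sp 00007f0ecfffdc10 error 4 "
        "in mxtce[400000+118000]")
info = decode_segfault(line)
```

For the line in this ticket, the decoded fault is a user-mode read of an unmapped page at address 0x13, with the faulting instruction at offset 0x8d8a9 into the `mxtce` mapping.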
comment:3 follow-up: 4 Changed 8 years ago by
Status: | reopened → accepted |
---|
Does this happen on the nutshell or oingo server?
cc Chad: Run the following and then restart SCS. See if that helps.
ulimit -s 65536
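In bash, `ulimit -s 65536` sets the soft stack limit to 65536 KiB (64 MiB) for the shell and the processes it starts. If the SCS were launched from a Python wrapper instead of a shell, the same limit could be applied as sketched below. This is an illustration under assumptions: the helper names are hypothetical, the code is Unix-only, and the ticket does not say how the SCS is actually started.

```python
import resource
import subprocess


def stack_limit_bytes(kib):
    """Convert a `ulimit -s` value (KiB) to the byte count setrlimit expects."""
    return kib * 1024


def clamp_to_hard(soft, hard):
    """Keep the requested soft limit within the hard limit, if one is set."""
    if hard == resource.RLIM_INFINITY:
        return soft
    return min(soft, hard)


def start_with_stack(cmd, stack_kib=65536):
    """Start `cmd` with its soft stack limit set to `stack_kib` KiB."""
    wanted = stack_limit_bytes(stack_kib)

    def set_stack_limit():
        # Runs in the child just before exec, so only the child is affected.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK,
                           (clamp_to_hard(wanted, hard), hard))

    return subprocess.Popen(cmd, preexec_fn=set_stack_limit)
```

A call such as `start_with_stack(["/usr/local/geni-scs-r2.5/src/main/mxtce"])` would use the binary path from the core-dump log; any arguments the real start-up needs are not given in this ticket.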
comment:4 Changed 8 years ago by
Replying to xyang@…:
Does this happen on the nutshell or oingo server?
cc Chad: Run the following and then restart SCS. See if that helps.
ulimit -s 65536
This result is from the Internet2 SCS. Sorry I did not say so explicitly, but it was in the command output.
I am also seeing crashes when running simpler 3-node linear scenarios. Here is a sequence:
- 1st request at 11:36:03 - ok
- 2nd at 11:38:21 - failed
- 3rd at 11:41:40 - failed
- 4th request at 11:44:17 - ok
Note: Only one request was in flight at the time of each attempt above, and there was no other test activity.
Also, failed =
<ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
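The three failure signatures reported in this ticket can be told apart on the client side. The sketch below is illustrative only: it uses Python 3's `xmlrpc.client` rather than whatever the client tool in the logs uses, the function name and the category labels are invented here, and the timeout case is matched on the message text quoted in the description because the ticket does not show that exception's type.

```python
import xmlrpc.client


def classify_scs_failure(exc):
    """Map an exception from an SCS call to a coarse failure category."""
    if isinstance(exc, xmlrpc.client.ProtocolError):
        # HTTP-level failure, e.g. the errcode -1 responses seen in this ticket.
        return "transport"
    if isinstance(exc, xmlrpc.client.Fault):
        # -500 is the Xmlrpc-c "method did not complete" fault from the description.
        return "server-fault" if exc.faultCode == -500 else "fault"
    if "no response received from computing core" in str(exc):
        return "compute-timeout"
    return "unknown"
```

Given that the restart script runs from cron every minute, a client that sees "transport" might reasonably wait at least that long before retrying, though the ticket does not establish that a retry is always safe.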
comment:5 Changed 8 years ago by
After the SCS was restarted, I tried to reproduce the problem but could not. I re-ran all tests and also significantly increased the request load, with no failures.
Will try to reproduce again after the SCS service has been up for a while.
Fixed in trunk code. Will be passed into branch r2.5 that NOC will be using.