Opened 9 years ago

Last modified 9 years ago

#1355 accepted

SCS hang with timed out threads when a request is submitted with large number of nodes

Reported by: lnevers@bbn.com
Owned by: xyang@maxgigapop.net
Priority: major
Milestone:
Component: STITCHING
Version: SPIRAL7
Keywords: GENI Network Stitching
Cc:
Dependencies:

Description

In a scenario where an 11-node AL2S star topology is submitted, the SCS becomes unresponsive and subsequent requests are not processed.

Here is an example where the initial request is submitted:

10:11:32 INFO    : Using SCS at
http://nutshell.maxgigapop.net:8081/geni/xmlrpc
10:11:32 INFO    : Calling SCS...
StitchingServiceFailedError: Error from Stitching Service: code 2:
Timeout: no response received from computing core!

and subsequent requests fail:

10:16:30 INFO    : Using SCS at
http://nutshell.maxgigapop.net:8081/geni/xmlrpc
10:16:30 INFO    : Calling SCS...
ERROR <Fault -500: "Unexpected error executing code for particular
method, detected by Xmlrpc-c method registry code.  Method did not fail;
rather, it did not complete at all. Xmlrpc-c user's xmlrpc_c::method 
object's 'execute method' failed to set the RPC result value.">

According to Xi:

On 11/17/14 12:46 PM, Xi Yang wrote:
When this happens, SCS has become unstable. This is usually because 
of a number of timed out compute threads stuck internally. 
I need to fix this as a bug, or monitor for it and try to restart SCS when it occurs.

Change History (5)

comment:1 Changed 9 years ago by xyang@maxgigapop.net

Resolution: fixed
Status: new → closed

Fixed in trunk code. Will be passed into branch r2.5 that NOC will be using.

comment:2 in reply to:  1 Changed 9 years ago by lnevers@bbn.com

Resolution: fixed
Status: closed → reopened

Replying to xyang@…:

Fixed in trunk code. Will be passed into branch r2.5 that NOC will be using.

Is this version running at the Internet2 SCS?

Just ran into the problem with a 13 node topology:

11:15:35 INFO    : Using SCS at http://geni-scs.net.internet2.edu:8081/geni/xmlrpc
11:15:35 INFO    : Calling SCS...
ERROR <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
11:15:38 ERROR   : Exception from slice computation service: <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
SCS gave error: <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >

According to Chad, a segfault took place. Excerpt from chat:

11:17 Dec 11 16:15:34 The segfault
11:17 Dec 11 16:15:35 geni-scs.net.internet2.edu kernel: mxtce[19870]: segfault at 13 ip 000000000048d8a9 sp 00007f0ecfffdc10 error 4 in mxtce[400000+118000]
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrt[19871]: Saved core dump of pid 17552 (/usr/local/geni-scs-r2.5/src/main/mxtce) to /var/spool/abrt/ccpp-2014-12-11-16:15:35-17552 (399491072 bytes)
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Directory 'ccpp-2014-12-11-16:15:35-17552' creation detected
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Executable '/usr/local/geni-scs-r2.5/src/main/mxtce' doesn't belong to any package and ProcessUnpackaged is set to 'no'
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: 'post-create' on '/var/spool/abrt/ccpp-2014-12-11-16:15:35-17552' exited with 1
11:17 Dec 11 16:15:38 geni-scs.net.internet2.edu abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2014-12-11-16:15:35-17552'
11:17 but ive got Xi's script to start it up again (runs every minute via cron). if the process has died it gets started up again
11:17 not ideal, but it works
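
The restart script mentioned in the chat is not shown in this ticket. A minimal sketch of that kind of cron-driven watchdog could look like the following, assuming the process name `mxtce` from the kernel log above. The start command is a placeholder and the function names are hypothetical.

```python
import subprocess


def is_running(process_name):
    """Return True if a process with exactly this name exists (via pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # pgrep exits with 0 when at least one process matched.
    return result.returncode == 0


def ensure_running(process_name, start_cmd, alive=is_running, launch=subprocess.Popen):
    """Start the process if it is not running.

    Returns True if a start was attempted, False if it was already up.
    `alive` and `launch` are injectable so the logic can be tested
    without touching real processes.
    """
    if alive(process_name):
        return False
    launch(start_cmd)
    return True


# Example call from a script run every minute by cron.
# The start command below is an assumption, not the real SCS invocation:
# ensure_running("mxtce", ["/path/to/start-scs.sh"])
```

A check like this only notices a process that has died. It would not catch the hang in the ticket description, where the process stays alive but stops answering requests.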

comment:3 Changed 9 years ago by xyang@maxgigapop.net

Status: reopened → accepted

Does this happen on the nutshell or oingo server?

cc Chad: Run the following and then restart SCS, and see if that helps:

ulimit -s 65536
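
For a service started from a script rather than an interactive shell, the same stack limit can be applied at launch time. The sketch below is one way to do that in Python, assuming a Unix host. `launch_with_stack_limit` is a hypothetical helper, and unlike the shell builtin it only changes the soft limit. The `ulimit -s` value is in units of 1024 bytes, so 65536 corresponds to 64 MiB.

```python
import resource
import subprocess


def kib_to_bytes(kib):
    """Convert a `ulimit -s` style value (KiB) to bytes for setrlimit."""
    return kib * 1024


def launch_with_stack_limit(cmd, stack_kib=65536):
    """Start cmd with its soft stack limit set to stack_kib KiB."""
    wanted = kib_to_bytes(stack_kib)

    def set_stack_limit():
        # Runs in the child process between fork and exec.
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        if hard == resource.RLIM_INFINITY:
            new_soft = wanted
        else:
            # The soft limit may not exceed the hard limit.
            new_soft = min(wanted, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))

    return subprocess.Popen(cmd, preexec_fn=set_stack_limit)
```

On Linux with glibc, the default stack size of new threads is normally derived from this soft limit at program start, which may be why the setting is relevant to a multi-threaded process such as `mxtce`. The ticket itself does not confirm the cause.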

comment:4 in reply to:  3 Changed 9 years ago by lnevers@bbn.com

Replying to xyang@…:

Does this happen on the nutshell or oingo server?

cc Chad: Run the following and then restart SCS, and see if that helps:

ulimit -s 65536

This result is from the Internet2 SCS. Sorry I did not say so explicitly, but it was in the command output.

I am also seeing crashes when running simpler 3-node linear scenarios. Here is a sequence:

  • 1st request at 11:36:03 - ok
  • 2nd request at 11:38:21 - failed
  • 3rd request at 11:41:40 - failed
  • 4th request at 11:44:17 - ok

Note: Only one request was made at the time of each attempt above; there was no other test activity.

Also, failed = <ProtocolError for geni-scs.net.internet2.edu:8081/geni/xmlrpc: -1 >
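
The client-side errors quoted in this ticket include an XML-RPC `Fault -500` and a `ProtocolError ... -1`. When logging repeated attempts like the sequence above, it can help to tell these apart. Below is a minimal sketch using Python 3's `xmlrpc.client`; the label strings and function names are hypothetical and are not part of the stitcher client.

```python
import socket
import xmlrpc.client


def classify_scs_failure(exc):
    """Map an exception from an XML-RPC call to a coarse label."""
    if isinstance(exc, xmlrpc.client.Fault):
        # The server answered, but the method reported a fault
        # (for example the Fault -500 in the description).
        return "server_fault"
    if isinstance(exc, xmlrpc.client.ProtocolError):
        # The HTTP exchange itself failed (for example ProtocolError -1).
        return "protocol_error"
    if isinstance(exc, (socket.timeout, ConnectionError)):
        # No usable connection to the server at all.
        return "unreachable"
    return "unknown"


def probe(call):
    """Run call() and return "ok" or a failure label."""
    try:
        call()
    except Exception as exc:
        return classify_scs_failure(exc)
    return "ok"
```

Here `call` would be any zero-argument function that issues one request to the SCS, so a test run can record one label per attempt next to its timestamp.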

comment:5 Changed 9 years ago by lnevers@bbn.com

After the SCS was restarted, I tried to duplicate the problem without success. I re-ran all tests and also significantly increased the request load, but could not reproduce it.

Will try to reproduce again after the SCS service has been up for a while.
