Opened 8 years ago

Closed 8 years ago

#84 closed task (fixed)

Job service stalls

Reported by: divyashri.bhat@gmail.com Owned by: divyashri.bhat@gmail.com
Priority: major Milestone: GEC20
Component: Job Service Version: Sprint4
Keywords: Cc:
Dependencies:

Description

While using the job service running at http://emmy9.casa.umass.edu:8003, I ran into the following problem:

  1. When there are many "Running" processes, the job service stalls and leaves new jobs in "Pending" status. To identify the source of this problem, I looked at the logs of the "Running" processes.

The EC tries to connect to an RC that is either not up or does not exist, and it stays in that state while the job status still shows as "Running".

STDOUT: 11:26:21 INFO  OmfEc::Experiment: Experiment: dbhat-2014-04-11T10-18-13-05-00 starts
STDOUT: 11:26:21 INFO  OmfEc::Experiment: Configure 'nodea-labwikicrashtest' to join 'Source1'
STDOUT: 11:26:21 INFO  OmfEc::Experiment: Configure 'nodeb-labwikicrashtest' to join 'Source2'
STDOUT: 11:26:21 INFO  OmfEc::Experiment: Configure 'nodec-labwikicrashtest' to join 'Source3'
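The stall described above can be reproduced with a toy model. This is only a sketch under stated assumptions: the slot limit (`max_running`), the `simulate` function and the "Finished" status are hypothetical and are not taken from the job service code. It shows how jobs that wait forever for an RC can hold every slot, so later jobs never leave "Pending".

```python
def simulate(jobs, max_running=2, ticks=5):
    """Toy slot-limited scheduler (hypothetical, not the job service code).

    jobs is a list of (name, rc_up) pairs. A job only completes when its
    RC is up; without any timeout, a job with rc_up=False keeps its slot.
    """
    pending = list(jobs)
    running = []
    status = {name: "Pending" for name, _ in jobs}
    for _ in range(ticks):
        # A running job finishes only if its RC actually joined.
        for job in list(running):
            name, rc_up = job
            if rc_up:
                running.remove(job)
                status[name] = "Finished"  # assumed status name
        # Free slots are handed to the oldest pending jobs.
        while pending and len(running) < max_running:
            job = pending.pop(0)
            running.append(job)
            status[job[0]] = "Running"
    return status


# Two jobs wait on missing RCs and occupy both slots; the third never starts.
print(simulate([("a", False), ("b", False), ("c", True)]))
# {'a': 'Running', 'b': 'Running', 'c': 'Pending'}
```

In this model the only ways out are the ones tried below: removing the stuck jobs or restarting the service.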

To resolve this problem, I tried:

  1. Deleting all jobs whose status was "Running" but which were only waiting for an RC to connect.
  2. Restarting the job service on emmy9.

After this, the experiments ran successfully.

I am not sure if all of these resources are listed in the AMQP database.

But, suppose these resources are listed in the AMQP database and are deleted by the experimenter or Aggregate Manager, and at a later time, the experimenter tries to connect to these resources that do not actually exist:

  1. How long will the EC wait for these RCs to connect?
  2. With several such jobs, will the job service continue to block and thus prevent other experiments from running?

Change History (11)

comment:1 Changed 8 years ago by divyashri.bhat@gmail.com

Component: changed from Authentication/Authorization to Other

comment:2 Changed 8 years ago by johren@bbn.com

Component: changed from Other to Job Service
Milestone: set to GEC20
Owner: changed from somebody to jack.hong@nicta.com.au
Version: changed from SPIRAL6 to Sprint1

comment:3 Changed 8 years ago by johren@bbn.com

Thierry will check with Max on the status of this issue. The EC does not currently time out if resources are not available. The proposed solution is to add timeout functionality to the scheduler.

It is possible to do a timeout in OEDL. Mike asked for an example of how this is done to be sent. Divya will try this out.

comment:4 Changed 8 years ago by johren@bbn.com

On 4/16/14, 7:28 PM, Thierry Rakotoarivelo wrote:

I have added 2 features to the EC that go towards this. http://mytestbed.net/issues/1736 http://mytestbed.net/issues/1737

Using an EC with these 2 commit changes in place, you can now have a timeout timer which will abort the experiment after a given time if one or more resources have not joined all the groups defined in your experiments. In other words, if a resource is not available or cannot communicate with the EC, the experiment will stop after a given time.
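The behaviour described in this email can be pictured as a wait with a deadline. The following is a minimal sketch and not the EC implementation: `wait_for_resources`, its parameters and the `joined_so_far` callable are hypothetical names used only for illustration.

```python
import time


def wait_for_resources(expected, joined_so_far, timeout_s, poll_s=0.01):
    """Poll until every expected resource has joined or the deadline passes.

    expected:      names of the resources the experiment needs
    joined_so_far: callable returning the names that have joined (hypothetical)
    Returns a sorted list of missing resources; an empty list means all joined.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        missing = sorted(set(expected) - set(joined_so_far()))
        if not missing:
            return missing  # everything joined, the experiment can proceed
        if time.monotonic() >= deadline:
            return missing  # caller should abort the experiment
        time.sleep(poll_s)
```

A caller would abort the experiment when the returned list is non-empty, which is the point at which a job stuck in "Running" would otherwise wait indefinitely.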

comment:5 Changed 8 years ago by johren@bbn.com

Version: changed from Sprint1 to Sprint3

Pre-release version of EC gem deployed. Divya ran some tests. Timeout worked. Will need to add this timeout to example scripts. Update the EC when Thierry gives the okay. Divya will do another week of testing with the pre-release version.

comment:6 Changed 8 years ago by divyashri.bhat@gmail.com

For Job Service cleanup:

  1. Leave the timeout OEDL script implemented by Thierry running in the background for every experiment.
  2. When timeout occurs, transition job to "Failed" status (?)

What should be the default timeout value for RCs to connect?
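The second cleanup step could be modelled as a small status table. This is a sketch only: the "Finished" status, the `ALLOWED` table and both function names are assumptions, and the ticket itself leaves open whether "Failed" is the right target status.

```python
# Assumed status graph; only "Pending", "Running" and "Failed" appear in the ticket.
ALLOWED = {
    "Pending": {"Running"},
    "Running": {"Finished", "Failed"},
}


def transition(current, new):
    """Return the new status, rejecting moves the table does not allow."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


def on_experiment_exit(status, timed_out):
    """Map the end of an experiment to a job status (hypothetical helper)."""
    return transition(status, "Failed" if timed_out else "Finished")
```

With this table, a job whose RCs never joined ends in "Failed" rather than staying in "Running".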

comment:7 Changed 8 years ago by johren@bbn.com

Jack will check to see if a default timeout can be added to the scheduler that can then be overridden by the script. He will update the ticket.
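One way to read "a default that can be overridden by the script" is sketched here. The property name `resource_timeout`, the function name and the 600 second default are placeholders chosen for illustration; the ticket does not state the actual default or how a script would pass its value.

```python
DEFAULT_TIMEOUT_S = 600.0  # placeholder value, not given in the ticket


def effective_timeout(script_props, default_s=DEFAULT_TIMEOUT_S):
    """Pick the script's timeout if it sets one, otherwise the scheduler default.

    script_props is a dict of properties defined by the experiment script;
    the key "resource_timeout" is a hypothetical name.
    """
    value = script_props.get("resource_timeout")
    if value is None:
        return default_s
    value = float(value)
    if value <= 0:
        raise ValueError("timeout must be positive")
    return value
```

For example, `effective_timeout({})` falls back to the default, while `effective_timeout({"resource_timeout": "120"})` uses the script's value.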

comment:8 Changed 8 years ago by jack.hong@nicta.com.au

See: http://mytestbed.net/projects/omf6/wiki/OEDLOMF6#omf_ecbackwardtimeout_resources

  • This OEDL file timeout_resources contains a prop definition, and logic to stop exp.
  • To use it, "loadOEDL('omf_ec/backward/timeout_resources')" has to be included in the OEDL.
  • The OEDL file is not included in EC by default.
  • Loading additional OEDL files is not configurable via EC config files yet...

Therefore, there are two options:

  1. Extend the EC so that loading additional files can be configured via the config file or a command-line option. Then include the timeout_resources file on the instance.
  2. As Max pointed out in the email, add @hard_timeout support to the scheduler, i.e. a maximum duration allowed for all experiments.

J.

comment:9 Changed 8 years ago by jack.hong@nicta.com.au

The hard_timeout config option has now been added to the job service in commit #ccd688b.
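A hard timeout in the scheduler might work roughly as follows. This is a sketch under assumptions and not the code from the commit: the `Job` class, `enforce_hard_timeout` and the choice of "Failed" as the resulting status are hypothetical, and a real scheduler would also need to stop the underlying EC process.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    started_at: float  # seconds, on whatever clock the scheduler uses
    status: str = "Running"


def enforce_hard_timeout(jobs, now, hard_timeout_s):
    """Fail every running job that has exceeded hard_timeout_s.

    Returns the names of the jobs that were expired so that the caller
    can free their slots (and, in a real system, kill their processes).
    """
    expired = []
    for job in jobs:
        if job.status == "Running" and now - job.started_at > hard_timeout_s:
            job.status = "Failed"  # assumed target status
            expired.append(job.name)
    return expired
```

Running such a sweep periodically bounds how long a job waiting on a missing RC can hold a slot, regardless of whether the experiment script loads its own timeout.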

comment:10 Changed 8 years ago by johren@bbn.com

Owner: changed from jack.hong@nicta.com.au to divyashri.bhat@gmail.com
Status: changed from new to assigned
Version: changed from Sprint3 to Sprint4

Reassign to Divya to test and close if this resolves the issue.

comment:11 Changed 8 years ago by johren@bbn.com

Resolution: fixed
Status: changed from assigned to closed

Divya has not seen this issue.
