Opened 9 years ago
Closed 9 years ago
#84 closed task (fixed)
Job service stalls
Reported by: | divyashri.bhat@gmail.com | Owned by: | divyashri.bhat@gmail.com |
---|---|---|---|
Priority: | major | Milestone: | GEC20 |
Component: | Job Service | Version: | Sprint4 |
Keywords: | Cc: | ||
Dependencies: |
Description
While using the job service running at http://emmy9.casa.umass.edu:8003, I ran into the following problem:
- When there are many "Running" processes, the job service stalls and puts the jobs in "Pending" status. To try to identify the source of this problem, I looked at the logs of the "Running" processes.
The EC tries to connect to an RC which is either not up or does not exist and stays in that state while still showing the job status as "Running".
STDOUT: 11:26:21 INFO OmfEc::Experiment: Experiment: dbhat-2014-04-11T10-18-13-05-00 starts
STDOUT: 11:26:21 INFO OmfEc::Experiment: Configure 'nodea-labwikicrashtest' to join 'Source1'
STDOUT: 11:26:21 INFO OmfEc::Experiment: Configure 'nodeb-labwikicrashtest' to join 'Source2'
STDOUT: 11:26:21 INFO OmfEc::Experiment: Configure 'nodec-labwikicrashtest' to join 'Source3'
To resolve this problem, I tried:
- Deleting all jobs with status "Running" that were in fact only waiting for an RC to connect.
- Restarting the job service on emmy9.
After this, the experiments ran successfully.
I am not sure if all of these resources are listed in the AMQP database.
But, suppose these resources are listed in the AMQP database and are deleted by the experimenter or Aggregate Manager, and at a later time, the experimenter tries to connect to these resources that do not actually exist:
- How long will the EC wait for these RCs to connect?
- With several such jobs, will the job service continue to block and thus prevent other experiments from running?
Change History (11)
comment:1 Changed 9 years ago by
Component: | Authentication/Authorization → Other |
---|
comment:2 Changed 9 years ago by
Component: | Other → Job Service |
---|---|
Milestone: | → GEC20 |
Owner: | changed from somebody to jack.hong@nicta.com.au |
Version: | SPIRAL6 → Sprint1 |
comment:3 Changed 9 years ago by
comment:4 Changed 9 years ago by
On 4/16/14, 7:28 PM, Thierry Rakotoarivelo wrote:
I have added 2 features to the EC that go towards this. http://mytestbed.net/issues/1736 http://mytestbed.net/issues/1737
With an EC that includes these two commits, you can now set a timeout timer that aborts the experiment after a given time if one or more resources have not joined all the groups defined in your experiment. In other words, if a resource is not available or cannot communicate with the EC, the experiment will stop after a given time.
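The behaviour described here can be illustrated with a minimal Ruby sketch. This is not the EC's actual code from the two linked issues; `wait_for_resources`, its parameters, and the returned hash are hypothetical names chosen for illustration. The block passed in stands in for whatever mechanism reports which resources have joined so far.

```ruby
# Hypothetical sketch, not the OMF EC implementation.
# Polls until every expected resource has joined, or gives up after
# `timeout` seconds and reports which resources are still missing.
def wait_for_resources(expected, timeout:, poll_interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    joined = yield                      # block returns names joined so far
    missing = expected - joined
    return { status: :ready, missing: [] } if missing.empty?

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      return { status: :aborted, missing: missing }
    end
    sleep poll_interval
  end
end

expected = %w[nodea-labwikicrashtest nodeb-labwikicrashtest nodec-labwikicrashtest]
# Only one resource ever joins, so the wait is abandoned after 0.2 seconds.
result = wait_for_resources(expected, timeout: 0.2) { ['nodea-labwikicrashtest'] }
puts "#{result[:status]}: still waiting for #{result[:missing].join(', ')}"
```

The point of the sketch is that the wait is bounded: an RC that never comes up leads to an abort with a list of missing resources instead of a job that stays "Running" indefinitely.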
comment:5 Changed 9 years ago by
Version: | Sprint1 → Sprint3 |
---|
Pre-release version of EC gem deployed. Divya ran some tests. Timeout worked. Will need to add this timeout to example scripts. Update the EC when Thierry gives the okay. Divya will do another week of testing with the pre-release version.
comment:6 Changed 9 years ago by
For Job Service cleanup:
- Leave timeout OEDL script implemented by Thierry running in the background for every experiment.
- When timeout occurs, transition job to "Failed" status (?)
What should be the default timeout value for RCs to connect?
comment:7 Changed 9 years ago by
Jack will check to see if a default timeout can be added to the scheduler that can then be overridden by the script. He will update the ticket.
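A rough sketch of what "a scheduler default that the script can override" might look like, together with the proposed transition to "Failed", is given below. All names (`DEFAULT_RESOURCE_TIMEOUT`, `effective_timeout`, `status_after`) are hypothetical, and the value of 300 seconds is only a placeholder, since the default is still an open question in this ticket.

```ruby
# Hypothetical names throughout; the job service scheduler may differ.
# Placeholder value: the ticket has not settled on a default.
DEFAULT_RESOURCE_TIMEOUT = 300 # seconds

# A positive timeout set by the script wins; anything else falls back
# to the scheduler default.
def effective_timeout(script_timeout, default = DEFAULT_RESOURCE_TIMEOUT)
  script_timeout.is_a?(Numeric) && script_timeout > 0 ? script_timeout : default
end

# Status a job would have after `elapsed` seconds, given whether all of
# its resources have joined.
def status_after(elapsed, timeout, resources_joined)
  return 'Running' if resources_joined
  elapsed >= timeout ? 'Failed' : 'Running'
end

puts effective_timeout(nil)                        # falls back to the default
puts effective_timeout(60)                         # script override
puts status_after(301, effective_timeout(nil), false)
```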
comment:8 Changed 9 years ago by
See: http://mytestbed.net/projects/omf6/wiki/OEDLOMF6#omf_ecbackwardtimeout_resources
- This OEDL file timeout_resources contains a prop definition, and logic to stop exp.
- To use it, "loadOEDL('omf_ec/backward/timeout_resources')" has to be included in the OEDL.
- The OEDL file is not included in EC by default.
- Loading additional OEDL files is not configurable via EC config files yet...
Therefore, two options:
- Extend the EC to allow loading additional files to be configured via a config file or command-line option. Then include the timeout_resources file on the instance.
- As Max pointed out in the email, add @hard_timeout support to the scheduler, i.e. a maximum duration allowed for all experiments.
J.
comment:9 Changed 9 years ago by
hard_timeout cfg option now added to job service as in commit #ccd688b
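For readers unfamiliar with the idea, a hard timeout of this kind can be sketched in plain Ruby as follows. This does not reproduce the referenced commit; `run_with_hard_timeout` and the returned symbols are hypothetical, and the child process here is a stand-in for an EC run.

```ruby
require 'rbconfig'

# Hypothetical sketch; it does not reproduce the job service commit.
# Runs `cmd` as a child process and terminates it once `hard_timeout`
# seconds have elapsed, so one stuck experiment cannot block the queue.
def run_with_hard_timeout(cmd, hard_timeout)
  pid = Process.spawn(*cmd)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + hard_timeout
  loop do
    # With WNOHANG, Process.wait returns nil while the child is running.
    if Process.wait(pid, Process::WNOHANG)
      return $?.success? ? :finished : :failed
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      Process.kill('TERM', pid) # assumes the child exits on SIGTERM
      Process.wait(pid)         # reap the child to avoid a zombie
      return :timed_out
    end
    sleep 0.05
  end
end

# Stand-in for an experiment that never completes on its own.
stuck_job = [RbConfig.ruby, '-e', 'sleep 5']
puts run_with_hard_timeout(stuck_job, 0.3)
```

Unlike the OEDL-level timeout, this approach needs no change to the experiment script, which is why it works as a catch-all maximum duration for every job.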
comment:10 Changed 9 years ago by
Owner: | changed from jack.hong@nicta.com.au to divyashri.bhat@gmail.com |
---|---|
Status: | new → assigned |
Version: | Sprint3 → Sprint4 |
Reassign to Divya to test and close if this resolves the issue.
comment:11 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Divya has not seen this issue.
Thierry will check with Max on the status of this issue. The EC does not currently time out if resources are not available. The proposed solution is to add timeout functionality to the scheduler.
It is possible to do a timeout in OEDL. Mike asked for an example of how this is done. Divya will try this out.