Opened 10 years ago
Closed 10 years ago
#111 closed (fixed)
FOAM sliver not cleared at expiration time
Reported by: | lnevers@bbn.com | Owned by: | vjo@duke.edu |
---|---|---|---|
Priority: | major | Milestone: | EG-EXP-6 |
Component: | AM | Version: | SPIRAL4 |
Keywords: | foam | Cc: | |
Dependencies: |
Description
Expired FOAM slivers are not cleared at the ExoGENI FOAM. In a scenario where no sliver existed for my user in the ExoGENI rack, I created a new slice named "EG-EXP-6-exp1" and then tried to create a sliver at the ExoGENI FOAM.
The sliver create reported that the sliver already existed. The sliverstatus shows that the sliver expired 26 days ago:
$ omni.py -a of-exobbn sliverstatus EG-EXP-6-exp1 -V1 INFO:omni:Loading config file omni_config INFO:omni:Using control framework pgeni INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 expires on 2012-09-27 14:01:23 UTC INFO:omni:Substituting AM nickname of-exobbn with URL https://bbn-hn.exogeni.net:3626/foam/gapi/1, URN unspecified_AM_URN INFO:omni:Status of Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1: INFO:omni:Sliver status for Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 at AM URL https://bbn-hn.exogeni.net:3626/foam/gapi/1 INFO:omni:{ "geni_status": "ready", "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:21095873-0df9-4254-ba15-9af4e817d3a2", "foam_pend_reason": "Request has underspecified VLAN requests", "foam_expires": "2012-08-30 00:00:00+00:00", "geni_resources": [ { "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:21095873-0df9-4254-ba15-9af4e817d3a2", "geni_error": "", "geni_status": "ready" } ], "foam_status": "Approved" } INFO:omni: ------------------------------------------------------------ INFO:omni: Completed sliverstatus:
Change History (17)
comment:1 Changed 10 years ago by
comment:2 Changed 10 years ago by
Owner: | changed from somebody to jbs@bbn.com |
---|---|
Status: | new → assigned |
I'm taking a look at this now, since we're the FOAM admins for bbn-hn.
At a glance, it looks like the cron jobs are there:
[09:40:18] jbs@bbn-hn:/home/jbs +$ cat /etc/cron.d/foam.cron # # Regular cron jobs for the foam package # 0 * * * * nginx [ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire 40 0,6,12,18 * * * nginx [ -x /opt/foam/bin/expire-emails ] && /opt/foam/bin/expire-emails 0 3 * * * nginx [ -x /opt/foam/bin/daily-queue ] && /opt/foam/bin/daily-queue
I'll check the logs to see if there's anything interesting there.
comment:3 Changed 10 years ago by
Luisa, is this also happening on rci-hn? I can check if you're not sure.
comment:4 follow-up: 5 Changed 10 years ago by
Josh, same problem at RENCI:
lnevers@arendia:~/gcf-2.0$ omni.py -a of-exorci sliverstatus EG-EXP-6-exp1 -V1 INFO:omni:Loading config file omni_config INFO:omni:Using control framework pgeni INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 expires on 2012-09-27 14:01:23 UTC INFO:omni:Substituting AM nickname of-exorci with URL https://rci-hn.exogeni.net:3626/foam/gapi/1, URN unspecified_AM_URN INFO:omni:Status of Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1: INFO:omni:Sliver status for Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 at AM URL https://rci-hn.exogeni.net:3626/foam/gapi/1 INFO:omni:{ "geni_status": "ready", "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:aa5e1141-a4b6-4e02-ab01-75cf84d74382", "foam_pend_reason": "Request has underspecified VLAN requests", "foam_expires": "2012-08-30 00:00:00+00:00", "geni_resources": [ { "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:aa5e1141-a4b6-4e02-ab01-75cf84d74382", "geni_error": "", "geni_status": "ready" } ], "foam_status": "Approved" }
This type of problem has also happened at the InstaGENI rack. Are all GENI racks running the same FOAM and FlowVisor? versions?
comment:5 follow-up: 6 Changed 10 years ago by
Replying to lnevers@bbn.com:
Josh, same problem at RENCI:
Ok, good to know.
This type of problem has also happened at the InstaGENI rack. Are all GENI racks running the same FOAM and FlowVisor? versions?
Similar, but not identical, no.
comment:6 Changed 10 years ago by
Replying to jbs@bbn.com:
Similar, but not identical, no.
From an experimenter perspective, I have had 5 instances were my slices were expired but the OF slivers still existed. Looks the same from the outside, some cleanup is not taking place.
Instances have been seen at ExoGENI BBN, ExoGENI RCI, InstaGENI Utah. I believe, this also occurred with a OF sliver at Rutgers, but was not able to get the information to support this.
comment:7 Changed 10 years ago by
Hmm, so this is interesting:
[10:25:35] jbs@bbn-hn:/opt/foam/log/tasks +$ ls -Flart expire-slivers-* | tail -rw-r--r-- 1 nginx nginx 131 Aug 9 15:00 expire-slivers-1344538801.955813.log -rw-r--r-- 1 nginx nginx 131 Aug 9 16:00 expire-slivers-1344542401.785265.log -rw-r--r-- 1 nginx nginx 131 Aug 9 17:00 expire-slivers-1344546001.735620.log -rw-r--r-- 1 nginx nginx 131 Aug 9 18:00 expire-slivers-1344549602.129684.log -rw-r--r-- 1 nginx nginx 131 Aug 9 19:00 expire-slivers-1344553201.299020.log -rw-r--r-- 1 nginx nginx 131 Aug 9 20:00 expire-slivers-1344556801.688796.log -rw-r--r-- 1 nginx nginx 131 Aug 9 21:00 expire-slivers-1344560401.808102.log -rw-r--r-- 1 nginx nginx 131 Aug 9 22:00 expire-slivers-1344564001.284641.log -rw-r--r-- 1 nginx nginx 131 Aug 9 23:00 expire-slivers-1344567601.695698.log -rw-r--r-- 1 nginx nginx 131 Aug 10 00:00 expire-slivers-1344571202.168120.log [10:25:37] jbs@bbn-hn:/opt/foam/log/tasks +$ cat expire-slivers-1344571202.168120.log ------------------------------ Version: 0.6.4 ------------------------------ '{"msg": "", "retcode": 0, "value": {"expired": []}}'
So indeed, expire-slivers hasn't run at all since the upgrade to 0.8. Why not?
Hmm, well, in the Ubuntu package, the cron job file is called /etc/cron.d/foam; in Victor's CentOS package, it's called /etc/cron.d/foam.d. Does that not work as a name? Let's try it:
sudo cp /etc/cron.d/foam.cron /etc/cron.d/foam
That should next trigger on
0 * * * * nginx [ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire
at 12:00, so let's see what happens.
comment:8 Changed 10 years ago by
Hmm, copying the cronjob file to /etc/cron.d/foam didn't change anything; and ah, duh, right there in /var/log/cron, it says it was already being run, and now it's being run twice:
[12:04:24] jbs@bbn-hn:/opt/foam/log/tasks +$ sudo egrep foam.+expire /var/log/cron | tail Sep 26 09:00:01 bbn-hn CROND[3955]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 10:00:01 bbn-hn CROND[10071]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 11:00:01 bbn-hn CROND[18965]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 12:00:01 bbn-hn CROND[24889]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 12:40:01 bbn-hn CROND[20672]: (nginx) CMD ([ -x /opt/foam/bin/expire-emails ] && /opt/foam/bin/expire-emails) Sep 26 13:00:01 bbn-hn CROND[1117]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 14:00:01 bbn-hn CROND[30243]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 15:00:01 bbn-hn CROND[7547]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 16:00:01 bbn-hn CROND[17018]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire) Sep 26 16:00:01 bbn-hn CROND[17024]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Ok, so let's get rid of the duplicate job:
sudo rm /etc/cron.d/foam
Now, what's not working here. Is it not executable? I think it is:
[12:06:07] jbs@bbn-hn:/opt/foam/log/tasks +$ ls -Fla /opt/foam/bin/expire -rwxr-xr-x 1 root root 302 Aug 3 11:19 /opt/foam/bin/expire*
Does nginx think so? Yep:
[12:07:56] jbs@bbn-hn:/opt/foam/log/tasks +$ sudo sudo -u nginx sh -c '[ -x /opt/foam/bin/expire ] && echo "yep"' yep
Well, let's actually run it:
[12:08:39] jbs@bbn-hn:/opt/foam/log/tasks +$ sudo sudo -u nginx sh -c '[ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire' Traceback (most recent call last): File "/opt/foam/bin/expire", line 7, in <module> import foam.task File "/opt/foam/lib/foam/task.py", line 25, in <module> from foam.core.configdb import ConfigDB File "/opt/foam/lib/foam/core/configdb.py", line 4, in <module> from sqlalchemy import (Table, Column, MetaData, ForeignKey, PickleType, String, ImportError: cannot import name event
Oho, so that's interesting. And we aren't seeing these errors because there's no alias for 'nginx' on the system, so mail to 'nginx' is landing locally:
[12:09:33] jbs@bbn-hn:/opt/foam/log/tasks +$ ls -Fla /var/mail/nginx -rw------- 1 nginx mail 1582340 Sep 26 12:00 /var/mail/nginx
I'll make a separate ticket for that, and talk to Nick about that exception.
comment:9 Changed 10 years ago by
Owner: | changed from jbs@bbn.com to vjo@cs.duke.edu |
---|---|
Status: | assigned → new |
Victor tracked down what's up with that exception: It's that expire can't find event in sqlalchemy, because Victor installs that as python-foamdep-sqlalchemy-0.7.5-exogeni6.el6.x86_64 and uses /opt/foam-deps as its base. So, we need to do one of a couple of things:
- Add a $PYTHONPATH modification to the cron job.
- Modify expire (and other FOAM things) to be able to find Python packages in non-standard places. Perhaps places defined by /etc/sysconfig/foam (Red Hat) and /etc/default/foam (Ubuntu)?
- Modify Victor's package to put its library in a place that FOAM is already looking (probably /opt/foam, since we want it to be specific to FOAM).
Victor's package already modifies the cron job from the stock one, so the first option is pleasantly simple, and good as a short-term fix. We should to Nick about which of the latter two seems better to him, and see if we agree. (I think Victor prefers the second, and we suspect that Nick would prefer the third.)
Victor's going to make the first change, and spin up a new RPM. (foam-0.8.2-exogeni2.el6.noarch presumably? Or is that not what that number after 'exogeni' is for? I've forgotten how RPM likes to specify "this is a revision to the RPM packaged version, not a change from the upstream code".)
comment:10 Changed 10 years ago by
Victor, two quick questions:
- We think this affects both expiration and expiration-notification (i.e. everything from cron), right?
- Do you have a timeframe for the fix?
Thanks!
comment:11 Changed 10 years ago by
Victor says:
The RPM has been re-built w/ the fix to the cron job. This was completed on the Monday before GEC. I need to push it out and test it, but I think this covers the problem. I just didn't want to push it out *during* GEC (for obvious reasons).
Ok, cool. Will you do it during the November ExoGENI outage?
comment:12 follow-up: 13 Changed 10 years ago by
Owner: | changed from vjo@cs.duke.edu to vjo@duke.edu |
---|---|
Status: | new → assigned |
Yes. This is complete.
FOAM is updated on both racks, and should be running at this time. Please feel free to validate the fix at your convenience.
comment:13 Changed 10 years ago by
Replying to vjo@duke.edu:
FOAM is updated on both racks, and should be running at this time. Please feel free to validate the fix at your convenience.
Cool, will do. Meanwhile, I noticed that https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=rci-hn.exogeni.net%3A3626 says that FOAM at rci-hn hasn't sent in monitoring data since Nov 11th; is that surprising? https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=bbn-hn.exogeni.gpolab.bbn.com%3A3626 says that bbn-hn has resumed normal reporting.
comment:14 Changed 10 years ago by
I've created a sliver due to expire at 2012-12-04 12:00 UTC, which should let us see the expiration-warning mail cron job, daily queue mail cron job, and actual expiration cron job, all in action. If those all work as expected, I'll close this out.
Meanwhile, I noticed that https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=rci-hn.exogeni.net%3A3626 says that FOAM at rci-hn hasn't sent in monitoring data since Nov 11th; is that surprising? https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=bbn-hn.exogeni.gpolab.bbn.com%3A3626 says that bbn-hn has resumed normal reporting.
This is also still true; if it's still not reporting tomorrow, I'll make a new ticket for that, since it's only tangentially related to this one.
comment:15 follow-up: 16 Changed 10 years ago by
Give a look at the reporting for RCI; that should (now) be re-enabled.
We'll reserve the right to disable it again, if it resumes the behavior of getting stuck and leaving oodles of python processes hanging out.
comment:16 Changed 10 years ago by
Replying to vjo@duke.edu:
Give a look at the reporting for RCI; that should (now) be re-enabled.
We'll reserve the right to disable it again, if it resumes the behavior of getting stuck and leaving oodles of python processes hanging out.
Yep, looks good, and sounds fine. (If that does happen, let us know.)
comment:17 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I've confirmed that after I created the sliver with a 2012-12-04 12:00 UTC expiration time, (a) I (experimenter) got an "expiring shortly" message on Monday afternoon; (b) We (admins) got a "pending queue" message on Monday night; (c) The sliver expired, and was deleted, and I (experimenter) and we (admins) got e-mail saying as much, on Tuesday.
So, I think this is all set.
Also just noticed that there is another foam sliver from another slice (EG-EXP-6-exp2) that no longer existing: