Opened 10 years ago

Closed 9 years ago

#111 closed (fixed)

FOAM sliver not cleared at expiration time

Reported by: lnevers@bbn.com Owned by: vjo@duke.edu
Priority: major Milestone: EG-EXP-6
Component: AM Version: SPIRAL4
Keywords: foam Cc:
Dependencies:

Description

Expired FOAM slivers are not cleared at the ExoGENI FOAM. In a scenario where no sliver existed for my user in the ExoGENI rack, I created a new slice named "EG-EXP-6-exp1" and then tried to create a sliver at the ExoGENI FOAM.

The sliver create reported that the sliver already existed. The sliverstatus shows that the sliver expired 26 days ago:

$ omni.py -a of-exobbn sliverstatus EG-EXP-6-exp1 -V1
INFO:omni:Loading config file omni_config
INFO:omni:Using control framework pgeni
INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 expires on 2012-09-27 14:01:23 UTC
INFO:omni:Substituting AM nickname of-exobbn with URL https://bbn-hn.exogeni.net:3626/foam/gapi/1, URN unspecified_AM_URN
INFO:omni:Status of Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:
INFO:omni:Sliver status for Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 at AM URL https://bbn-hn.exogeni.net:3626/foam/gapi/1
INFO:omni:{
  "geni_status": "ready", 
  "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:21095873-0df9-4254-ba15-9af4e817d3a2", 
  "foam_pend_reason": "Request has underspecified VLAN requests", 
  "foam_expires": "2012-08-30 00:00:00+00:00", 
  "geni_resources": [
    {
      "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:21095873-0df9-4254-ba15-9af4e817d3a2", 
      "geni_error": "", 
      "geni_status": "ready"
    }
  ], 
  "foam_status": "Approved"
}
INFO:omni: ------------------------------------------------------------
INFO:omni: Completed sliverstatus:

Change History (17)

comment:1 Changed 10 years ago by lnevers@bbn.com

Also just noticed that there is another foam sliver from another slice (EG-EXP-6-exp2) that no longer existing:

  {
   "status": "Approved", 
   "sliver_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp2:4a1d909c-b0c7-45a7-b100-01e0b1d8bd3a", 
   "creation": "2012-08-14 22:05:36.797393+00:00", 
   "pend_reason": "Request has underspecified VLAN requests", 
   "expiration": "2012-08-22 00:00:00+00:00", 
   "deleted": "False", 
   "user": "urn:publicid:IDN+pgeni.gpolab.bbn.com+user+lnevers1", 
   "slice_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp2", 
   "enabled": true, 
   "email": "lnevers@bbn.com", 
   "flowvisor_slice": "4a1d909c-b0c7-45a7-b100-01e0b1d8bd3a", 
   "desc": "EG-EXP-5-scenario1 resources at BBN ExoGENI.", 
   "ref": null, 
   "id": 35, 
   "uuid": "4a1d909c-b0c7-45a7-b100-01e0b1d8bd3a"
  }, 

comment:2 Changed 10 years ago by jbs@bbn.com

Owner: changed from somebody to jbs@bbn.com
Status: newassigned

I'm taking a look at this now, since we're the FOAM admins for bbn-hn.

At a glance, it looks like the cron jobs are there:

[09:40:18] jbs@bbn-hn:/home/jbs
+$ cat  /etc/cron.d/foam.cron 
#
# Regular cron jobs for the foam package
#
0 *     * * *   nginx [ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire
40 0,6,12,18 * * *  nginx [ -x /opt/foam/bin/expire-emails ] && /opt/foam/bin/expire-emails
0 3 * * * nginx [ -x /opt/foam/bin/daily-queue ] && /opt/foam/bin/daily-queue

I'll check the logs to see if there's anything interesting there.

comment:3 Changed 10 years ago by jbs@bbn.com

Luisa, is this also happening on rci-hn? I can check if you're not sure.

comment:4 Changed 10 years ago by lnevers@bbn.com

Josh, same problem at RENCI:

lnevers@arendia:~/gcf-2.0$ omni.py -a of-exorci sliverstatus EG-EXP-6-exp1 -V1
INFO:omni:Loading config file omni_config
INFO:omni:Using control framework pgeni
INFO:omni:Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 expires on 2012-09-27 14:01:23 UTC
INFO:omni:Substituting AM nickname of-exorci with URL https://rci-hn.exogeni.net:3626/foam/gapi/1, URN unspecified_AM_URN
INFO:omni:Status of Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:
INFO:omni:Sliver status for Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1 at AM URL https://rci-hn.exogeni.net:3626/foam/gapi/1
INFO:omni:{
  "geni_status": "ready", 
  "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:aa5e1141-a4b6-4e02-ab01-75cf84d74382", 
  "foam_pend_reason": "Request has underspecified VLAN requests", 
  "foam_expires": "2012-08-30 00:00:00+00:00", 
  "geni_resources": [
    {
      "geni_urn": "urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+EG-EXP-6-exp1:aa5e1141-a4b6-4e02-ab01-75cf84d74382", 
      "geni_error": "", 
      "geni_status": "ready"
    }
  ], 
  "foam_status": "Approved"
}

This type of problem has also happened at the InstaGENI rack. Are all GENI racks running the same FOAM and FlowVisor? versions?

comment:5 in reply to:  4 ; Changed 10 years ago by jbs@bbn.com

Replying to lnevers@bbn.com:

Josh, same problem at RENCI:

Ok, good to know.

This type of problem has also happened at the InstaGENI rack. Are all GENI racks running the same FOAM and FlowVisor? versions?

Similar, but not identical, no.

comment:6 in reply to:  5 Changed 10 years ago by lnevers@bbn.com

Replying to jbs@bbn.com:

Similar, but not identical, no.

From an experimenter perspective, I have had 5 instances were my slices were expired but the OF slivers still existed. Looks the same from the outside, some cleanup is not taking place.

Instances have been seen at ExoGENI BBN, ExoGENI RCI, InstaGENI Utah. I believe, this also occurred with a OF sliver at Rutgers, but was not able to get the information to support this.

comment:7 Changed 10 years ago by jbs@bbn.com

Hmm, so this is interesting:

[10:25:35] jbs@bbn-hn:/opt/foam/log/tasks
+$ ls -Flart expire-slivers-* | tail
-rw-r--r-- 1 nginx nginx 131 Aug  9 15:00 expire-slivers-1344538801.955813.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 16:00 expire-slivers-1344542401.785265.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 17:00 expire-slivers-1344546001.735620.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 18:00 expire-slivers-1344549602.129684.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 19:00 expire-slivers-1344553201.299020.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 20:00 expire-slivers-1344556801.688796.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 21:00 expire-slivers-1344560401.808102.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 22:00 expire-slivers-1344564001.284641.log
-rw-r--r-- 1 nginx nginx 131 Aug  9 23:00 expire-slivers-1344567601.695698.log
-rw-r--r-- 1 nginx nginx 131 Aug 10 00:00 expire-slivers-1344571202.168120.log

[10:25:37] jbs@bbn-hn:/opt/foam/log/tasks
+$ cat expire-slivers-1344571202.168120.log
------------------------------
Version: 0.6.4
------------------------------
'{"msg": "", "retcode": 0, "value": {"expired": []}}'

So indeed, expire-slivers hasn't run at all since the upgrade to 0.8. Why not?

Hmm, well, in the Ubuntu package, the cron job file is called /etc/cron.d/foam; in Victor's CentOS package, it's called /etc/cron.d/foam.d. Does that not work as a name? Let's try it:

sudo cp /etc/cron.d/foam.cron /etc/cron.d/foam

That should next trigger on

0 *     * * *   nginx [ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire

at 12:00, so let's see what happens.

comment:8 Changed 10 years ago by jbs@bbn.com

Hmm, copying the cronjob file to /etc/cron.d/foam didn't change anything; and ah, duh, right there in /var/log/cron, it says it was already being run, and now it's being run twice:

[12:04:24] jbs@bbn-hn:/opt/foam/log/tasks
+$ sudo egrep foam.+expire /var/log/cron | tail
Sep 26 09:00:01 bbn-hn CROND[3955]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 10:00:01 bbn-hn CROND[10071]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 11:00:01 bbn-hn CROND[18965]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 12:00:01 bbn-hn CROND[24889]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 12:40:01 bbn-hn CROND[20672]: (nginx) CMD ([ -x /opt/foam/bin/expire-emails ] && /opt/foam/bin/expire-emails)
Sep 26 13:00:01 bbn-hn CROND[1117]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 14:00:01 bbn-hn CROND[30243]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 15:00:01 bbn-hn CROND[7547]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 16:00:01 bbn-hn CROND[17018]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)
Sep 26 16:00:01 bbn-hn CROND[17024]: (nginx) CMD ([ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire)

Ok, so let's get rid of the duplicate job:

sudo rm /etc/cron.d/foam

Now, what's not working here. Is it not executable? I think it is:

[12:06:07] jbs@bbn-hn:/opt/foam/log/tasks
+$ ls -Fla /opt/foam/bin/expire
-rwxr-xr-x 1 root root 302 Aug  3 11:19 /opt/foam/bin/expire*

Does nginx think so? Yep:

[12:07:56] jbs@bbn-hn:/opt/foam/log/tasks
+$ sudo sudo -u nginx sh -c '[ -x /opt/foam/bin/expire ] && echo "yep"'
yep

Well, let's actually run it:

[12:08:39] jbs@bbn-hn:/opt/foam/log/tasks
+$ sudo sudo -u nginx sh -c '[ -x /opt/foam/bin/expire ] && /opt/foam/bin/expire'
Traceback (most recent call last):
  File "/opt/foam/bin/expire", line 7, in <module>
    import foam.task
  File "/opt/foam/lib/foam/task.py", line 25, in <module>
    from foam.core.configdb import ConfigDB
  File "/opt/foam/lib/foam/core/configdb.py", line 4, in <module>
    from sqlalchemy import (Table, Column, MetaData, ForeignKey, PickleType, String,
ImportError: cannot import name event

Oho, so that's interesting. And we aren't seeing these errors because there's no alias for 'nginx' on the system, so mail to 'nginx' is landing locally:

[12:09:33] jbs@bbn-hn:/opt/foam/log/tasks
+$ ls -Fla /var/mail/nginx
-rw------- 1 nginx mail 1582340 Sep 26 12:00 /var/mail/nginx

I'll make a separate ticket for that, and talk to Nick about that exception.

comment:9 Changed 10 years ago by jbs@bbn.com

Owner: changed from jbs@bbn.com to vjo@cs.duke.edu
Status: assignednew

Victor tracked down what's up with that exception: It's that expire can't find event in sqlalchemy, because Victor installs that as python-foamdep-sqlalchemy-0.7.5-exogeni6.el6.x86_64 and uses /opt/foam-deps as its base. So, we need to do one of a couple of things:

  • Add a $PYTHONPATH modification to the cron job.
  • Modify expire (and other FOAM things) to be able to find Python packages in non-standard places. Perhaps places defined by /etc/sysconfig/foam (Red Hat) and /etc/default/foam (Ubuntu)?
  • Modify Victor's package to put its library in a place that FOAM is already looking (probably /opt/foam, since we want it to be specific to FOAM).

Victor's package already modifies the cron job from the stock one, so the first option is pleasantly simple, and good as a short-term fix. We should to Nick about which of the latter two seems better to him, and see if we agree. (I think Victor prefers the second, and we suspect that Nick would prefer the third.)

Victor's going to make the first change, and spin up a new RPM. (foam-0.8.2-exogeni2.el6.noarch presumably? Or is that not what that number after 'exogeni' is for? I've forgotten how RPM likes to specify "this is a revision to the RPM packaged version, not a change from the upstream code".)

comment:10 Changed 10 years ago by jbs@bbn.com

Victor, two quick questions:

  • We think this affects both expiration and expiration-notification (i.e. everything from cron), right?
  • Do you have a timeframe for the fix?

Thanks!

comment:11 Changed 10 years ago by jbs@bbn.com

Victor says:

The RPM has been re-built w/ the fix to the cron job. 
This was completed on the Monday before GEC. 
 
I need to push it out and test it, but I think this covers the problem. 
I just didn't want to push it out *during* GEC (for obvious reasons). 

Ok, cool. Will you do it during the November ExoGENI outage?

comment:12 Changed 9 years ago by vjo@duke.edu

Owner: changed from vjo@cs.duke.edu to vjo@duke.edu
Status: newassigned

Yes. This is complete.

FOAM is updated on both racks, and should be running at this time. Please feel free to validate the fix at your convenience.

comment:13 in reply to:  12 Changed 9 years ago by jbs@bbn.com

Replying to vjo@duke.edu:

FOAM is updated on both racks, and should be running at this time. Please feel free to validate the fix at your convenience.

Cool, will do. Meanwhile, I noticed that https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=rci-hn.exogeni.net%3A3626 says that FOAM at rci-hn hasn't sent in monitoring data since Nov 11th; is that surprising? https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=bbn-hn.exogeni.gpolab.bbn.com%3A3626 says that bbn-hn has resumed normal reporting.

comment:14 Changed 9 years ago by jbs@bbn.com

I've created a sliver due to expire at 2012-12-04 12:00 UTC, which should let us see the expiration-warning mail cron job, daily queue mail cron job, and actual expiration cron job, all in action. If those all work as expected, I'll close this out.

Meanwhile, I noticed that https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=rci-hn.exogeni.net%3A3626 says that FOAM at rci-hn hasn't sent in monitoring data since Nov 11th; is that surprising? https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=bbn-hn.exogeni.gpolab.bbn.com%3A3626 says that bbn-hn has resumed normal reporting.

This is also still true; if it's still not reporting tomorrow, I'll make a new ticket for that, since it's only tangentially related to this one.

comment:15 Changed 9 years ago by vjo@duke.edu

Give a look at the reporting for RCI; that should (now) be re-enabled.

We'll reserve the right to disable it again, if it resumes the behavior of getting stuck and leaving oodles of python processes hanging out.

comment:16 in reply to:  15 Changed 9 years ago by jbs@bbn.com

Replying to vjo@duke.edu:

Give a look at the reporting for RCI; that should (now) be re-enabled.

We'll reserve the right to disable it again, if it resumes the behavior of getting stuck and leaving oodles of python processes hanging out.

Yep, looks good, and sounds fine. (If that does happen, let us know.)

comment:17 Changed 9 years ago by jbs@bbn.com

Resolution: fixed
Status: assignedclosed

I've confirmed that after I created the sliver with a 2012-12-04 12:00 UTC expiration time, (a) I (experimenter) got an "expiring shortly" message on Monday afternoon; (b) We (admins) got a "pending queue" message on Monday night; (c) The sliver expired, and was deleted, and I (experimenter) and we (admins) got e-mail saying as much, on Tuesday.

So, I think this is all set.

Note: See TracTickets for help on using tickets.