Opened 6 years ago

Last modified 5 years ago

#191 reopened

ExoGENI Compute Aggregates do not have any data for Slivers in GMOC

Reported by: lnevers@bbn.com
Owned by: somebody
Priority: major
Milestone:
Component: AM
Version: SPIRAL5
Keywords: confirmation tests
Cc:
Dependencies:

Description

When compute resources slivers exist, there is no data being shown for "Slivers" and "Resources" for both FIU (fiu-hn.exogeni.net:11443):

https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=fiu-hn.exogeni.net%3A11443

and for University of Houston (uh-hn.exogeni.net:11443):

https://gmoc-db.grnoc.iu.edu/protected-openid/index.pl?method=aggregate_details;aggregate=uh-hn.exogeni.net%3A11443

Other ExoGENI racks do report information for "Slivers" and "Resources". Also note that this data had still not shown up 2 hours after sliver creation.

Change History (19)

comment:1 Changed 6 years ago by chaos@bbn.com

Note: there is at least some bitrot since Friday here, i'm afraid. Victor and i confirmed that resource data was being reported from the fiu-hn and uh-hn aggregates on Friday, and now i don't see any resource data in the protected UI. So something is amiss here.

comment:2 Changed 6 years ago by ibaldin@renci.org

This is likely a blowhole configuration issue on control.exogeni.net

I believe each rack needs to be explicitly configured in the python plugin provided by the GPO to blowhole.

comment:3 Changed 6 years ago by chaos@bbn.com

Ilya: it's not quite that. That would explain lack of sliver reporting. But Luisa is right that resources are not reported either. The enumeration of vmserver node resources should be simply done by build_gmoc_xml.py on each head node, and shouldn't depend on control.exogeni.net at all.

comment:4 Changed 6 years ago by chaos@bbn.com

I added a nagios check for the number of resources reported by each aggregate, and it claims uh-hn ORCA is reporting 8 resources. Scan of downloaded nagios data corroborates:

>>> a = data['aggregate']['uh-hn.exogeni.net:11443']
>>> print "\n".join(["%s: %s" % (r.id, r.last_modified) for r in a.resources])
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w8: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w7: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w6: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w5: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w4: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w3: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w2: 2013-08-02 11:27:03
urn:publicid:IDN+exogeni.net:uhvmsite+vmserver+uh-w1: 2013-08-02 11:27:03

and likewise for fiu-hn:

>>> a = data['aggregate']['fiu-hn.exogeni.net:11443']
>>> print "\n".join(["%s: %s" % (r.id, r.last_modified) for r in a.resources])
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w5: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w4: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w7: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w6: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w1: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w3: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w2: 2013-08-02 11:27:04
urn:publicid:IDN+exogeni.net:fiuvmsite+vmserver+fiu-w8: 2013-08-02 11:27:04

Oh, ugh, sorry, this is hard to keep track of --- there's a very old gmoc-db web UI bug, wherein resources don't show up unless they are being reported as part of slivers. :>(

So: the resource thing is a red herring. The problem is that Luisa's fiu-hn and uh-hn slivers weren't reported to gmoc-db. So that seems like bitrot in the fix Victor, Sarah, and i put in place at GEC. Thoughts?

comment:5 Changed 6 years ago by sedwards@bbn.com

I ran blowhole by hand and I got some errors but it seems to have generated at least one manifest for a slice that belongs to Luisa.

The slice I see is: urn:publicid:IDN+ch.geni.net:ln-prj+slice+eg-gpo-ig-utah

The status is "ready" and the start time looks reasonable.

One thing that looks funny is the creator_urn which includes an email address:

creator_urn="lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"/>

Also here is an error from when blowhole started up:

Unable to contact NDL converter at http://bbn-hn.exogeni.net:14080/ndl-conversion/ due to org.apache.xmlrpc.client.XmlRpcHttpTransportException: HTTP server returned unexpected status: Not Found
12:26:08,911 DEBUG ManifestSubscriber - Invoking NDL converter ndlConverter.manifestToRSpec3 at http://geni.renci.org:12080/ndl-conversion/

So that leads me to a couple of questions:

  1. Is the BBN NDL converter down, causing it to fail over to the RENCI one? If so, maybe the two converters aren't running the same code and there is a subtle difference in how the creator_urn is determined?
  2. Perhaps the creator_urn value is causing the reporting script to fail in a way we aren't noticing?
  3. If it's neither of the above, perhaps when Victor has a free minute he could do what we did at the sprint: run the code by hand and see if things look ok.
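
As a sketch of question 2, a reporting script could guard against the malformed value by extracting only the token that actually looks like a URN. This is a hypothetical helper for illustration, not the actual GMOC reporting code; the sample value is the one quoted above.

```python
def extract_creator_urn(raw):
    """Return the first comma-separated token that looks like a GENI URN.

    The manifest was observed to carry values like
    "lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers",
    where an email address precedes the actual creator URN.
    """
    for token in (t.strip() for t in raw.split(",")):
        if token.startswith("urn:publicid:IDN+"):
            return token
    return None  # no URN found; the caller decides how to flag the record


raw = "lnevers@bbn.com, urn:publicid:IDN+ch.geni.net+user+lnevers"
print(extract_creator_urn(raw))  # urn:publicid:IDN+ch.geni.net+user+lnevers
```

If the reporting script instead passes the raw value through unchanged, a leading email address could plausibly break downstream validation without any visible error.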

comment:6 Changed 6 years ago by sedwards@bbn.com

Nick commented on the creator_urn issue: "Just a thought that this may be a result of the difference in the format from the GENI CH certs and the old gpolab PG certs (as an issue we also had with FOAM) in their SubjectAltName contents."

And indeed I have some manifests lying around from July, and at least one of them has a similar slice_creator URN. So I think this isn't the problem.

comment:7 Changed 6 years ago by sedwards@bbn.com

Ok. I think the right thing to do here is for Victor to look at this when he has a chance.

We think that the slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+tuptymon is a good one to use to investigate as it usually has a compute resource at each aggregate.

comment:8 Changed 6 years ago by lnevers@bbn.com

As requested, I just verified that there is no change in the behavior of this bug. The sliver information is still missing for both FIU and UH, but is present for other Aggregates.

comment:9 Changed 6 years ago by vjo@duke.edu

Coming back to this: It seems that fiu-sm *is* reporting slices to blowhole:

2013-08-06 02:41:00,420 [Smack Listener Processor (1)] DEBUG ManifestSubscriber - /orca/sm/fiu-sm---8CD0C505-B50A-4DF2-BF55-C880D4F7271C/urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-1---436a06ce-4758-44e1-9c73-5af1b17bd43d/manifest

So - it *looks* like the script may not be reporting back to GMOC.

That said - the reporting script we modified at GEC (to remove the check based on: if "urn:publicid:IDN+" not in manifest_filename:) remains as it was when we modified it.

So...the question we now have: what's not being updated, and why...I'll look into it.
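
For reference, the effect of the filename check that was removed at GEC can be sketched as follows. This is a hypothetical reconstruction for illustration, not the actual blowhole reporting script; the path shape is taken from the log line above.

```python
import os


def should_skip(manifest_filename):
    # The removed check: skip any manifest whose *filename* lacks a URN.
    # Blowhole writes manifests as ".../<slice URN>/manifest", so the URN
    # lives in the directory path, not in the filename itself -- meaning
    # this check would have skipped every sliver manifest.
    return "urn:publicid:IDN+" not in manifest_filename


# Path shape as seen in the blowhole log above (IDs shortened)
path = ("/orca/sm/fiu-sm---8CD0C505/"
        "urn:publicid:IDN+ch.geni.net:ln-prj+slice+EG-CT-1---436a06ce/manifest")
print(should_skip(os.path.basename(path)))  # True: the bare name "manifest" has no URN
```

That is consistent with removing the check being the right fix; the open question is what else in the pipeline has since stopped passing data through.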

comment:10 Changed 6 years ago by lnevers@bbn.com

Summary: FIU and Houston Compute Aggregates do not have any data for Slivers and Resources in GMOC → FIU, UFL and Houston Compute Aggregates do not have any data for Slivers and Resources in GMOC

This ticket's status remains unchanged. There is still no sliver information for the Houston and FIU racks. I additionally checked the UFL rack, and it also does not include sliver details.

comment:11 Changed 6 years ago by jonmills@renci.org

FIU, UFL, and UH sites all have a functional 'build_gmoc_xml.py' script, which I tested moments ago. They all have Chaos's tango monitoring packages installed, all are configured, and all have root crontab entries like so:

[root@fiu-hn ~]# crontab -l
*/1 * * * * /usr/bin/metric_foam
*/5 * * * * /usr/bin/report_data_to_gmoc

Both these scripts execute and return exit status 0 on all three racks named in the ticket. So if they aren't reporting, I'm going to need your assistance in figuring out why.

comment:12 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: new → closed

According to Chaos, the monitoring data for these sites is making it to the GMOC but not being displayed. Data can be downloaded from the GMOC, so it seems this is a GMOC GUI issue and not an ExoGENI issue. Closing ExoGENI ticket and opening a new ticket to track the GMOC GUI issue.

comment:13 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: closed → reopened

Re-opening this ticket, this is still an issue.

This ticket captures the fact that there is no "Sliver" and no "Resource" information for the compute aggregates (e.g. ufl-hn.exogeni.net:11443, uh-hn.exogeni.net:11443, etc.). Other types of data are being collected for the same aggregates.

Although the reporting scripts seem to be running without any problem, the "Sliver" and "Resource" information is not being reported to the GMOC.

comment:14 Changed 6 years ago by chaos@bbn.com

Sorry, i think i haven't been explaining this well (off-list) so i might as well actually comment on the ticket:

  • "Resource" information is being reported to GMOC
  • "Sliver" information is not being reported to GMOC from any ExoGENI racks except BBN.

comment:15 Changed 6 years ago by lnevers@bbn.com

Looking at the GMOC Compute Resources Aggregates for the ExoGENI sites, both the "Sliver" (#tab2) and the "Resources" (#tab3) tabs have no information. This is true for all compute aggregates (:11443) except GPO.

So it seems that we still have two problems:

  • Sliver information is not reported to GMOC by ExoGENI Compute Aggregates.
  • Aggregate information is reported to GMOC, but not displayed in GMOC GUI.

comment:16 Changed 6 years ago by lnevers@bbn.com

The GMOC GUI bug has been fixed; there are now resources in the ExoGENI compute aggregate details.

comment:17 Changed 6 years ago by lnevers@bbn.com

Summary: FIU, UFL and Houston Compute Aggregates do not have any data for Slivers and Resources in GMOC → FIU, UFL and Houston Compute Aggregates do not have any data for Slivers data in GMOC

Updating Summary to reflect current state.

comment:18 Changed 6 years ago by lnevers@bbn.com

Summary: FIU, UFL and Houston Compute Aggregates do not have any data for Slivers data in GMOC → FIU, UFL and Houston Compute Aggregates do not have any data for Slivers in GMOC

comment:19 Changed 5 years ago by lnevers@bbn.com

Summary: FIU, UFL and Houston Compute Aggregates do not have any data for Slivers in GMOC → ExoGENI Compute Aggregates do not have any data for Slivers in GMOC

Sites to be checked when ops monitoring data becomes available include: FIU, UFL, UH, OSF, UCD, and SL.
