Opened 12 years ago

Closed 11 years ago

#52 closed (fixed)

Failing to create a four-node InstaGENI sliver during GEMINI testing

Reported by: johren@bbn.com
Owned by: somebody
Priority: major
Milestone:
Component: AM
Version: SPIRAL4
Keywords:
Cc:
Dependencies:

Description

I have been trying to create a four-node slice for testing GEMINI. This slice contains three VMs in a mesh network configuration and one stand-alone VM. When I try to create this sliver, I am getting at least one node failure most of the time. This morning I have tried at least 12 slivers so far and all but two have had at least one node fail. The sliverstatus output does not give any geni_error indication as to what caused the failure.

{'geni_error': '',
 'geni_status': 'failed',
 'geni_urn': 'urn:publicid:IDN+utah.geniracks.net+sliver+10467',
 'pg_manifest': {'attributes': {'client_id': 'PCC',
                                'component_id': 'urn:publicid:IDN+utah.geniracks.net+node+pc1',
                                'component_manager_id': 'urn:publicid:IDN+utah.geniracks.net+authority+cm',
                                'exclusive': 'false',
                                'sliver_id': 'urn:publicid:IDN+utah.geniracks.net+sliver+10467',
                                'xmlns': 'http://www.geni.net/resources/rspec/3',
                                'xmlns:gemini': 'http://geni.net/resources/rspec/ext/gemini/1'},
 ...

I have left slice johGEM1209261119 up in case someone wants to take a look at one of the failures.
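
For anyone triaging similar failures, here is a minimal sketch of how the sliverstatus result could be scanned for resources that report 'failed' with an empty geni_error, which is the situation described above. It assumes the per-resource entries sit under a top-level 'geni_resources' list, as in the GENI AM API sliverstatus return; the helper name find_silent_failures and the second sample entry are hypothetical and only for illustration.

```python
def find_silent_failures(status):
    """Return URNs of resources that failed without any geni_error text.

    Assumes `status` is the dict returned by sliverstatus, with the
    per-resource entries under 'geni_resources' (an assumption here).
    """
    silent = []
    for res in status.get("geni_resources", []):
        failed = res.get("geni_status") == "failed"
        # Treat a missing, None, or whitespace-only geni_error as "no error given".
        has_error = bool(str(res.get("geni_error") or "").strip())
        if failed and not has_error:
            silent.append(res.get("geni_urn"))
    return silent


# Illustrative input: the first entry mirrors the failed sliver shown above,
# the second is a made-up healthy sliver for contrast.
sample = {
    "geni_resources": [
        {
            "geni_error": "",
            "geni_status": "failed",
            "geni_urn": "urn:publicid:IDN+utah.geniracks.net+sliver+10467",
        },
        {
            "geni_error": "",
            "geni_status": "ready",
            "geni_urn": "urn:publicid:IDN+example.net+sliver+1",  # hypothetical
        },
    ]
}

silent = find_silent_failures(sample)
print(silent)
```

Run against a real sliverstatus result, a non-empty list would point at the slivers that need to be inspected on the rack itself, since the status output alone gives no cause.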

Change History (3)

comment:1 Changed 12 years ago by johren@bbn.com

Well, this has been an interesting thought exercise. There are two different problems, both of which occur when the machine is very busy creating lots of VMs.

The first is one I have seen before: when creating the accounts inside the VM, the kernel says the files are not writable, and so it fails. Later, no problem.

The second is where I spent most of my time: sometimes the signals sent to kill a running container are ignored. I have reconstructed the sequence of events pretty carefully, and I cannot come up with a failure mode that would explain it. I can easily chalk this up to some kind of kernel problem that appears when it is really busy creating containers, but I have no hard evidence of that. But I have a workaround I am going to install that will hopefully get us past the problem for now.

I will get this change installed today and let you know. In the meantime you can terminate your slices that are broken, and not bother to report any more problems until I get the new stuff installed.

comment:2 Changed 12 years ago by johren@bbn.com

On 9/27/12 9:37 AM, Leigh Stoller wrote:

Well, this has been an interesting thought exercise. There are two different problems, both of which occur when the machine is very busy creating lots of VMs.

The first is one I have seen before: when creating the accounts inside the VM, the kernel says the files are not writable, and so it fails. Later, no problem.

The second is where I spent most of my time: sometimes the signals sent to kill a running container are ignored. I have reconstructed the sequence of events pretty carefully, and I cannot come up with a failure mode that would explain it. I can easily chalk this up to some kind of kernel problem that appears when it is really busy creating containers, but I have no hard evidence of that. But I have a workaround I am going to install that will hopefully get us past the problem for now.

I will get this change installed today and let you know. In the meantime you can terminate your slices that are broken, and not bother to report any more problems until I get the new stuff installed.

comment:3 Changed 11 years ago by johren@bbn.com

Resolution: fixed
Status: new → closed