Opened 12 years ago
Closed 11 years ago
#52 closed (fixed)
Failing to create a four-node InstaGENI sliver during GEMINI testing
| Reported by: | johren@bbn.com | Owned by: | somebody |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | AM | Version: | SPIRAL4 |
| Keywords: | | Cc: | |
| Dependencies: | | | |
Description
I have been trying to create a four-node slice for testing GEMINI. This slice contains three VMs in a mesh network configuration and one stand-alone VM. When I try to create this sliver, I am getting at least one node failure most of the time. This morning I have tried at least 12 slivers so far and all but two have had at least one node fail. The sliverstatus output does not give any geni_error indication as to what caused the failure.
{'geni_error': '',
 'geni_status': 'failed',
 'geni_urn': 'urn:publicid:IDN+utah.geniracks.net+sliver+10467',
 'pg_manifest': {'attributes': {'client_id': 'PCC',
                                'component_id': 'urn:publicid:IDN+utah.geniracks.net+node+pc1',
                                'component_manager_id': 'urn:publicid:IDN+utah.geniracks.net+authority+cm',
                                'exclusive': 'false',
                                'sliver_id': 'urn:publicid:IDN+utah.geniracks.net+sliver+10467',
                                'xmlns': 'http://www.geni.net/resources/rspec/3',
                                'xmlns:gemini': 'http://geni.net/resources/rspec/ext/gemini/1'},
...
I have left slice johGEM1209261119 up in case someone wants to take a look at one of the failures.
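For anyone triaging similar failures, here is a minimal Python sketch that picks the failed nodes out of a sliverstatus result. It assumes the excerpt above is one entry of the `geni_resources` list that GENI AM API v2 SliverStatus returns; the helper name `failed_resources`, the top-level slice URN, and the second (ready) node in the sample data are illustrative and not taken from this ticket.

```python
def failed_resources(sliver_status):
    """Return (client_id, sliver_urn, error) for every failed resource.

    Assumes the AM API v2 layout: a top-level dict whose 'geni_resources'
    list holds one dict per sliver, shaped like the excerpt in this ticket.
    """
    failures = []
    for resource in sliver_status.get("geni_resources", []):
        if resource.get("geni_status") != "failed":
            continue
        attributes = resource.get("pg_manifest", {}).get("attributes", {})
        failures.append((
            attributes.get("client_id", "<unknown>"),
            resource.get("geni_urn", "<unknown>"),
            # An empty geni_error is exactly the case reported here.
            resource.get("geni_error") or "<no error reported>",
        ))
    return failures


# Sample data: the first resource mirrors the excerpt above; the wrapper
# and the second resource are hypothetical, added only for illustration.
example_status = {
    "geni_urn": "urn:publicid:IDN+example+slice+hypothetical",
    "geni_status": "failed",
    "geni_resources": [
        {
            "geni_error": "",
            "geni_status": "failed",
            "geni_urn": "urn:publicid:IDN+utah.geniracks.net+sliver+10467",
            "pg_manifest": {"attributes": {"client_id": "PCC"}},
        },
        {
            "geni_error": "",
            "geni_status": "ready",
            "geni_urn": "urn:publicid:IDN+example+sliver+hypothetical",
            "pg_manifest": {"attributes": {"client_id": "hypothetical-node"}},
        },
    ],
}

for client_id, urn, error in failed_resources(example_status):
    print(client_id, urn, error)
```

With an empty `geni_error`, as in this report, the sketch can only say which node failed, not why.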
Change History (3)
comment:1 Changed 12 years ago by
comment:2 Changed 12 years ago by
On 9/27/12 9:37 AM, Leigh Stoller wrote:
Well, this has been an interesting thought exercise. There are two different problems, both of which occur when the machine is very busy creating lots of VMs.
The first is one I have seen before: when creating the accounts inside the VM, the kernel says the files are not writable, and so it fails. Later, no problem.
The second is where I spent most of my time: sometimes the signals sent to kill a running container are ignored. I have reconstructed the sequence of events pretty carefully, and I cannot come up with a failure mode that would explain it. I can easily chalk this up to some kind of kernel problem that appears when it is really busy creating containers, but I have no hard evidence of that. But I have a workaround I am going to install that will hopefully get us by the problem for now.
I will get this change installed today and let you know. In the meantime you can terminate your slices that are broken, and not bother to report any more problems until I get the new stuff installed.
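The ticket does not describe the workaround that was installed, so the following is not that change. It is only a generic Python sketch of one common way to cope with a termination signal that appears to be ignored: send the polite request, wait a bounded time, then escalate. It targets an ordinary POSIX child process rather than a container, and the names `stop_with_escalation` and `spawn_stubborn_child` are hypothetical.

```python
import subprocess
import sys


def stop_with_escalation(proc, grace_seconds=2.0):
    """Ask a child process to exit, then force it if the request is ignored.

    Returns the name of the signal that finally stopped the process.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored by the process
        proc.wait()
        return "SIGKILL"


def spawn_stubborn_child():
    """Start a child that ignores SIGTERM, standing in for a stuck target."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    return proc


if __name__ == "__main__":
    child = spawn_stubborn_child()
    print(stop_with_escalation(child, grace_seconds=0.5))
    child.stdout.close()
```

The bounded wait is the important part of the design: the caller never assumes a signal took effect, it checks and then escalates.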
comment:3 Changed 11 years ago by
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |