Opened 12 years ago
Last modified 12 years ago
#33 closed
Failure to allocate resource while attempting to create sliver — at Version 6
Reported by: | lnevers@bbn.com | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | IG-EXP-3 |
Component: | AM | Version: | SPIRAL4 |
Keywords: | vm support | Cc: | |
Dependencies: |
Description
Background: listresources showed a large number of pcvm slots available before the test:
- pc5 had 97 slots
- pc3 had 100 slots
Test sequence:
- Created 1 sliver named 25vmslice1 with 25 VMs without problems, with the following allocation: 10 VMs on pc5, 10 VMs on pc3 and 5 VMs on pc4.
- Created a second sliver named 25vmslice2 with 25 VMs which caused the following error:
Result Summary: Slice urn:publicid:IDN+pgeni.gpolab.bbn.com+slice+25vmslice2 expires on 2012-05-26 14:05:32 UTC
Asked https://boss.utah.geniracks.net/protogeni/xmlrpc/am/2.0 to reserve resources. No manifest Rspec returned.
*** ERROR: mapper: Reached run limit. Giving up.
seed = 1338050345
Physical Graph: 6
Calculating shortest paths on switch fabric.
Virtual Graph: 25
Generating physical equivalence classes:6
Type precheck:
Type precheck passed.
Node mapping precheck:
Node mapping precheck succeeded
Policy precheck:
Policy precheck succeeded
Annealing.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 18.71 in 49000 iters and 0.307011 seconds
With 1 violations
Iters to find best score: 48288
Violations: 1
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 1
desires: 0
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
Nodes:
VM pc5
VM-0 pc5
VM-1 pc1
VM-10 pc5
VM-11 pc2
VM-12 pc1
VM-13 pc2
VM-14 pc3
VM-15 pc3
VM-16 pc3
VM-19 pc2
VM-2 pc1
VM-20 pc5
VM-21 pc5
VM-22 pc1
VM-23 pc2
VM-24 pc3
VM-26 pc3
VM-27 pc3
VM-3 pc2
VM-4 pc5
VM-5 pc3
VM-6 pc3
VM-7 pc3
VM-9 pc5
End Nodes
Edges:
linksimple/lan0/VM:0,VM-0:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan1/VM-0:1,VM-1:0 intraswitch link-pc5:eth2-procurve2:(null) (pc5/eth2,(null)) link-pc1:eth1-procurve2:(null) (pc1/eth1,(null))
linksimple/lan2/VM-1:1,VM-2:0 trivial pc1:loopback (pc1/null,(null)) pc1:loopback (pc1/null,(null))
linksimple/lan24/VM-9:0,VM-20:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan25/VM:1,VM-9:1 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan26/VM-9:2,VM-10:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan27/VM-10:1,VM-11:0 intraswitch link-pc5:eth2-procurve2:(null) (pc5/eth2,(null)) link-pc2:eth1-procurve2:(null) (pc2/eth1,(null))
linksimple/lan28/VM-11:1,VM-12:0 intraswitch link-pc2:eth1-procurve2:(null) (pc2/eth1,(null)) link-pc1:eth1-procurve2:(null) (pc1/eth1,(null))
linksimple/lan29/VM-12:1,VM-13:0 intraswitch link-pc1:eth1-procurve2:(null) (pc1/eth1,(null)) link-pc2:eth1-procurve2:(null) (pc2/eth1,(null))
linksimple/lan3/VM-2:1,VM-3:0 intraswitch link-pc1:eth1-procurve2:(null) (pc1/eth1,(null)) link-pc2:eth1-procurve2:(null) (pc2/eth1,(null))
linksimple/lan30/VM-13:1,VM-14:0 intraswitch link-pc2:eth1-procurve2:(null) (pc2/eth1,(null)) link-pc3:eth2-procurve2:(null) (pc3/eth2,(null))
linksimple/lan31/VM-14:1,VM-15:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan32/VM-15:1,VM-16:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan4/VM-3:1,VM-4:0 intraswitch link-pc2:eth1-procurve2:(null) (pc2/eth1,(null)) link-pc5:eth2-procurve2:(null) (pc5/eth2,(null))
linksimple/lan5/VM-4:1,VM-5:0 intraswitch link-pc5:eth2-procurve2:(null) (pc5/eth2,(null)) link-pc3:eth2-procurve2:(null) (pc3/eth2,(null))
linksimple/lan54/VM-16:1,VM-26:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan55/VM-27:1,VM-26:1 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan58/VM-24:0,VM-27:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan59/VM-23:0,VM-24:1 intraswitch link-pc2:eth1-procurve2:(null) (pc2/eth1,(null)) link-pc3:eth2-procurve2:(null) (pc3/eth2,(null))
linksimple/lan6/VM-5:1,VM-6:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan65/VM-20:1,VM-21:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan66/VM-21:1,VM-19:0 intraswitch link-pc5:eth2-procurve2:(null) (pc5/eth2,(null)) link-pc2:eth1-procurve2:(null) (pc2/eth1,(null))
linksimple/lan67/VM-19:1,VM-22:0 intraswitch link-pc2:eth1-procurve2:(null) (pc2/eth1,(null)) link-pc1:eth1-procurve2:(null) (pc1/eth1,(null))
linksimple/lan68/VM-22:1,VM-23:1 intraswitch link-pc1:eth1-procurve2:(null) (pc1/eth1,(null)) link-pc2:eth1-procurve2:(null) (pc2/eth1,(null))
linksimple/lan7/VM-6:1,VM-7:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan72/VM-6:2,VM-16:2 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan73/VM-5:2,VM-15:2 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan74/VM-4:2,VM-14:2 intraswitch link-pc5:eth2-procurve2:(null) (pc5/eth2,(null)) link-pc3:eth2-procurve2:(null) (pc3/eth2,(null))
linksimple/lan75/VM-3:2,VM-13:2 trivial pc2:loopback (pc2/null,(null)) pc2:loopback (pc2/null,(null))
linksimple/lan76/VM-2:2,VM-12:2 trivial pc1:loopback (pc1/null,(null)) pc1:loopback (pc1/null,(null))
linksimple/lan77/VM-1:2,VM-11:2 intraswitch link-pc1:eth1-procurve2:(null) (pc1/eth1,(null)) link-pc2:eth3-procurve2:(null) (pc2/eth3,(null))
linksimple/lan78/VM-0:2,VM-10:2 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan79/VM-10:3,VM-21:2 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linksimple/lan80/VM-11:3,VM-19:2 trivial pc2:loopback (pc2/null,(null)) pc2:loopback (pc2/null,(null))
linksimple/lan81/VM-12:3,VM-22:2 trivial pc1:loopback (pc1/null,(null)) pc1:loopback (pc1/null,(null))
linksimple/lan82/VM-13:3,VM-23:2 trivial pc2:loopback (pc2/null,(null)) pc2:loopback (pc2/null,(null))
linksimple/lan83/VM-14:3,VM-24:2 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linksimple/lan84/VM-15:3,VM-27:2 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
End Edges
End solution
Summary:
procurve2 0 vnodes, 2800000 nontrivial BW, 0 trivial BW, type=(null)
pc3 9 vnodes, 400000 nontrivial BW, 1100000 trivial BW, type=pcvm
400000 link-pc3:eth2-procurve2:(null)
pc5 7 vnodes, 600000 nontrivial BW, 700000 trivial BW, type=pcvm
600000 link-pc5:eth2-procurve2:(null)
pc1 4 vnodes, 700000 nontrivial BW, 300000 trivial BW, type=pcvm
700000 link-pc1:eth1-procurve2:(null)
?+virtpercent: used=0 total=100
?+cpu: used=0 total=2666
?+ram: used=0 total=3574
?+cpupercent: used=0 total=92
?+rampercent: used=0 total=80
pc2 5 vnodes, 1100000 nontrivial BW, 300000 trivial BW, type=pcvm
1000000 link-pc2:eth1-procurve2:(null)
100000 link-pc2:eth3-procurve2:(null)
?+virtpercent: used=0 total=100
?+cpu: used=0 total=2666
?+ram: used=0 total=3574
?+cpupercent: used=0 total=92
?+rampercent: used=0 total=80
Total physical nodes used: 4
End summary
ASSIGN FAILED:
Type precheck passed.
Node mapping precheck succeeded
Policy precheck succeeded
Annealing.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 18.71 in 49000 iters and 0.307011 seconds
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 1
desires: 0
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
Change History (6)
comment:1 Changed 12 years ago by
comment:2 Changed 12 years ago by
On 5/25/12 11:40 AM, Leigh Stoller wrote:
bandwidth: 1
So here is the issue: creating a 10-node mesh of 100Mb links requires an aggregate bandwidth of 1Gb. Not because you are actually going to use that, but because the resource mapper cannot make any assumptions in the absence of other information.
This is why you get to provide a bandwidth in your rspec, to inform the mapper what you really want to do.
Diving deeper for those who are interested: this is a LAN of containers on the same physical node, and there is some limit to the amount of traffic that can be sent over the loopback device between containers. At some point the physical node will no longer be able to keep up, so we set a limit on what you can ask for. At the moment that number is set lower than it probably should be (at 400Mb).
Bottom line: I bumped that to 1Gb, which should allow your rspec to map. These nodes are pretty beefy, so I imagine they can keep up.
Lbs
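The arithmetic in the explanation above can be checked with a small sketch. This is not mapper code: the function names are hypothetical, and the 100Mb per-link default is the value implied by the comment when the rspec gives no bandwidth.

```python
# Illustrative only: names are hypothetical, not part of the resource mapper.
DEFAULT_LINK_MBPS = 100   # per-link bandwidth assumed when the rspec specifies none
OLD_LOOPBACK_CAP_MBPS = 400    # per-node loopback limit quoted in the comment
NEW_LOOPBACK_CAP_MBPS = 1000   # limit after the bump to 1Gb


def loopback_demand_mbps(num_links: int, link_mbps: int = DEFAULT_LINK_MBPS) -> int:
    """Aggregate bandwidth requested over one node's loopback device."""
    return num_links * link_mbps


def fits(num_links: int, cap_mbps: int, link_mbps: int = DEFAULT_LINK_MBPS) -> bool:
    """True if the requested aggregate bandwidth stays within the cap."""
    return loopback_demand_mbps(num_links, link_mbps) <= cap_mbps
```

With 10 links at the 100Mb default, the demand is 1000Mb, which exceeds the old 400Mb cap and just fits the new 1Gb cap. Requesting a lower per-link bandwidth in the rspec (for example 40Mb) would also have fit under the old cap.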
The message reported to the experimenter is somewhat cryptic and I missed the bandwidth violation which was on line 25 out of 145 lines (not including the omni output).
I realize we are still developing/debugging, but I am going to ask anyway: are there plans to modify the results to provide more intuitive output?
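Until the output is improved, an experimenter could pull the non-zero violation counters out of the mapper text on the client side. The following is a minimal sketch, assuming the counter names are exactly those that appear in the output pasted in this ticket; the function name is hypothetical and this is not part of omni or the mapper.

```python
import re

# Counter names as they appear in the mapper output quoted in this ticket.
COUNTERS = [
    "unassigned", "pnode_load", "no_connect", "link_users", "bandwidth",
    "desires", "vclass", "delay", "trivial mix", "subnodes", "max_types",
    "endpoints",
]


def nonzero_violations(mapper_output: str) -> dict:
    """Return {counter_name: count} for every counter whose value is > 0."""
    found = {}
    for name in COUNTERS:
        # First occurrence is enough: the output repeats the same counters
        # after "ASSIGN FAILED:".
        match = re.search(rf"\b{re.escape(name)}:\s*(\d+)", mapper_output)
        if match and int(match.group(1)) > 0:
            found[name] = int(match.group(1))
    return found
```

Run against the first failure in this ticket, this would report only the bandwidth counter; against the later failures, only the desires counter.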
comment:3 Changed 12 years ago by
Re-ran the 10 experiments with 10 VMs each, assuming that the configuration changes from last Friday would handle the bandwidth requirements. Before starting, verified that both shared nodes had 99 slots available.
Set up the first 8 experiments without problems. The createsliver for the 9th experiment (10vmslice9) failed with this error:
Asked https://boss.utah.geniracks.net/protogeni/xmlrpc/am/2.0 to reserve resources. No manifest Rspec returned.
*** ERROR: mapper: Reached run limit. Giving up.
seed = 1338437282
Physical Graph: 4
Calculating shortest paths on switch fabric.
Virtual Graph: 11
Generating physical equivalence classes:4
Type precheck:
Type precheck passed.
Node mapping precheck:
Node mapping precheck succeeded
Policy precheck:
Policy precheck succeeded
Annealing.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 3 in 17000 iters and 0.10192 seconds
With 1 violations
Iters to find best score: 338
Violations: 1
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 0
desires: 1
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
Nodes:
VM-1 pc3
VM-10 pc3
VM-2 pc3
VM-3 pc3
VM-4 pc3
VM-5 pc3
VM-6 pc3
VM-7 pc3
VM-8 pc3
VM-9 pc3
lan/Lan pc3
End Nodes
Edges:
linklan/Lan/VM-1:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-2:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-3:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-4:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-5:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-6:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-7:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-8:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-9:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
linklan/Lan/VM-10:0 trivial pc3:loopback (pc3/null,(null)) pc3:loopback (pc3/null,(null))
End Edges
End solution
Summary:
pc3 11 vnodes, 0 nontrivial BW, 1000000 trivial BW, type=pcvm
Total physical nodes used: 1
End summary
ASSIGN FAILED:
Type precheck passed.
Node mapping precheck succeeded
Policy precheck succeeded
Annealing.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 3 in 17000 iters and 0.10192 seconds
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 0
desires: 1
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
comment:4 Changed 12 years ago by
Here is the VM allocation for the 10 experiments with 10 VMs each, in case it is of interest to anyone:
- 10vmslice1 - 10 VMs on pc5
- 10vmslice2 - 10 VMs on pc3
- 10vmslice3 - 10 VMs on pc5
- 10vmslice4 - 10 VMs on pc3
- 10vmslice5 - 7 VMs on pc5 + 3 VMs on pc3
- 10vmslice6 - 2 VMs on pc5 + 6 VMs on pc3 + 2 VMs on pc1
- 10vmslice7 - 10 VMs on pc2
- 10vmslice8 - 10 VMs on pc4
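For reference, the allocation listed above can be tallied per physical node with a short sketch. The data is copied from the list; the variable and function names are illustrative only.

```python
from collections import Counter

# VM placement per slice, copied from the list above.
ALLOCATION = {
    "10vmslice1": {"pc5": 10},
    "10vmslice2": {"pc3": 10},
    "10vmslice3": {"pc5": 10},
    "10vmslice4": {"pc3": 10},
    "10vmslice5": {"pc5": 7, "pc3": 3},
    "10vmslice6": {"pc5": 2, "pc3": 6, "pc1": 2},
    "10vmslice7": {"pc2": 10},
    "10vmslice8": {"pc4": 10},
}


def vms_per_host(allocation: dict) -> Counter:
    """Sum the number of VMs placed on each physical node across all slices."""
    totals = Counter()
    for placement in allocation.values():
        totals.update(placement)  # Counter.update adds counts from a mapping
    return totals


totals = vms_per_host(ALLOCATION)
for host in sorted(totals):
    print(host, totals[host])
```

This gives 29 VMs each on pc5 and pc3, 10 each on pc2 and pc4, and 2 on pc1, for 80 VMs in total across the first 8 slices.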
comment:5 Changed 12 years ago by
Unable to create 1 experiment with 25 VMs when starting with the following available resources:
- 100 pcvm slots available on shared node pc3
- 100 pcvm slots available on shared node pc5
- pc1, pc2, and pc4 not available
The attempt to create a sliver (25vmslice1) failed with the following error:
*** Type precheck failed!
*** Type precheck failed!
*** ERROR: mapper: Unretriable error. Giving up.
seed = 1338507038
Physical Graph: 4
Calculating shortest paths on switch fabric.
Virtual Graph: 25
Generating physical equivalence classes:4
Type precheck:
*** 25 nodes of type pcvm requested, but only 20 available nodes of type pcvm found
*** Type precheck failed!
ASSIGN FAILED:
*** 25 nodes of type pcvm requested, but only 20 available nodes of type pcvm found
*** Type precheck failed!
I expected to be able to get 25 nodes across the two shared nodes (pc3 and pc5).
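To compare the mapper's view with what listresources advertised, the requested and available counts can be extracted from the precheck message. This is a minimal sketch, assuming the message wording is exactly as in the error above; the function name is hypothetical.

```python
import re

# Matches the type precheck line as it appears in the error above.
PRECHECK_RE = re.compile(
    r"(\d+) nodes of type (\S+) requested, "
    r"but only (\d+) available nodes of type \S+ found"
)


def type_shortfalls(mapper_output: str) -> list:
    """Return unique (node_type, requested, available) tuples, in order seen."""
    results = []
    for match in PRECHECK_RE.finditer(mapper_output):
        entry = (match.group(2), int(match.group(1)), int(match.group(3)))
        if entry not in results:  # the message is repeated after ASSIGN FAILED
            results.append(entry)
    return results
```

For the error above this yields one entry: pcvm, 25 requested, 20 available, which is far below the 200 slots reported by listresources for pc3 and pc5.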
comment:6 Changed 12 years ago by
Description: modified
Backing off from testing!
With the following resources available:
- 100 pcvm slots available on shared node pc3
- 98 pcvm slots available on shared node pc5
- pc1, pc2, and pc4 not available
Tried to create one sliver with 20 nodes (20vmslice1), which resulted in the same failure reported yesterday afternoon (desires: 1).
*** ERROR: mapper: Reached run limit. Giving up.
seed = 1338510636
Physical Graph: 4
Calculating shortest paths on switch fabric.
Virtual Graph: 21
Generating physical equivalence classes:4
Type precheck:
Type precheck passed.
Node mapping precheck:
Node mapping precheck succeeded
Policy precheck:
Policy precheck succeeded
Annealing.
Adjusting dificulty estimate for fixed nodes, 1 remain.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 5.42 in 17000 iters and 0.470627 seconds
With 1 violations
Iters to find best score: 1
Violations: 1
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 0
desires: 1
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
Nodes:
VM-1 pc3
VM-10 pc5
VM-11 pc5
VM-12 pc3
VM-13 pc5
VM-14 pc5
VM-15 pc3
VM-16 pc5
VM-17 pc3
VM-18 pc5
VM-19 pc5
VM-2 pc5
VM-20 pc3
VM-3 pc3
VM-4 pc5
VM-5 pc3
VM-6 pc5
VM-7 pc3
VM-8 pc3
VM-9 pc3
lan/Lan pc5
End Nodes
Edges:
linklan/Lan/VM-1:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-2:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-3:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-4:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-5:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-6:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-7:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-8:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-9:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-10:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-13:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-14:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-15:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-16:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-17:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-18:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-19:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
linklan/Lan/VM-20:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-12:0 intraswitch link-pc3:eth1-procurve2:(null) (pc3/eth1,(null)) link-pc5:eth1-procurve2:(null) (pc5/eth1,(null))
linklan/Lan/VM-11:0 trivial pc5:loopback (pc5/null,(null)) pc5:loopback (pc5/null,(null))
End Edges
End solution
Summary:
procurve2 0 vnodes, 2000000 nontrivial BW, 0 trivial BW, type=
pc3 10 vnodes, 1000000 nontrivial BW, 0 trivial BW, type=pcvm
1000000 link-pc3:eth1-procurve2:(null)
pc5 11 vnodes, 1000000 nontrivial BW, 1000000 trivial BW, type=pcvm
1000000 link-pc5:eth1-procurve2:(null)
Total physical nodes used: 2
End summary
ASSIGN FAILED:
Type precheck passed.
Node mapping precheck succeeded
Policy precheck succeeded
Annealing.
Adjusting dificulty estimate for fixed nodes, 1 remain.
Doing melting run
Reverting: forced
Reverting to best solution
Done
BEST SCORE: 5.42 in 17000 iters and 0.470627 seconds
unassigned: 0
pnode_load: 0
no_connect: 0
link_users: 0
bandwidth: 0
desires: 1
vclass: 0
delay: 0
trivial mix: 0
subnodes: 0
max_types: 0
endpoints: 0
Another instance of the assignment failure occurred while running the test of 10 experiments with 10 VMs each. The test was started with plenty of resources available (pc5 had 97 slots and pc3 had 100 slots). Here is the sequence of events and allocation based on sliverstatus: