== Using Virtualization to Ease Emulab Upgrades ==
by Pat Gunn

=== Intuitions ===
* Upgrading Emulab is hard and risky
* We have sometimes had longer downtime than we would like
* Zero downtime and zero risk would be fantastic

=== Virtualization as a solution? ===
* Benefits:
  * Snapshots
  * Upgrades on a forked system
  * Grow the testbed more smoothly
  * Resilience against hardware failures
  * Easily upgrade hardware by allocating more resources from the resource pool
* Issues:
  * Networking
    * Configuration? Getting the VLANs right
  * Performance
    * FreeBSD vs. Linux - FreeBSD needs recent CPU hardware features to be virtualized efficiently
    * General performance - will UDP performance be good enough? Will NFS performance?
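
The UDP question above can be probed with a very small throughput and loss check. The following is only a sketch: it runs sender and receiver on the loopback interface of one machine, whereas a real test would place them on two virtualized nodes (or use an existing tool such as iperf). The function name `measure_udp`, the `UdpResult` type, and all parameter defaults are assumptions made for illustration.

```python
import socket
import threading
import time
from dataclasses import dataclass


@dataclass
class UdpResult:
    sent: int           # datagrams handed to the kernel by the sender
    received: int       # datagrams the receiver actually read
    loss: float         # fraction of datagrams that never arrived
    mbit_per_s: float   # payload throughput seen by the receiver


def measure_udp(host="127.0.0.1", payload_size=1024, count=1000, timeout=0.5):
    """Send `count` UDP datagrams to a local receiver and report loss/throughput."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind((host, 0))            # port 0: let the OS pick a free port
    receiver.settimeout(timeout)        # stop waiting once traffic dries up
    port = receiver.getsockname()[1]

    received = 0
    last_arrival = None

    def drain():
        nonlocal received, last_arrival
        while received < count:
            try:
                receiver.recv(payload_size)
            except socket.timeout:
                break                   # remaining datagrams are counted as lost
            received += 1
            last_arrival = time.perf_counter()

    thread = threading.Thread(target=drain)
    thread.start()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    start = time.perf_counter()
    for _ in range(count):
        sender.sendto(payload, (host, port))
    thread.join()
    sender.close()
    receiver.close()

    elapsed = (last_arrival - start) if last_arrival is not None else 0.0
    mbit = (received * payload_size * 8 / elapsed / 1e6) if elapsed > 0 else 0.0
    return UdpResult(count, received, 1.0 - received / count, mbit)


if __name__ == "__main__":
    print(measure_udp())
```

UDP gives no delivery guarantee, so the loss fraction is as interesting as the throughput figure; comparing both between a virtualized and a bare-metal node is one way to decide whether performance is "good enough".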

=== Solution so far ===
* VMware ESX
  * Expensive, but we already have the license
* Boss and ops live in a cloud distant from the machines they manage. We might move them.
* Systems connect through virtual ports on real switches
* We can only safely snapshot nodes that are down (on Linux, this is slightly more flexible)
* Performance tests so far show good UDP and TCP performance
  * We will see how this holds up in practice as we scale up
* Imagined upgrade process:
  * Disable logins and testbed daemons
  * Take testbed down
  * Clone the system
  * Take testbed back up
  * Boot clones of boss/ops, isolated from the real versions
  * Upgrade clones
  * If the upgrade is not successful, delete the clones and complain to Utah
  * If the upgrade is successful, shut down the old boss/ops, then massage database and experiment disk state changes into the upgraded boss/ops? Not clear on this part.
* Fallback upgrade process: like a normal upgrade, but with a very good backup beforehand
* Other nice things:
  * Virtual switches, virtual power controllers?
  * Compute nodes in a cloud? (PlanetLab-esque?) Virtualization technologies are an active area of research at CMU/PDL
  * Boss/ops can run on the same suitably powerful system
  * Storage separate from boss/ops nodes
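
The imagined upgrade process above can be written down as a small control-flow sketch. Everything here is hypothetical: `FakeHypervisor` is an in-memory stand-in, not the VMware API, and `upgrade_via_clones` and the `-clone` naming are invented for illustration. The final step (merging database and experiment disk state into the upgraded boss/ops) is deliberately left out, since it is still an open question.

```python
from typing import Callable, Dict, Iterable, List


class FakeHypervisor:
    """In-memory stand-in for a hypervisor control API (hypothetical)."""

    def __init__(self, names: Iterable[str]) -> None:
        # VM name -> state; all VMs start out running on the real network.
        self.vms: Dict[str, dict] = {
            n: {"running": True, "isolated": False} for n in names
        }

    def power_off(self, name: str) -> None:
        self.vms[name]["running"] = False

    def power_on(self, name: str, isolated: bool = False) -> None:
        self.vms[name].update(running=True, isolated=isolated)

    def clone(self, name: str, clone_name: str) -> None:
        # Mirrors the constraint that only powered-off nodes are safe to copy.
        if self.vms[name]["running"]:
            raise RuntimeError(f"refusing to clone running VM {name}")
        self.vms[clone_name] = {"running": False, "isolated": False}

    def delete(self, name: str) -> None:
        del self.vms[name]


def upgrade_via_clones(
    hv: FakeHypervisor, names: List[str], run_upgrade: Callable[[str], bool]
) -> bool:
    """Upgrade clones of the given VMs; roll back by deleting them on failure."""
    # Disabling logins and testbed daemons would happen here, before shutdown.
    for n in names:
        hv.power_off(n)                    # take testbed down
    clones = [f"{n}-clone" for n in names]
    for n, c in zip(names, clones):
        hv.clone(n, c)                     # clone the system
    for n in names:
        hv.power_on(n)                     # take testbed back up
    for c in clones:
        hv.power_on(c, isolated=True)      # boot clones away from the real network

    try:
        ok = all(run_upgrade(c) for c in clones)
    except Exception:
        ok = False

    if not ok:
        for c in clones:                   # failed: originals were never touched
            hv.power_off(c)
            hv.delete(c)
        return False

    for n in names:
        hv.power_off(n)                    # succeeded: retire the old boss/ops
    # Open question: state that changed on the originals since the clone
    # was taken still has to be carried over to the upgraded clones.
    return True


if __name__ == "__main__":
    hv = FakeHypervisor(["boss", "ops"])
    print(upgrade_via_clones(hv, ["boss", "ops"], lambda vm: True), hv.vms)
```

The appeal of this ordering is that the only downtime on the failure path is the time needed to take the clone; the originals are back up before any upgrade work starts.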

Our solution is in the very early stages of deployment. The new testbed we're building will have about 120 nodes, which should be big enough to tell us whether this is reasonable as a longer-term solution.

Mitch adds: We're looking into other virtualization solutions that are more scriptable than VMware.