Plastic Slices sandbox page
Random notes for Plastic Slices stuff.
Environment
The tools we use to wrangle Plastic Slices have a variety of requirements for your environment on the system where you want to use them:
- You should have an up-to-date copy of the syseng Subversion repository.
- ~/rspecs should be a symlink to .../syseng/geni/share/rspecs.
- ~/slices should be a directory, and ~/slices/plastic-slices should be a symlink to .../syseng/geni/share/experiment-setup/plastic-slices.
- ~/bin/omni and ~/bin/readyToLogin should be copies of (or symlinks to) the current GCF release, and you should have (or know how to add) the GCF directory to your $PYTHONPATH.
- ~/bin/shmux should be a copy of the 'shmux' executable. (If you don't have it, try jericho:/home/jbs/bin/shmux, or ask someone else who has run Plastic Slices before.)
- Your ~/.ssh/config file should include "StrictHostKeyChecking no". (FIXME: It'd be better if this were in the ~/.ssh/config section for each host, instead of being a global requirement.)
- ~/.gcf should be your Omni/GCF directory, and you should not mind if cached user and slice credentials are stored there.
- Your default project in your Omni config file should be 'gpo-infra'.
- The [omni] section of your omni_config should include the line 'useslicemembers = True'.
- You should identify a system where you can run OpenFlow controllers, and an OpenFlow controller that you can run there (eight times, once per slice).
- The system where you plan to process logs and generate graphs should have a group named "gpo", and you should be a member of it.
- The system where you plan to process logs and generate graphs should be an Ubuntu 12.04 system with python-rrdtool installed.
- You should run the various commands all in one shell, because some of the later steps assume that you've run the commands in some of the previous steps. You can run some things in other windows if you know what you're doing, but if you're wrong, things won't work as you expect.
- You should have the following shell functions or aliases (e.g. in your .bashrc):
alias shmux='shmux -Sall -m -B -M 20'
logins () { for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; $* ~/slices/*/logins/logins-$slicename.txt >| $loginfile ; done ; logins=$(for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; cat $loginfile ; done) ; }
somni () { slicename=$1 ; rspec=$2 ; am=$(grep AM: $rspec | sed -e 's/^AM: //') ; }
This list is intended to be complete, but if we've forgotten something, you may get an error when you try to use some of those tools. As a corollary, if you do get an error, check with someone else to see if the tool works for them, and look for ways in which your environment might be different (and if those differences aren't on this list, add them).
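Several of the requirements above are simple file tests, so they can be checked mechanically. Here is a minimal sketch of that, not part of the existing tooling: the name check_ps_env is made up for this example, it assumes your Omni config file is ~/.gcf/omni_config (the list above doesn't say where it lives), and it doesn't try to check things like the Subversion checkout, the "gpo" group, or python-rrdtool.

```shell
# Sketch: report which Plastic Slices environment requirements look missing.
# check_ps_env is a hypothetical helper; pass a directory to check instead of $HOME.
check_ps_env () (
    home=${1:-$HOME}
    status=0
    complain () { echo "MISSING: $*" ; status=1 ; }

    [ -L "$home/rspecs" ] || complain "~/rspecs should be a symlink"
    [ -d "$home/slices" ] || complain "~/slices should be a directory"
    [ -L "$home/slices/plastic-slices" ] || complain "~/slices/plastic-slices should be a symlink"
    for tool in omni readyToLogin shmux ; do
        [ -x "$home/bin/$tool" ] || complain "~/bin/$tool should be executable"
    done
    grep -qi 'StrictHostKeyChecking[ =]*no' "$home/.ssh/config" 2>/dev/null \
        || complain "~/.ssh/config should include 'StrictHostKeyChecking no'"
    # Assumption: the Omni config file is ~/.gcf/omni_config.
    grep -qi '^ *useslicemembers *= *True' "$home/.gcf/omni_config" 2>/dev/null \
        || complain "omni_config should include 'useslicemembers = True'"
    return $status
)
```

Running check_ps_env with no arguments checks your own home directory; no output (and a zero exit status) means none of these particular checks failed.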
Starting and ending a run
This describes how to start and end a Plastic Slices run.
Note that the same person doesn't have to end one run and start the next; but it's much easier for the person who started a run to also end that run. Others in the gpo-infra group can log in to the various nodes, and manipulate the slices and slivers, but things like the Zoing config files, cron jobs, and log files, are all under the user who started them up.
Starting
Synch your working dir
Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
svn update
svn status
Note that it's fine to have uncommitted changes in directories not related to Plastic Slices (or Tarvek, Zoing, the rspec templates, etc), if you're in the middle of working on other things in the syseng repo. But maybe you would like to commit those changes now, just to be on the safe side.
Update the config
Update ~/slices/plastic-slices/config/slices.json with any changes for this run. Likely changes to think about include:
- Adding or removing aggregates.
- Changing which aggregates are in which slices.
- Changing openflow_controller to point to your personal controller.
- Changing rspec_template_root to point to the directory where you personally have the rspec templates.
Update ~/slices/plastic-slices/config/pairmap.json with any changes for this run. At this point, we're maintaining the file by hand, so that we can preserve specific pairs from run to run. The pairs we're preserving are:
source | destination | TCP | UDP |
bbn-exogeni | max-instageni | ps103 | ps108 |
clemson-instageni | wisconsin-instageni | ps105 | ps110 |
fiu-exogeni | bbn-exogeni | ps104 | ps107 |
fiu-exogeni | bbn-instageni | ps103 | ps108 |
gatech-instageni | northwestern-instageni | ps106 | ps107 |
kansas-instageni | northwestern-instageni | ps105 | ps108 |
nyu-instageni | utahddc-instageni | ps106 | ps109 |
sox-instageni | illinois-instageni | ps104 | ps109 |
stanford-instageni | bbn-instageni | ps106 | ps109 |
If you add a new aggregate, make sure not to break up those pairs.
If for some reason you want to generate a new random pairmap, the Tarvek 00README file has docs for how to do that.
Generate the rest of the configuration:
cd ~/slices/plastic-slices
python ~/tarvek/generate-experiment-config.py ./config/slices.json ./config/pairmap.json ./wiki-source.txt
svn rm $(svn st | grep '^!' | awk '{ print $2; }')
svn add $(svn st | grep '^?' | awk '{ print $2; }')
Note that the 'svn rm' and 'svn add' will return an error message if there's nothing to remove or add (respectively), like "svn: Not enough arguments provided"; that's fine, and is safe to ignore.
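If you'd rather not see the "Not enough arguments provided" error at all, one option is to feed the paths through xargs -r, which (in GNU xargs) skips running the command when there's no input. This is only a sketch: svn_paths_with_status is a made-up helper name, and like the commands above it assumes paths without spaces in them.

```shell
# Sketch: read "svn status" output on stdin and print the paths whose status
# column (the first character of the line) matches the given character,
# e.g. '!' for missing files or '?' for unversioned ones.
# svn_paths_with_status is a hypothetical helper name.
svn_paths_with_status () {
    awk -v want="$1" 'substr($0, 1, 1) == want { print $2 }'
}

# Intended use (left commented out so that pasting this block changes nothing):
#   svn st | svn_paths_with_status '!' | xargs -r svn rm
#   svn st | svn_paths_with_status '?' | xargs -r svn add
```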
Review to make sure that things look right, then commit that to Subversion:
svn commit
Create slivers
Set the list of slices:
slices=$(echo ps{103..110})
Renew the slices to expire in 55 days:
renewdate="$(date +%Y-%m-%d -d 'now + 55 days') 23:00 UTC"
for slicename in $slices ; do omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml renewslice $slicename "$renewdate" ; done
Fetch your user and slice credentials:
(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
Set up variables to create the slivers:
declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done | wc
The last two echo lines are a good place to sanity-check that things are as you expect: The first should list an rspec for every sliver you expect to create, and the second should list a count of them. There should be one line per slice, and probably a few hundred rspecs, but the exact number will depend on how many aggregates you have in each slice.
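If the totals from wc look off, a per-slice count can make it easier to spot which slice is short. A minimal sketch of that follows; count_rspecs is a made-up helper name, and it counts the *.rspec files directly on disk rather than reading the rspecs array.

```shell
# Sketch: print "slicename count" for each slice, counting request rspecs.
# count_rspecs is a hypothetical helper; the first argument is the request
# directory (~/rspecs/request in the setup above), the rest are slice names.
count_rspecs () (
    root=$1 ; shift
    for slicename in "$@" ; do
        count=$(find "$root/$slicename" -maxdepth 1 -name '*.rspec' 2>/dev/null | wc -l)
        # $((count)) strips any padding that wc adds around the number.
        echo "$slicename $((count))"
    done
)
```

For example, count_rspecs ~/rspecs/request $slices prints one line per slice; a slice with a count of 0, or one that differs a lot from its neighbors, is worth a second look before you create any slivers.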
Actually create the slivers:
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am createsliver $slicename $rspec & done ; sleep 5m ; done
Some notes about that:
- The combination of (a) the ampersand; and (b) the sleep 5m at the end; means that this (a) fires off a createsliver for every aggregate in the slice and runs them all in parallel in the background; (b) sleeps for five minutes between slices, to avoid swamping any aggregates with too many requests at once. That 5m seems to work well for not crashing FV and not overloading InstaGENI, but it could potentially be cranked down if both of those improve.
- This doesn't capture output at all. We could potentially add something to stuff the output into one giant file, but it might be a little hard to sort out, since output is coming back from all of the slivers at once, all intermingled together. We could have each createsliver write an output file, but we'd need to be careful to name them and save them so that the output file from an aggregate in one slice wouldn't overwrite the output from the same aggregate in another slice. For now, we just check later to see what worked and what didn't, and try again by hand if it's not obvious why some things didn't work.
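For the second note, one way to keep per-sliver output files from overwriting each other would be to name each file after both the slice and the rspec, the same way the sliverstatus step later on this page uses one directory per slice. This is a sketch of the naming only, not something we actually do today; createsliver_outfile and the ~/tmp/createsliver directory are made up for this example.

```shell
# Sketch: derive a per-slice, per-aggregate output file name from an rspec path,
# so that output from the same aggregate in two slices never collides.
# createsliver_outfile and the ~/tmp/createsliver default are hypothetical.
createsliver_outfile () (
    slicename=$1
    rspec=$2
    outdir=${3:-$HOME/tmp/createsliver}
    echo "$outdir/$slicename/$(basename "$rspec" .rspec).out"
)

# Possible use inside the createsliver loop (commented out; untested against real aggregates):
#   outfile=$(createsliver_outfile $slicename $rspec) ; mkdir -p $(dirname $outfile)
#   omni ... -a $am createsliver $slicename $rspec >& $outfile &
```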
Wait for all of the createsliver calls to finish; check that there isn't anything still running in the background:
jobs
If there's no output from that, everything's done, and you can continue.
Renew slivers
Renew the Utah slivers, which default to expiring in six hours:
declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec | grep utah | egrep -v '(openflow|vts)') ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
renewdate="$(date +%Y-%m-%d -d 'now + 4 days') 23:00 UTC"
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am renewsliver $slicename "$renewdate" & done ; sleep 5s ; done
Set a reminder for yourself to renew those in four days. (Something in your calendar, a cron job, a mental note to watch your e-mail for expiration warnings the day before they expire, etc.)
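Gather up expiration information for everything, and stuff it into a results file. The previous step narrowed rspecs to just the Utah rspecs, so to cover every sliver, first re-run the "Set up variables" commands from "Create slivers" to reset it. Then: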
for slicename in $slices
do
  cd
  rm -rf ~/tmp/renewsliver/$slicename
  mkdir -p ~/tmp/renewsliver/$slicename
  cd ~/tmp/renewsliver/$slicename
  for rspec in ${rspecs[$slicename]} ; do outfile=$(echo $(basename $rspec) | sed -e 's/.rspec$//') ; somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename >& $outfile ; done
  cd ~/tmp/renewsliver/$slicename
  grep -h _expires * >> results.txt
  for i in * ; do grep _expires $i > /dev/null || echo "no 'expires' lines in $i" ; done >> results.txt
done
Set some variables to match the dates you expect things to expire on (these are just examples, and may need to be edited):
mm_dd="05-15"
mon_day="Apr 28"
Look for anomalies in the results files:
cd ~/tmp/renewsliver
for slicename in $slices ; do echo "==> $slicename" ; grep foam_expires $slicename/results.txt ; done | grep -v "$mm_dd"
for slicename in $slices ; do echo "==> $slicename" ; grep orca_expires $slicename/results.txt ; done | grep -v "$mon_day"
for slicename in $slices ; do echo "==> $slicename" ; grep pg_expires $slicename/results.txt ; done | grep -v "$mm_dd"
for slicename in $slices ; do echo "==> $slicename" ; grep "no 'expires' lines" $slicename/results.txt ; done
If you find anomalies, you'll probably need to go back to the original output files to figure out where they came from.
This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.)
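The anomaly checks above print a "==> slicename" header for every slice, whether or not it has a problem. If you'd prefer output only when something is wrong, here's a sketch of a variant that prefixes each unexpected line with its slice name instead. expiry_anomalies is a made-up helper name, and it relies only on the results.txt layout created above (one results.txt per slice directory).

```shell
# Sketch: for one expiry key, print "slicename: line" for every results.txt
# line that does NOT mention the expected date. Prints nothing if all is well.
# expiry_anomalies is a hypothetical helper.
# Arguments: key, expected date string, results root directory, slice names...
expiry_anomalies () (
    key=$1 ; expected=$2 ; root=$3 ; shift 3
    for slicename in "$@" ; do
        grep -- "$key" "$root/$slicename/results.txt" 2>/dev/null \
            | grep -v -- "$expected" \
            | sed -e "s/^/$slicename: /"
    done
    return 0
)
```

For example, expiry_anomalies pg_expires "$mm_dd" ~/tmp/renewsliver $slices would stay silent if every pg_expires line mentions the expected date.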
Get login info
Get login info:
cd ~/slices/plastic-slices/ssh_config
for slicename in $slices ; do ams="" ; for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; ams="$ams -a $am" ; done ; readyToLogin --no-keys --output --prefix=$slicename --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml $ams $slicename ; done
for slicename in $slices ; do mv -f $slicename-sshconfig.txt $slicename ; rm -f $slicename*.xml $slicename*.json $slicename-logininfo.txt ; done
Extract your login info from those files, and put it into your ~/.ssh/config file, via whatever means you find appealing.
Find old SSH keys for IP addresses that ExoGENI has reused, and print lines to remove them:
logins grep -- -eg-
for login in $logins ; do ssh $login true |& grep ssh-keygen | sed -e 's/remove with://' ; done
Copy and paste the output (simply exec-ing it doesn't seem to work, and we haven't debugged why); then repeat the above and expect no output.
Test logins
Make sure you can log in, and that each login's hostname is as expected:
logins cat
shmux -c "hostname" $logins | egrep -v '(.+): \1'
Expect no output from that, except possibly messages about new SSH keys. Run it again in that case, and address any other issues if you get any output.
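To make it clearer what that egrep filter keeps and drops, here's a small sketch that wraps it in a function you can feed sample lines. It assumes the "login: output" line shape that the regular expression itself implies, and it uses grep -E, which is the same thing as egrep (newer versions of GNU grep print a warning for the egrep name). Note that the check is fairly loose: a line is dropped whenever the text right after ": " repeats the text right before it, so it won't catch every possible mismatch. hostname_mismatches is a made-up name.

```shell
# Sketch: read "login: hostname" lines on stdin and print only the lines in
# which the text after ": " does not repeat the text just before it.
# Back-references in grep -E are a GNU extension, as in the egrep command above.
# hostname_mismatches is a hypothetical helper name.
hostname_mismatches () {
    grep -E -v '(.+): \1'
}
```

Feeding it a line like "bbn-ig-ps104-a: bbn-ig-ps104-a.example.net" prints nothing, while "bbn-ig-ps104-b: localhost" is printed as a mismatch (both lines are invented examples).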
This will often expose errors of the form "I can't log in to my hosts at this aggregate, for some reason". Fix any of those before continuing.
For example, if an InstaGENI rack sliver's VMs fail to boot, you can delete it and re-create it (BBN IG in ps104 in this example):
somni ps104 ~/rspecs/request/ps104/bbn-instageni-ps104.rspec
omni -a $am deletesliver $slicename
omni -a $am createsliver $slicename $rspec
You can then watch the spew log URL (in the createsliver output, before the manifest), or run sliverstatus to check the status:
omni -a $am sliverstatus $slicename |& grep _status
Watching the spew log URL is usually a better bet if you can.
Once you can log in everywhere, commit to Subversion the changes to ~/slices/plastic-slices/ssh_config:
svn commit
If you want to copy any of your personal dotfiles to each host, to customize your own personal environment there, now would be an opportune time to do that, since you're about to start running commands on the hosts. If you don't, you can safely skip this step. Josh used to copy the files in his ~/.cfhome directory, like so:
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/.cfhome/ $login: && echo $login ; done & done
Start OpenFlow controllers
If your OpenFlow controllers aren't already running, start them up before continuing. http://groups.geni.net/syseng/wiki/POX has more information about how to do this, if you want to use POX. Here are the essentials:
mkdir -p ~/pox
port=33101 ; ~/bin/python ~/src/pox/pox.py py openflow.of_01 --port=${port} misc.full_payload geni_l2_learning samples.pretty_log log.level --WARNING log --*TimedRotatingFile=filename=$HOME/pox/pox-${port}.log,when=D,backupCount=2 --no-default geni_requests
from geni_requests import GENIOFRequestHandler
req = GENIOFRequestHandler()
req.print_dpids()
That will start a controller listening on port 33101; you'll need to repeat that for each port, in a different window. One way that works well is to do this under 'screen' on your OF controller host.
Test connectivity
Copy in connectivity test files:
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/*/reachability/addrs-$slicename.conf $login:pingtest.conf && echo $login ; done & done
Log in to one host in each slice, and test connectivity:
fping -q -c 10 < pingtest.conf |& grep -v "ICMP Host Unreachable"
If anything isn't reachable, debug why not.
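When there are a lot of addresses in pingtest.conf, it can help to boil the fping summary down to just the hosts that lost packets. Here's a sketch of that; fping_lossy_hosts is a made-up helper name, and it assumes the usual "host : xmt/rcv/%loss = sent/received/loss%" summary lines that fping prints with -q -c (check that this matches the fping version on your hosts).

```shell
# Sketch: read fping -q -c summary lines on stdin and print "host loss%" for
# every host that lost at least one packet.
# fping_lossy_hosts is a hypothetical helper; the expected line shape is
#   host : xmt/rcv/%loss = 10/10/0%, min/avg/max = ...
fping_lossy_hosts () {
    awk '/xmt\/rcv\/%loss/ {
        split($5, parts, "/")      # $5 looks like 10/0/100% or 10/10/0%,
        loss = parts[3]
        sub(/%.*/, "", loss)       # drop the percent sign and anything after it
        if (loss + 0 > 0) print $1, loss "%"
    }'
}

# Possible use on a host (fping writes its summary to stderr, hence the |&):
#   fping -q -c 10 < pingtest.conf |& fping_lossy_hosts
```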
Set up Zoing
Copy in Zoing stuff:
shmux -c 'mkdir -p bin' $logins
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/zoing/zoing $login:bin/zoing && echo $login ; done & done
shmux -c 'sudo mv bin/zoing /usr/local/bin/zoing' $logins
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/zoing/zoingrc-$login $login:.zoingrc && echo $login ; done & done
Copy in traffic-shaping stuff:
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/tc-shape-eth1-ten-mbps $login:tc-shape-eth1-ten-mbps && echo $login ; done & done
shmux -c 'sudo chown root:root tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo mv tc-shape-eth1-ten-mbps /etc/init.d/tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo ln -s ../init.d/tc-shape-eth1-ten-mbps /etc/rc2.d/S99tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo service tc-shape-eth1-ten-mbps start' $logins
Fire up Zoing:
shmux -c "zoing activate" $logins
Final prep work
Create a directory for logs, and copy other files into it:
subdir=<a subdirectory>
mkdir -p ~/tmp/plastic-slices/$subdir/logs
cp ~/slices/plastic-slices/config/*json ~/tmp/plastic-slices/$subdir
rsync -avC ~/slices/plastic-slices/hosts/ ~/tmp/plastic-slices/$subdir/00hosts
rsync -avC ~/slices/plastic-slices/logins/ ~/tmp/plastic-slices/$subdir/00logins
rsync -avC ~/slices/plastic-slices/ssh_config/ ~/tmp/plastic-slices/$subdir/00ssh_config
Create a wiki page for this run: http://groups.geni.net/geni/wiki/PlasticSlices/Continuation has sub-pages for the various runs, so one good way to do this is:
- Create a new sub-page for this run.
- Copy the text from the sub-page for the previous run before this one, from the start of the page, up to and including the "Everything below this point ..." line.
- Edit that text to refer to this run.
- Copy in the wiki-source.txt file that Tarvek generated earlier, after the "Everything below this point ..." line.
Send mail to gpo-tech letting folks know. Just mention that you've started up a run, link to the wiki page, and include the "In this run <here's what's new>" line from the wiki page.
Ending
Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
svn update
svn status
Set the list of slices:
slices=$(echo ps{103..110})
Fetch your user and slice credentials:
(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):
logins cat
shmux -c "zoing deactivate" $logins
Wait for the current run to finish, typically 56 minutes past the hour.
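If you'd rather have the shell do the waiting, here's a sketch that works out how many seconds remain until the clock next shows a given minute. seconds_until_minute is a made-up helper name; the 56 comes from the "typically 56 minutes past the hour" above, so treat it as a rule of thumb and still check that things have actually finished.

```shell
# Sketch: seconds from "now" until the next time the clock shows the target minute.
# Arguments: target minute, then optionally the current minute and second
# (they default to the current time). seconds_until_minute is a hypothetical helper.
seconds_until_minute () (
    target=$1
    now_min=${2:-$(date +%M)}
    now_sec=${3:-$(date +%S)}
    # Strip one leading zero so that 08 and 09 aren't read as invalid octal numbers.
    now_min=${now_min#0} ; now_min=${now_min:-0}
    now_sec=${now_sec#0} ; now_sec=${now_sec:-0}
    elapsed=$(( now_min * 60 + now_sec ))
    echo $(( (target * 60 - elapsed + 3600) % 3600 ))
)

# Possible use:
#   sleep $(seconds_until_minute 56)
```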
Check that all sources are shut down ("-a" nodes):
logins grep -- -a
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
Reset everything, and make sure that everything is shut down:
logins cat
shmux -c "zoing reset" $logins
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
Fetch logs one last time, and upload them to the webserver.
Delete all of the slivers, to start the next run with a clean slate:
declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
Confirm that everything's gone:
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
Update the wiki page for this run with any final details (e.g. when the run ended).
To do
Here are some random things I've jotted down that I'd like to do:
- Add a way to positively confirm that slivers *don't* exist
- Add a way to show more concise sliver status -- not four+ lines per sliver
- Add a way to supply a parameter to test against, like "this date" for expiry
- Add a way to save all omni output in files, so I can look up what happened if something goes wrong
- Maybe use vxargs to parallelize omni for some things? Sliver deletion takes freakin' forever. Or just a loop, do ten slices in parallel, although this won't help for single big slices. Maybe parallelize across one slice would be better, so it hits all the aggregates once, then again, etc.
Some of those would end up on my "slice notes" sandbox page, but they affect Plastic Slices the most (because of its scale), so they're here for now. Or I might add them to Tarvek; we'll see.
Fetch logs
I run all this stuff on anubis.
Pull them into a subdirectory of my temp log processing directory:
subdir=<a subdirectory>
mkdir -p ~/tmp/plastic-slices/$subdir/logs
logins grep -- -a
shmux -c "sed -i -e '/nanosleep failed:/d' zoing-logs/zoing*log" $logins
logins cat
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a $login:zoing-logs/ ~/tmp/plastic-slices/$subdir/logs/$login && echo $login ; done & done
Remove the last day's PNG files and the "all" PNG files, to make sure we regenerate them:
lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')
rm ~/tmp/plastic-slices/$subdir/pngs/*/*/*all*png ~/tmp/plastic-slices/$subdir/pngs/*/*/*daily-$lastday*png
Plot graphs:
firstlog=$(find ~/tmp/plastic-slices/$subdir/logs/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | head -1)
lastlog=$(find ~/tmp/plastic-slices/$subdir/logs/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | tail -1)
time python ~/tarvek/generate-graphs.py --progress --mainconfig=~/tmp/plastic-slices/$subdir/slices.json --pairmap=~/tmp/plastic-slices/$subdir/pairmap.json --rootdir=~/tmp/plastic-slices/$subdir --starttime=$firstlog --endtime=$lastlog
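If the logs directory is empty (say, because the rsync step didn't fetch anything), firstlog and lastlog both end up empty, and the graph script gets blank start and end times. Here's a sketch of the same find/sed/sort pipeline wrapped in a function that complains instead. log_time_range is a made-up helper name, and the file names in the comments are invented examples of the zoing-<timestamp>.log pattern.

```shell
# Sketch: print "first last" zoing log timestamps found under the given
# directories, or fail with a message if there are no logs at all.
# log_time_range is a hypothetical helper; for a file named like
# zoing-20140428.13.log, the timestamp is the part between "zoing-" and ".log".
log_time_range () (
    stamps=$(find "$@" -name 'zoing-*.log' -print 2>/dev/null \
        | sed -e 's/.*zoing-\(.*\)\.log/\1/' | sort)
    [ -n "$stamps" ] || { echo "no zoing logs found under: $*" >&2 ; return 1 ; }
    echo "$(echo "$stamps" | head -n 1) $(echo "$stamps" | tail -n 1)"
)

# Possible use:
#   range=$(log_time_range ~/tmp/plastic-slices/$subdir/logs/*-b) || echo "fetch logs first"
```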
Push everything up to the webserver:
chgrp -R gpo ~/tmp/plastic-slices/$subdir
rsync -av ~/tmp/plastic-slices/$subdir www.gpolab.bbn.com:/srv/www/plastic-slices/continuation
Checking in
From the wiki page for a run, browse to the directory with graphs and logs for that run. Look at:
- The page with a graph for traffic to each destination host for the whole run, organized by aggregate. This is a good way to identify aggregates that aren't working well in any slice, due to some aggregate-wide problem (like a connectivity issue).
- The page with a graph for traffic to each destination host for the whole run, organized by slice. This is a good way to identify slices that aren't working well at any aggregate, due to some slice-wide problem (like a controller issue).
Some notes about the graphs:
- On the TCP graphs, large blocks of green with a flat top are good -- they show good throughput flowing. Jagged tops and gaps of white are a bad sign.
- On the UDP graphs, large blocks of white are good -- they show zero packet loss. Any red is a bad sign; red *below* zero indicates a log file with no data, which may be a bad sign, or may be a known issue.
- The graphs with many hosts on a single graph are pretty hard to read, but they can sometimes help you spot other things you might want to look at.
Don't forget to reload the page after pushing new graphs.
The old way
This is how I used to check in, downloading graphs to my laptop to view (with an image viewer, 'gq' was an alias to one I liked).
On my laptop, copy down the graphs:
subdir=<a directory>
rsync -av --delete --delete-excluded anubis:tmp/plastic-slices/$subdir/pngs ~/tmp/plastic-slices/$subdir
Identify the last day we have graphs for:
lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')
Show the per-slice graphs of the most recent day:
gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-daily-$lastday.png
Show the per-host daily graphs for the most recent day:
gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily-$lastday.png
Show the per-slice graphs of the whole run:
gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-all.png
Show the per-host graphs of the whole run:
gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-all.png
Show the per-host daily graphs for all of the days:
gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily*.png
The older way
This is how I used to check in, using grep to scan log files; nowadays I'm using the graphs.
Get a quick summary of the current state of things (based on the last completed run; or change $timestamp to get a different run):
timestamp=$(date -d "now - 1 hour" +%Y%m%d.%H)
for subnet in {103..106}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==1' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n "$login to "
    grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }'
    grep /sec logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done
for subnet in {107..110}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==0' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n $(grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }')
    echo " to $login"
    egrep " 0.0-[^ ].+/sec" logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done