Version 27 (modified by Josh Smift, 10 years ago)

--

Plastic Slices sandbox page

Random notes for Plastic Slices stuff.

A lot of things that used to be here are now on my general "slice notes" sandbox page. What's left should in theory be pretty specific to Plastic Slices.

Environment

The tools we use to wrangle Plastic Slices have a variety of requirements for your environment on the system where you want to use them:

  • You should have an up-to-date copy of the syseng Subversion repository.
  • ~/rspecs should be a symlink to .../syseng/geni/share/rspecs.
  • ~/slices/plastic-slices should be a symlink to .../syseng/geni/share/experiment-setup/plastic-slices.
  • ~/bin/omni and ~/bin/readyToLogin should be copies of (or symlinks to) the current GCF release, and you should have (or know how to add) the GCF directory to your $PYTHONPATH.
  • ~/bin/shmux should be a copy of the 'shmux' executable.
  • Your ~/.ssh/config file should include "StrictHostKeyChecking no". (FIXME: It'd be better if this were in the ~/.ssh/config section for each host, instead of being a global requirement.)
  • ~/.gcf should be your Omni/GCF directory, and you should not mind if cached user and slice credentials are stored there.
  • Your default project in your Omni config file should be 'gpo-infra'.
  • Your 'users' list in your Omni config file should include the gpo-infra users.
  • You should identify a system where you can run OpenFlow controllers, and an OpenFlow controller that you can run there (eight times, once per slice).
  • The system where you plan to process logs and generate graphs should have a group named "gpo", and you should be a member of it.
  • The system where you plan to process logs and generate graphs should be an Ubuntu 12.04 system with python-rrdtool installed.
  • You should run the various commands all in one shell, because some of the later steps assume that you've run the commands in some of the previous steps. You can run some things in other windows if you know what you're doing, but if you're wrong, things won't work as you expect.
  • You should have the following shell functions or aliases (e.g. in your .bashrc):
    alias shmux='shmux -Sall -m -B -M 20'
    logins () { for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; $* ~/slices/*/logins/logins-$slicename.txt >| $loginfile ; done ; logins=$(for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; cat $loginfile ; done) ; }
    somni () { slicename=$1 ; rspec=$2 ; am=$(grep AM: $rspec | sed -e 's/^AM: //') ; }
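
As a rough illustration of what somni does: it remembers the slice name and rspec path, and pulls the aggregate URL out of a line in the rspec that starts with "AM: ". The sketch below assumes that line sits inside an XML comment; the rspec contents and the URL are made up for the example.

```shell
# Same helper as in the list above: sets $slicename, $rspec, and $am.
somni () { slicename=$1 ; rspec=$2 ; am=$(grep AM: $rspec | sed -e 's/^AM: //') ; }

# Hypothetical rspec for illustration; the aggregate URL is invented.
demo_rspec=$(mktemp)
cat > "$demo_rspec" <<'EOF'
<!--
AM: https://am.example.net:12369/protogeni/xmlrpc/am
-->
<rspec type="request"/>
EOF

somni ps104 "$demo_rspec"
echo "slice=$slicename am=$am"
rm -f "$demo_rspec"
```

After the call, $am can be passed to omni with -a, which is how the loops later on this page use it.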
    

This list is intended to be complete, but if we've forgotten something, you may get an error when you try to use some of those tools. If you do, check with someone else to see if the tools work for them, and look for ways in which your environment might differ from theirs (and if those differences aren't on this list, add them).
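
One way to look for environment differences is a small preflight check. This is only a sketch: check_ps_env is a hypothetical helper (it isn't part of the syseng tools), and it covers just the file-level requirements from the list above, not the Omni config or group membership.

```shell
# check_ps_env is a hypothetical helper, not part of the syseng tools.
# It prints one line for each requirement that is not met under the given
# home directory (default: $HOME), and prints nothing if all checks pass.
check_ps_env () {
  h=${1:-$HOME}
  [ -L "$h/rspecs" ] || echo "not a symlink: $h/rspecs"
  [ -L "$h/slices/plastic-slices" ] || echo "not a symlink: $h/slices/plastic-slices"
  for tool in omni readyToLogin shmux ; do
    [ -x "$h/bin/$tool" ] || echo "not executable: $h/bin/$tool"
  done
  [ -d "$h/.gcf" ] || echo "not a directory: $h/.gcf"
  grep -q 'StrictHostKeyChecking no' "$h/.ssh/config" 2>/dev/null \
    || echo "no 'StrictHostKeyChecking no' in $h/.ssh/config"
  return 0
}

check_ps_env
```

Comparing the output on two people's accounts is a quick way to spot which requirement differs.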

Ending and starting a run

This is how I end one Plastic Slices run, and start the next. These commands use techniques from my "slice notes" sandbox page, so before doing all this, I should double-check that this copy of those techniques is still accurate.

Ending

Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run

svn update
svn status

Set the list of slices:

slices=$(echo ps{103..110})

Fetch your user and slice credentials:

(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)

Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):

logins cat
shmux -c "zoing deactivate" $logins
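
The logins helper runs whatever command you give it over each slice's logins file and collects the results in $logins, so "logins cat" selects every host and "logins grep -- -a" selects only the "-a" hosts. Here's a sketch that shows the difference, using a scratch home directory and made-up login names so that nothing real is touched:

```shell
# Same helper as in the Environment section.
logins () { for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; $* ~/slices/*/logins/logins-$slicename.txt >| $loginfile ; done ; logins=$(for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; cat $loginfile ; done) ; }

# Scratch home directory with invented login files.
realhome=$HOME
HOME=$(mktemp -d)
mkdir -p ~/tmp ~/slices/plastic-slices/logins
printf '%s\n' bbn-ig-ps103-a bbn-ig-ps103-b > ~/slices/plastic-slices/logins/logins-ps103.txt
printf '%s\n' bbn-ig-ps104-a bbn-ig-ps104-b > ~/slices/plastic-slices/logins/logins-ps104.txt

slices="ps103 ps104"
logins cat ; all_logins=$logins
logins grep -- -a ; source_logins=$logins
echo "all:" $all_logins
echo "sources:" $source_logins

# Put the real home directory back.
HOME=$realhome
```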

Wait for the current run to finish, typically 56 minutes past the hour.

Check that all sources are shut down ("-a" nodes):

logins grep -- -a
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
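
The status check works by filtering out the one line that means "fully shut down", so any output at all is worth a look; the "|| true" keeps grep's non-zero exit status (when it prints nothing) from looking like a failure. A sketch of that idea, using stand-in functions instead of zoing itself; the "busy" line is a guess at the format, not real zoing output:

```shell
# Stand-in for "zoing status" on a fully shut-down node (taken from the filter string).
fake_status_down () { echo "-active -cron -running -processes" ; }
# Stand-in for a node that still has something going on (hypothetical format).
fake_status_busy () { echo "+active +cron -running -processes" ; }

down_out=$(fake_status_down | grep -v -- '-active -cron -running -processes' || true)
busy_out=$(fake_status_busy | grep -v -- '-active -cron -running -processes' || true)
echo "down: '$down_out'"
echo "busy: '$busy_out'"
```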

Reset everything, and make sure that everything is shut down:

logins cat
shmux -c "zoing reset" $logins
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins

Fetch logs one last time, and upload them to the webserver.

Delete all of the slivers, to start the next run with a clean slate:

declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done

Confirm that everything's gone:

for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni

Update the wiki page for this run with any final details (e.g. when the run ended).

Starting

Synch your working dir

Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run

svn update
svn status

Update the config

Update ~/slices/plastic-slices/config/slices.json with any changes for this run. Likely changes to think about include:

  • Adding or removing aggregates.
  • Changing which aggregates are in which slices.
  • Changing openflow_controller to point to your personal controller.
  • Changing rspec_template_root to point to the directory where you personally have the rspec templates.

Update ~/slices/plastic-slices/config/pairmap.json with any changes for this run. At this point, we're maintaining the file by hand, so that we can preserve specific pairs from run to run. The pairs we're preserving are:

source              destination             TCP    UDP
bbn-exogeni         max-instageni           ps103  ps108
clemson-instageni   wisconsin-instageni     ps105  ps110
fiu-exogeni         bbn-exogeni             ps104  ps107
fiu-exogeni         bbn-instageni           ps103  ps108
gatech-instageni    northwestern-instageni  ps106  ps107
kansas-instageni    northwestern-instageni  ps105  ps108
nyu-instageni       utahddc-instageni       ps106  ps109
sox-instageni       illinois-instageni      ps104  ps109
stanford-instageni  bbn-instageni           ps106  ps109

If you add a new aggregate, make sure not to break up those pairs.

If for some reason you want to generate a new random pairmap, the Tarvek 00README file has docs for how to do that.

Generate the rest of the configuration:

cd ~/slices/plastic-slices
python ~/tarvek/generate-experiment-config.py ./config/slices.json ./config/pairmap.json ./wiki-source.txt
svn rm $(svn st | grep ^! | awk '{ print $2; }')
svn add $(svn st | grep '^?' | awk '{ print $2; }')

Note that the 'svn rm' and 'svn add' will return an error message if there's nothing to remove or add (respectively), like "svn: Not enough arguments provided"; that's fine, and is safe to ignore.
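
If you'd rather not see that error, one option is to build each list first and only run the command when the list is non-empty. This is a sketch on made-up "svn st" output, with echo standing in for the real svn commands so that nothing gets changed; the patterns are quoted and anchored so that only the status column is matched.

```shell
# Sample "svn st" output, invented for illustration.
sample_status='!       rspecs/old.rspec
?       rspecs/new.rspec
M       config/slices.json'

# Missing files (status "!") and unversioned files (status "?").
missing=$(printf '%s\n' "$sample_status" | grep '^!' | awk '{ print $2; }')
unversioned=$(printf '%s\n' "$sample_status" | grep '^?' | awk '{ print $2; }')

# echo stands in for svn here; drop it to run the commands for real.
if [ -n "$missing" ] ; then echo svn rm $missing ; fi
if [ -n "$unversioned" ] ; then echo svn add $unversioned ; fi
```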

Review to make sure that things look right, then commit that to Subversion.

Create slivers

Set the list of slices:

slices=$(echo ps{103..110})

Renew the slices to expire in 55 days:

renewdate="$(date +%Y-%m-%d -d 'now + 55 days') 23:00 UTC"
for slicename in $slices ; do omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml renewslice $slicename "$renewdate" ; done

Fetch your user and slice credentials:

(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)

Set up variables to create the slivers:

declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done | wc

The last two echo lines are a good place to sanity-check that things are as you expect: The first should list an rspec for every sliver you expect to create, and the second should list a count of them. There should be one line per slice, and probably a few hundred rspecs, but the exact number will depend on how many aggregates you have in each slice.
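
If the overall count looks off, a per-slice count can help narrow down which slice is missing rspecs. Here's a sketch against a scratch directory that stands in for ~/rspecs/request; the file names are made up.

```shell
# Scratch stand-in for ~/rspecs/request, populated with empty files.
reqdir=$(mktemp -d)
for slicename in ps103 ps104 ; do mkdir -p "$reqdir/$slicename" ; done
touch "$reqdir/ps103/bbn-instageni-ps103.rspec" "$reqdir/ps103/fiu-exogeni-ps103.rspec"
touch "$reqdir/ps104/bbn-instageni-ps104.rspec"

# Count the rspecs in each slice, and keep a running total.
total=0
for slicename in ps103 ps104 ; do
  n=$(ls -1 "$reqdir/$slicename"/*.rspec | wc -l | tr -d ' ')
  echo "$slicename: $n rspecs"
  total=$((total + n))
done
echo "total: $total"
rm -rf "$reqdir"
```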

Actually create the slivers:

for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am createsliver $slicename $rspec & done ; sleep 5m ; done

Some notes about that:

  • Each createsliver is backgrounded with an ampersand, so all of the aggregates in a slice are contacted in parallel; the sleep 5m at the end of the outer loop then pauses for five minutes between slices, to avoid swamping any aggregate with too many requests at once. That 5m seems to work well for not crashing FV and not overloading InstaGENI, but it could potentially be cranked down if both of those improve.
  • This doesn't capture output at all. We could potentially add something to stuff the output into one giant file, but it might be a little hard to sort out, since output is coming back from all of the slivers at once, all intermingled together. We could have each createsliver write an output file, but we'd need to be careful to name them and save them so that the output file from an aggregate in one slice wouldn't overwrite the output from the same aggregate in another slice. For now, we just check later to see what worked and what didn't, and try again by hand if it's not obvious why some things didn't work.
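
For what it's worth, here's a sketch of the one-output-file-per-createsliver idea: each background job writes to a file in a per-slice directory, so the same aggregate in two slices can't collide. fake_omni is a stand-in so the example runs anywhere, and the directory layout is just one possibility, not something the current tools do.

```shell
# fake_omni stands in for the real omni call.
fake_omni () { echo "createsliver $1 using $2" ; }

outdir=$(mktemp -d)
for slicename in ps103 ps104 ; do
  mkdir -p "$outdir/$slicename"
  for rspec in bbn-instageni-$slicename.rspec fiu-exogeni-$slicename.rspec ; do
    # One file per slice and aggregate, e.g. ps103/bbn-instageni-ps103.out
    outfile=$outdir/$slicename/$(basename $rspec .rspec).out
    fake_omni $slicename $rspec > "$outfile" 2>&1 &
  done
done

# Block until every background job has finished, then list what was written.
wait
find "$outdir" -name '*.out' | sort
```

The scratch directory is left in place so the files can be inspected afterwards.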

Wait for all of the createsliver calls to finish; check that there isn't anything still running in the background:

jobs

If there's no output from that, everything's done, and you can continue.

Renew slivers

Renew the Utah slivers, which default to expiring in six hours:

declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec | grep utah | egrep -v '(openflow|vts)') ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done

renewdate="$(date +%Y-%m-%d -d 'now + 4 days') 23:00 UTC"
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am renewsliver $slicename "$renewdate" & done ; sleep 5s ; done

Set a reminder for yourself to renew those in four days. (Something in your calendar, a cron job, a mental note to watch your e-mail for expiration warnings the day before they expire, etc.)

Gather up expiration information for everything, and stuff it into a results file:

for slicename in $slices
do
  cd
  rm -rf ~/tmp/renewsliver/$slicename
  mkdir -p ~/tmp/renewsliver/$slicename
  cd ~/tmp/renewsliver/$slicename
  for rspec in ${rspecs[$slicename]} ; do outfile=$(echo $(basename $rspec) | sed -e 's/\.rspec$//') ; somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename >& $outfile ; done
  cd ~/tmp/renewsliver/$slicename
  grep -h _expires * >> results.txt
  for i in * ; do grep _expires $i > /dev/null || echo "no 'expires' lines in $i" ; done >> results.txt
done

Set some variables to match the dates you expect things to expire on (these are just examples, and may need to be edited):

mm_dd="05-15"
mon_day="Apr 28"

Look for anomalies in the results files:

cd ~/tmp/renewsliver
for slicename in $slices ; do echo "==> $slicename" ; grep foam_expires $slicename/results.txt ; done | grep -v "$mm_dd"
for slicename in $slices ; do echo "==> $slicename" ; grep orca_expires $slicename/results.txt ; done | grep -v "$mon_day"
for slicename in $slices ; do echo "==> $slicename" ; grep pg_expires $slicename/results.txt ; done | grep -v "$mm_dd"
for slicename in $slices ; do echo "==> $slicename" ; grep "no 'expires' lines" $slicename/results.txt ; done

If you find anomalies, you'll probably need to go back to the original output files to figure out where they came from.
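
The anomaly check is just "show me every expiration line that doesn't mention the date I expect". A sketch of that on made-up results lines; the real sliverstatus field layout may well differ, so treat the sample data as an assumption.

```shell
mm_dd="05-15"

# Invented results lines: one with the expected date, one without.
results='pg_expires: 2013-05-15T23:00:00
pg_expires: 2013-05-11T17:42:10'

# Anything that survives the grep -v is an anomaly.
anomalies=$(printf '%s\n' "$results" | grep pg_expires | grep -v "$mm_dd")
echo "$anomalies"
```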

This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.)

Get login info

Get login info:

cd ~/slices/plastic-slices/ssh_config
for slicename in $slices ; do ams="" ; for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; ams="$ams -a $am" ; done ; readyToLogin --no-keys --output --prefix=$slicename --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml $ams $slicename ; done
for slicename in $slices ; do mv -f $slicename-sshconfig.txt $slicename ; rm -f $slicename*.xml $slicename*.json $slicename-logininfo.txt ; done

Extract your login info from those files, and put it into your ~/.ssh/config file, via whatever means you find appealing.

Find old SSH keys for IP addresses that ExoGENI has reused, and print lines to remove them:

logins grep -- -eg-
for login in $logins ; do ssh $login true |& grep ssh-keygen | sed -e 's/remove with://' ; done

Copy and paste the output (simply exec-ing it doesn't seem to work, and we haven't debugged why); then repeat the above and expect no output.

Test logins

Make sure you can log in, and that each login's hostname is as expected:

logins cat
shmux -c "hostname" $logins | egrep -v '(.+): \1'

Expect no output from that, except possibly messages about new SSH keys. Run it again in that case, and address any other issues if you get any output.
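
The egrep there relies on each output line looking like "login: hostname": the back-reference drops every line where the hostname starts with the login name, so only mismatches are left. A sketch on made-up output (the host names and domain are invented, and the "login: output" prefix format is an assumption based on the pattern):

```shell
# Invented output: the first host matches its login name, the second doesn't.
sample='bbn-ig-ps104-a: bbn-ig-ps104-a.example.net
bbn-ig-ps104-b: some-other-host.example.net'

mismatches=$(printf '%s\n' "$sample" | grep -E -v '(.+): \1')
echo "$mismatches"
```

Note that the pattern isn't anchored, so a wrong hostname that happens to begin with the last few characters of the login name would slip through; that seems unlikely in practice.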

This will often expose errors of the form "I can't log in to my hosts at this aggregate, for some reason". Fix any of those before continuing.

For example, if an InstaGENI rack sliver's VMs fail to boot, you can delete it and re-create it (BBN IG in ps104 in this example):

somni ps104 ~/rspecs/request/ps104/bbn-instageni-ps104.rspec
omni -a $am deletesliver $slicename
omni -a $am createsliver $slicename $rspec

You can then watch the spew log URL (in the createsliver output, before the manifest), or run sliverstatus to check the status:

omni -a $am sliverstatus $slicename |& grep _status

Watching the spew log URL is usually a better bet if you can.

Once you can log in everywhere, commit to Subversion the changes to ~/slices/plastic-slices/ssh_config.

If you want to copy any of your personal dotfiles to each host, to customize your own personal environment there, now would be an opportune time to do that, since you're about to start running commands on the hosts. If you don't, you can safely skip this step. Josh used to copy the files in his ~/.cfhome directory, like so:

for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/.cfhome/ $login: && echo $login ; done & done

Start OpenFlow controllers

If your OpenFlow controllers aren't already running, start them up before continuing. http://groups.geni.net/syseng/wiki/POX has more information about how to do this, if you want to use POX. Here are the essentials:

mkdir -p ~/pox

port=33101 ; ~/bin/python ~/src/pox/pox.py py openflow.of_01 --port=${port} misc.full_payload geni_l2_learning samples.pretty_log log.level --WARNING log --*TimedRotatingFile=filename=$HOME/pox/pox-${port}.log,when=D,backupCount=2 --no-default geni_requests

from geni_requests import GENIOFRequestHandler
req = GENIOFRequestHandler()

req.print_dpids()

That will start POX listening on port 33101; you'll need to repeat that for each port, in a different window. One way that works well is to do this under 'screen' on your OF controller host.

Test connectivity

Copy in connectivity test files:

for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/*/reachability/addrs-$slicename.conf $login:pingtest.conf && echo $login ; done & done

Log in to one host in each slice, and test connectivity:

fping -q -c 10 < pingtest.conf |& grep -v "ICMP Host Unreachable"

If anything isn't reachable, debug why not.

Set up Zoing

Copy in Zoing stuff:

shmux -c 'mkdir -p bin' $logins
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/zoing/zoing $login:bin/zoing && echo $login ; done & done
shmux -c 'sudo mv bin/zoing /usr/local/bin/zoing' $logins
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/zoing/zoingrc-$login $login:.zoingrc && echo $login ; done & done

Copy in traffic-shaping stuff:

for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/tc-shape-eth1-ten-mbps $login:tc-shape-eth1-ten-mbps && echo $login ; done & done
shmux -c 'sudo chown root:root tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo mv tc-shape-eth1-ten-mbps /etc/init.d/tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo ln -s ../init.d/tc-shape-eth1-ten-mbps /etc/rc2.d/S99tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo service tc-shape-eth1-ten-mbps start' $logins

Fire up Zoing:

shmux -c "zoing activate" $logins

Final prep work

Create a directory for logs, and copy other files into it:

subdir=<a subdirectory>

mkdir -p ~/tmp/plastic-slices/$subdir/logs

cp ~/slices/plastic-slices/config/*json ~/tmp/plastic-slices/$subdir
rsync -avC ~/slices/plastic-slices/hosts/ ~/tmp/plastic-slices/$subdir/00hosts
rsync -avC ~/slices/plastic-slices/logins/ ~/tmp/plastic-slices/$subdir/00logins
rsync -avC ~/slices/plastic-slices/ssh_config/ ~/tmp/plastic-slices/$subdir/00ssh_config

Create a wiki page for this run: http://groups.geni.net/geni/wiki/PlasticSlices/Continuation has sub-pages for the various runs, so one good way to do this is:

  • Create a new sub-page for this run.
  • Copy the text from the sub-page for the previous run, from the start of the page, up to and including the "Everything below this point ..." line.
  • Edit that text to refer to this run.
  • Copy in the wiki-source.txt file that Tarvek generated earlier, after the "Everything below this point ..." line.

Send mail to gpo-tech letting folks know.

To do

Here are some random things I've jotted down that I'd like to do:

  • Add a way to positively confirm that slivers *don't* exist
  • Add a way to show more concise sliver status -- not four+ lines per sliver
  • Add a way to supply a parameter to test against, like "this date" for expiry
  • Add a way to save all omni output in files, so I can look up what happened if something goes wrong
  • Maybe use vxargs to parallelize omni for some things? Sliver deletion takes freakin' forever. Or just a loop, do ten slices in parallel, although this won't help for single big slices. Maybe parallelize across one slice would be better, so it hits all the aggregates once, then again, etc.

Some of those would end up on my "slice notes" sandbox page, but they affect Plastic Slices the most (because of its scale), so they're here for now. Or I might add them to Tarvek; we'll see.

Fetch logs

I run all this stuff on anubis.

Pull them into a subdirectory of my temp log processing directory:

subdir=<a subdirectory>

mkdir -p ~/tmp/plastic-slices/$subdir/logs

logins grep -- -a
shmux -c "sed -i -e '/nanosleep failed:/d' zoing-logs/zoing*log" $logins
logins cat
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a $login:zoing-logs/ ~/tmp/plastic-slices/$subdir/logs/$login && echo $login ; done & done

Remove the last day's PNG files and the whole-run ("all") PNG files, to make sure we regenerate them:

lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')
rm ~/tmp/plastic-slices/$subdir/pngs/*/*/*all*png ~/tmp/plastic-slices/$subdir/pngs/*/*/*daily-$lastday*png
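
To see what $lastday ends up holding: the ls output sorts zoing-all.png ahead of the daily files, so tail -1 picks the newest daily graph, and sed strips everything but the date. A sketch with a scratch directory and made-up file names (the date format in the names is an assumption):

```shell
# Scratch stand-in for one host's PNG directory.
pngdir=$(mktemp -d)
touch "$pngdir/zoing-all.png" "$pngdir/zoing-daily-20130514.png" "$pngdir/zoing-daily-20130515.png"

lastday=$(ls -1 "$pngdir" | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')
echo "$lastday"
rm -rf "$pngdir"
```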

Plot graphs:

firstlog=$(find ~/tmp/plastic-slices/$subdir/logs/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | head -1)
lastlog=$(find ~/tmp/plastic-slices/$subdir/logs/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | tail -1)
time python ~/tarvek/generate-graphs.py --progress --mainconfig=$HOME/tmp/plastic-slices/$subdir/slices.json --pairmap=$HOME/tmp/plastic-slices/$subdir/pairmap.json --rootdir=$HOME/tmp/plastic-slices/$subdir --starttime=$firstlog --endtime=$lastlog
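
The firstlog and lastlog values are just the earliest and latest timestamps found in the "-b" hosts' log file names. Here's a sketch on a scratch log tree, with made-up host names and timestamps in the same YYYYMMDD.HH style that the older log-scanning commands on this page use:

```shell
# Scratch log tree with invented host and log names.
logroot=$(mktemp -d)
mkdir -p "$logroot/bbn-ig-ps103-b" "$logroot/bbn-ig-ps104-b"
touch "$logroot/bbn-ig-ps103-b/zoing-20130514.13.log"
touch "$logroot/bbn-ig-ps104-b/zoing-20130515.02.log"
touch "$logroot/bbn-ig-ps104-b/zoing-20130514.22.log"

# Strip each path down to its timestamp, sort, and take the two ends.
firstlog=$(find "$logroot"/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | head -1)
lastlog=$(find "$logroot"/*-b -name '*log' -print | sed -e 's/.*zoing-\(.*\).log/\1/' | sort | tail -1)
echo "first=$firstlog last=$lastlog"
rm -rf "$logroot"
```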

Push everything up to the webserver:

chgrp -R gpo ~/tmp/plastic-slices/$subdir
rsync -av ~/tmp/plastic-slices/$subdir www.gpolab.bbn.com:/srv/www/plastic-slices/continuation

Checking in

On my laptop, copy down the graphs:

subdir=<a subdirectory>

rsync -av --delete --delete-excluded anubis:tmp/plastic-slices/$subdir/pngs ~/tmp/plastic-slices/$subdir

Identify the last day we have graphs for:

lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')

Show the per-slice graphs of the most recent day:

gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-daily-$lastday.png

Show the per-host daily graphs for the most recent day:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily-$lastday.png

Show the per-slice graphs of the whole run:

gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-all.png

Show the per-host graphs of the whole run:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-all.png

Show the per-host daily graphs for all of the days:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily*.png

The old way

This is how I used to check in, using grep to scan log files; nowadays I'm using the graphs.

Get a quick summary of the current state of things (based on the last completed run; or change $timestamp to get a different run):

timestamp=$(date -d "now - 1 hour" +%Y%m%d.%H)

for subnet in {103..106}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==1' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n "$login to "
    grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }'
    grep /sec logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done

for subnet in {107..110}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==0' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n $(grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }')
    echo " to $login"
    egrep " 0.0-[^ ].+/sec" logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done
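
The two loops differ mainly in which lines of the logins file they use: the first takes the odd-numbered lines, the second the even-numbered ones. A sketch of that selection on a made-up logins file; whether odd lines are always the "-a" hosts depends on how the real files are ordered, which is an assumption here.

```shell
# Invented logins file, alternating -a and -b hosts.
loginfile=$(mktemp)
printf '%s\n' bbn-ig-ps103-a bbn-ig-ps103-b fiu-eg-ps103-a fiu-eg-ps103-b > "$loginfile"

odd=$(awk 'NR%2==1' "$loginfile")    # lines 1, 3, ...
even=$(awk 'NR%2==0' "$loginfile")   # lines 2, 4, ...
echo "odd:" $odd
echo "even:" $even
rm -f "$loginfile"
```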

Use NOX

Run NOX for plastic-101, with the learning switch ('switch') module and LAVI:

subnet=101
port=33$subnet ; (cd /usr/bin && /usr/bin/nox_core --info=/home/jbs/nox/nox-${port}.info -i ptcp:$port switch lavi_switches jsonmessenger=tcpport=11$subnet,sslport=0)

In another window, ask the plastic-101 NOX (via LAVI) what datapaths are connected:

subnet=101 ; nox-console -n localhost -p 11$subnet getnodes