Version 6 (modified by Josh Smift, 10 years ago)

Added some notes about the createsliver loop.

Plastic Slices sandbox page

Random notes for Plastic Slices stuff.

A lot of things that used to be here are now on my general "slice notes" sandbox page. What's left should in theory be pretty specific to Plastic Slices.

Environment

The tools we use to wrangle Plastic Slices have a variety of requirements for your environment on the system where you want to use them:

  • You should have an up-to-date copy of the syseng Subversion repository.
  • ~/rspecs should be a symlink to .../syseng/geni/share/rspecs.
  • ~/slices/plastic-slices should be a symlink to .../syseng/geni/share/experiment-setup/plastic-slices.
  • ~/bin/omni and ~/bin/readyToLogin should be copies of (or symlinks to) the current GCF release.
  • ~/bin/shmux should be a copy of the 'shmux' executable.
  • Your ~/.ssh/config file should include "StrictHostKeyChecking no". (FIXME: It'd be better if this were in the ~/.ssh/config section for each host, instead of being a global requirement.)
  • ~/.gcf should be your Omni/GCF directory, and you should not mind if cached user and slice credentials are stored there.
  • Your default project in your Omni config file should be 'gpo-infra'.
  • Your 'users' list in your Omni config file should include the gpo-infra users.
  • You should run the various commands all in one shell, because some of the later steps assume that you've run the commands in some of the previous steps. You can run some things in other windows if you know what you're doing, but if you're wrong, things won't work as you expect.
  • You should have the following shell functions or aliases (e.g. in your .bashrc):
    alias shmux='shmux -Sall -m -B -M 20'
    logins () { for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; $* ~/slices/*/logins/logins-$slicename.txt >| $loginfile ; done ; logins=$(for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; cat $loginfile ; done) ; }
    somni () { slicename=$1 ; rspec=$2 ; am=$(grep AM: $rspec | sed -e 's/^AM: //') ; }
    

This list is intended to be complete, but if we've forgotten something, you may get an error when you try to use some of those tools. If you do, check with someone else to see if the same thing works for them, and look for ways in which your environment might be different (and if you find a difference that's not on this list, add it).
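
A quick preflight check can catch some of these before you start. This is only a sketch: check_plastic_env is a hypothetical helper name, not part of the existing tools, and it only tests the items that can be checked from the filesystem (it doesn't look at your Omni config at all).

```shell
# Hypothetical helper (not part of the existing tools): report which of the
# filesystem-level requirements listed above are missing under a home directory.
check_plastic_env () {
  local home=${1:-$HOME} missing=0 link tool
  # Symlinks into the syseng checkout.
  for link in rspecs slices/plastic-slices ; do
    [ -L "$home/$link" ] || { echo "missing symlink: ~/$link" ; missing=1 ; }
  done
  # Tools expected in ~/bin.
  for tool in omni readyToLogin shmux ; do
    [ -x "$home/bin/$tool" ] || { echo "missing executable: ~/bin/$tool" ; missing=1 ; }
  done
  # Omni/GCF directory for cached credentials.
  [ -d "$home/.gcf" ] || { echo "missing directory: ~/.gcf" ; missing=1 ; }
  # StrictHostKeyChecking setting in the ssh config.
  grep -qi 'StrictHostKeyChecking[ =]*no' "$home/.ssh/config" 2>/dev/null \
    || { echo "no 'StrictHostKeyChecking no' in ~/.ssh/config" ; missing=1 ; }
  return $missing
}
```

Running check_plastic_env with no argument checks your own home directory; it prints one line per missing item and returns non-zero if anything is missing.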

Ending and starting a run

This is how I end one Plastic Slices run, and start the next. These commands use techniques from my "slice notes" sandbox page, so before doing all this, I should double-check that this copy of those techniques is still accurate.

Ending

Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run

svn update
svn status

Set the list of slices:

slices=$(echo ps{103..110})

Fetch my user and slice credentials:

(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)

Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):

logins cat
shmux -c "zoing deactivate" $logins

Wait for the current run to finish, typically 56 minutes past the hour.

Check that all sources are shut down ("-a" nodes):

logins grep -- -a
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
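
A note on the "|| true" in that command: grep -v exits with status 1 when it prints nothing, which is exactly the healthy case here, and presumably shmux would otherwise report that as a failed command. Here's a small sketch of the idiom, using a stand-in fake_status function (hypothetical; the real "zoing status" output may contain more than this one line):

```shell
# Stand-in for "zoing status" on a node where everything is shut down.
# (Hypothetical: the real command's output may differ in detail.)
fake_status () { echo "-active -cron -running -processes" ; }

# Print only lines that differ from the all-stopped status line;
# "|| true" keeps the exit status at 0 even when grep prints nothing.
check_stopped () { "$@" | grep -v -- '-active -cron -running -processes' || true ; }

check_stopped fake_status                                # prints nothing, exit status 0
check_stopped echo "+active -cron -running +processes"   # prints the unexpected line
```

So with this command, any output at all means a node that isn't fully shut down.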

Reset everything, and make sure that everything is shut down:

logins cat
shmux -c "zoing reset" $logins
shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins

Fetch logs one last time, and upload them to the webserver.

Delete all of the slivers, to start the next run with a clean slate:

declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
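
For reference, the somni function in that loop just sets $slicename, $rspec, and $am, pulling the aggregate out of an "AM:" line in the rspec file. A minimal sketch of what it does, using a made-up rspec (the URL is an invented example, and I'm assuming the "AM:" line starts at the beginning of a line, as the sed pattern requires):

```shell
# Same definition as in the Environment section.
somni () { slicename=$1 ; rspec=$2 ; am=$(grep AM: $rspec | sed -e 's/^AM: //') ; }

# Made-up rspec file with an "AM:" line inside a comment (illustration only).
samplerspec=$(mktemp)
printf '%s\n' '<!--' 'AM: https://am.example.net:12346/' '-->' '<rspec/>' > "$samplerspec"

somni ps103 "$samplerspec"
echo "$slicename $am"    # → ps103 https://am.example.net:12346/
```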

Confirm that everything's gone:

for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
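
That check treats a sliverstatus reply containing "code 12" or "code 2" as "no sliver here", and flags anything else. Here's a minimal sketch of the same logic, with a stand-in fake_sliverstatus function in place of omni (the sample messages are invented for illustration, not real omni output):

```shell
# Stand-in for "omni ... sliverstatus"; the messages are made up.
fake_sliverstatus () {
  case $1 in
    gone) echo "Error: code 12: no such slice here" >&2 ;;
    *)    echo "geni_status: ready" ;;
  esac
}

# Echo a warning unless the reply (stdout and stderr) mentions code 12 or code 2.
check_gone () {
  local slicename=$1 am=$2 kind=$3
  fake_sliverstatus "$kind" 2>&1 | grep -E -q -i '(code 12|code 2)' \
    || echo "unexpected sliver in $slicename at $am"
}

check_gone ps103 am-one gone      # prints nothing
check_gone ps103 am-two present   # prints "unexpected sliver in ps103 at am-two"
```

One thing to keep in mind: the pattern "code 2" also matches longer numbers that start with 2 (e.g. "code 20"), so this is a loose check.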

Update the wiki page for this run with any final details (e.g. when the run ended).

Starting

Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run

svn update
svn status

Update ~/slices/plastic-slices/config/slices.json with any changes for this run. Likely changes to think about include:

  • Adding or removing aggregates.
  • Changing which aggregates are in which slices.
  • Changing openflow_controller to point to your personal controller.
  • Changing rspec_template_root to point to the directory where you personally have the rspec templates.

Update ~/slices/plastic-slices/config/pairmap.json with any changes for this run. At this point, we're maintaining the file by hand, so that we can preserve specific pairs from run to run. The pairs we're preserving are:

source              destination             TCP    UDP
bbn-exogeni         max-instageni           ps103  ps108
clemson-instageni   wisconsin-instageni     ps105  ps110
fiu-exogeni         bbn-exogeni             ps104  ps107
fiu-exogeni         bbn-instageni           ps103  ps108
gatech-instageni    northwestern-instageni  ps106  ps107
kansas-instageni    northwestern-instageni  ps105  ps108
nyu-instageni       utahddc-instageni       ps106  ps109
sox-instageni       illinois-instageni      ps104  ps109
stanford-instageni  bbn-instageni           ps106  ps109

If you add a new aggregate, make sure not to break up those pairs.

If for some reason you want to generate a new random pairmap, the Tarvek 00README file has docs for how to do that.

Generate the rest of the configuration:

cd ~/slices/plastic-slices
python ~/tarvek/generate-experiment-config.py ./config/slices.json ./config/pairmap.json ./wiki-source.txt
svn rm $(svn st | grep '^!' | awk '{ print $2; }')
svn add $(svn st | grep '^?' | awk '{ print $2; }')

Review to make sure that things look right, then commit that to Subversion.

Set the list of slices:

slices=$(echo ps{103..110})

Fetch my user and slice credentials:

(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)

Set up variables to create the slivers:

declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
for slicename in $slices ; do echo ${rspecs[$slicename]} ; done | wc

The last two echo lines are a good place to sanity-check that things are as you expect: the first should list an rspec for every sliver you expect to create, and the second should show counts of them. In the wc output, the line count should equal the number of slices (one line per slice), and the word count is the number of rspecs, probably a few hundred, although the exact number will depend on how many aggregates you have in each slice.
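
If the total looks off, a per-slice count can help narrow down where. Here's a sketch of that, built on a scratch directory standing in for ~/rspecs/request so it runs anywhere; count_rspecs and the am-*.rspec file names are made up for illustration. It sets $slices and $rspecs, so try it in a scratch shell, not the one you're doing the run in.

```shell
# Scratch stand-in for ~/rspecs/request, with two slices.
root=$(mktemp -d)
slices=$(echo ps{103..104})
for slicename in $slices ; do
  mkdir -p "$root/$slicename"
  touch "$root/$slicename/am-one.rspec" "$root/$slicename/am-two.rspec"
done
touch "$root/ps104/am-three.rspec"

# Same slice -> rspec-list map as above (needs bash 4 or later for declare -A).
declare -A rspecs
for slicename in $slices ; do rspecs[$slicename]=$(ls -1 "$root/$slicename"/*.rspec) ; done

# One line per slice: the slice name and how many rspecs it has.
count_rspecs () {
  for slicename in $slices ; do
    echo "$slicename $(echo ${rspecs[$slicename]} | wc -w | tr -d ' ')"
  done
}
count_rspecs
```

With the sample files above, that prints "ps103 2" and "ps104 3".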

Actually create the slivers:

for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am createsliver $slicename $rspec & done ; sleep 5m ; done

Some notes about that:

  • The ampersand fires off a createsliver for every aggregate in the slice and runs them all in parallel in the background; the sleep 5m at the end then waits five minutes before starting the next slice, to avoid swamping any aggregates with too many requests at once. That 5m seems to work well for not crashing FV and not overloading InstaGENI, but it could potentially be cranked down if both of those improve.
  • This doesn't capture output at all. We could potentially add something to stuff the output into one giant file, but it might be a little hard to sort out, since output is coming back from all of the slivers at once, all intermingled together. We could have each createsliver write an output file, but we'd need to be careful to name them and save them so that the output file from an aggregate in one slice wouldn't overwrite the output from the same aggregate in another slice. For now, we just check later to see what worked and what didn't, and try again by hand if it's not obvious why some things didn't work.
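
One way to name per-createsliver output files so that they can't overwrite each other is to include both the slice name and a sanitized form of the aggregate in the file name. This is only a sketch of that idea: omni_logfile and the ~/tmp/omni-logs default are hypothetical, not something the existing tools use, and the URL is an invented example.

```shell
# Hypothetical helper: build a unique log file name from slice + aggregate.
# Any character outside [A-Za-z0-9._-] in the aggregate is replaced with "_".
omni_logfile () {
  local slicename=$1 am=$2 logdir=${3:-$HOME/tmp/omni-logs}
  echo "$logdir/$slicename-$(printf '%s' "$am" | tr -c 'A-Za-z0-9._-' '_').log"
}

omni_logfile ps103 https://am.example.net:12346/ /tmp/omni-logs
# → /tmp/omni-logs/ps103-https___am.example.net_12346_.log
```

Each backgrounded omni command in the loop could then redirect its stdout and stderr to $(omni_logfile $slicename $am), after creating the log directory once up front.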

Using my general slice notes, confirm that all of the slivers' expiration dates are as expected, and renew any that aren't.
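
Once you have an expiration timestamp in hand, comparing it against the date you want can be scripted. Here's a sketch, assuming GNU date (for the -d option, which this page already relies on elsewhere); expires_on_or_after is a hypothetical helper, and the timestamps are sample values, not real sliver data.

```shell
# Succeeds if the expiration timestamp is on or after the wanted date.
# Relies on GNU date's -d option to parse both arguments (as UTC).
expires_on_or_after () {
  local expires=$1 wanted=$2
  [ "$(date -u -d "$expires" +%s)" -ge "$(date -u -d "$wanted" +%s)" ]
}

expires_on_or_after "2013-06-01 00:00:00 UTC" "2013-05-15" && echo ok
expires_on_or_after "2013-05-01 00:00:00 UTC" "2013-05-15" || echo "needs renewal"
```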

Using my general slice notes, get login info.

Using my general slice notes, do other login-related stuff.

Using my general slice notes, test connectivity. Trying "the fast way" from one node in each slice is probably good enough, but "the reliable way" will work too if you're not in a hurry.

Copy in Zoing stuff:

shmux -c 'mkdir -p bin' $logins
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; export PSSH_ERRDIR=~/tmp/prsync-errors/$slicename ; prsync -h $loginfile -a ~/slices/plastic-slices/zoing/zoing bin/zoing ; done
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a ~/slices/plastic-slices/zoing/zoingrc-$login $login:.zoingrc && echo $login ; done & done

Copy in traffic-shaping stuff:

for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; export PSSH_ERRDIR=~/tmp/prsync-errors/$slicename ; prsync -h $loginfile -a ~/slices/plastic-slices/tc-shape-eth1-ten-mbps tc-shape-eth1-ten-mbps ; done
shmux -c 'sudo chown root:root tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo mv tc-shape-eth1-ten-mbps /etc/init.d/tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo ln -s ../init.d/tc-shape-eth1-ten-mbps /etc/rc2.d/S99tc-shape-eth1-ten-mbps' $logins
shmux -c 'sudo service tc-shape-eth1-ten-mbps start' $logins

Fire up Zoing:

shmux -c "zoing activate" $logins

Create a directory for logs, and copy other files into it:

subdir=<a subdirectory>

mkdir -p ~/tmp/plastic-slices/$subdir/logs

cp ~/slices/plastic-slices/config/*json ~/tmp/plastic-slices/$subdir
rscpc ~/slices/plastic-slices/hosts/ ~/tmp/plastic-slices/$subdir/00hosts
rscpc ~/slices/plastic-slices/logins/ ~/tmp/plastic-slices/$subdir/00logins
rscpc ~/slices/plastic-slices/ssh_config/ ~/tmp/plastic-slices/$subdir/00ssh_config

Create the wiki page.

Send mail to gpo-tech letting folks know.

To do

Here are some random things I've jotted down that I'd like to do:

  • Add a way to positively confirm that slivers *don't* exist
  • Add a way to show more concise sliver status -- not four+ lines per sliver
  • Add a way to supply a parameter to test against, like "this date" for expiry
  • Add a way to save all omni output in files, so I can look up what happened if something goes wrong
  • Maybe use vxargs to parallelize omni for some things? Sliver deletion takes freakin' forever. Or just a loop, do ten slices in parallel, although this won't help for single big slices. Maybe parallelizing across one slice would be better, so it hits all the aggregates once, then again, etc.

Some of those would end up on my "slice notes" sandbox page, but they affect Plastic Slices the most (because of its scale), so they're here for now. Or I might add them to Tarvek, we'll see.
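
As a rough idea of what the parallelization item could look like, here's a sketch that uses xargs -P (not vxargs, which I haven't tried here) to run a bounded number of jobs at once. fake_delete is a stand-in for one omni call, and the aggregate names are made up; -P is a GNU/BSD xargs extension, and export -f needs bash.

```shell
# Stand-in for one omni call; a real version would run omni here instead.
fake_delete () { echo "deleted $1" ; }
export -f fake_delete

# Run up to four jobs at a time (-P 4), one argument per invocation (-n 1).
delete_all () { printf '%s\n' "$@" | xargs -n 1 -P 4 bash -c 'fake_delete "$1"' _ ; }

delete_all am-one am-two am-three am-four
```

The output lines can come back in any order, since the jobs run concurrently.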

Fetch logs

I run all this stuff on anubis.

Pull them into a subdirectory of my temp log processing directory:

subdir=<a subdirectory>

mkdir -p ~/tmp/plastic-slices/$subdir/logs

logins grep -- -a
shmux -c "sed -i -e '/nanosleep failed:/d' zoing-logs/zoing*log" $logins
logins cat
for slicename in $slices ; do loginfile=~/tmp/logins-$slicename.txt ; for login in $(cat $loginfile) ; do rsync -a $login:zoing-logs/ ~/tmp/plastic-slices/$subdir/logs/$login && echo $login ; done & done

Remove the last day's PNG files and the whole-run ("all") PNG files, to make sure we regenerate them:

lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')
rm ~/tmp/plastic-slices/$subdir/pngs/*/*/*all*png ~/tmp/plastic-slices/$subdir/pngs/*/*/*daily-$lastday*png
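
To see what that lastday pipeline picks out, here's a sketch that runs the same ls/tail/sed steps against a scratch directory. lastday_of is a hypothetical wrapper, and the date format in the sample file names is an assumption for illustration.

```shell
# Same pipeline as above, wrapped so it can be pointed at any directory.
lastday_of () { ls -1 "$1" | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/' ; }

# Scratch directory with made-up PNG names.
pngdir=$(mktemp -d)
touch "$pngdir/zoing-all.png" "$pngdir/zoing-daily-20130101.png" "$pngdir/zoing-daily-20130102.png"

lastday_of "$pngdir"    # → 20130102
```

This works because zoing-all.png sorts before the zoing-daily files, so the last file listed is always the most recent daily graph.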

Plot graphs:

firstlog=$(ls -1 ~/tmp/plastic-slices/$subdir/logs/bbn-ig-ps104-b | head -1 | sed -e 's/zoing-\(.*\).log/\1/')
lastlog=$(ls -1 ~/tmp/plastic-slices/$subdir/logs/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-\(.*\).log/\1/')
time python ~/tarvek/generate-graphs.py --progress --mainconfig=$HOME/tmp/plastic-slices/$subdir/slices.json --pairmap=$HOME/tmp/plastic-slices/$subdir/pairmap.json --rootdir=$HOME/tmp/plastic-slices/$subdir --starttime=$firstlog --endtime=$lastlog

Push everything up to the webserver:

rsync -av ~/tmp/plastic-slices/$subdir www.gpolab.bbn.com:/srv/www/plastic-slices/continuation

Checking in

On my laptop, copy down the graphs:

subdir=<a subdirectory>

rscpd anubis:tmp/plastic-slices/$subdir/pngs ~/tmp/plastic-slices/$subdir

Identify the last day we have graphs for:

lastday=$(ls -1 ~/tmp/plastic-slices/$subdir/pngs/hosts/bbn-ig-ps104-b | tail -1 | sed -e 's/zoing-daily-\(.*\).png/\1/')

Show the per-slice graphs of the most recent day:

gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-daily-$lastday.png

Show the per-host daily graphs for the most recent day:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily-$lastday.png

Show the per-slice graphs of the whole run:

gq ~/tmp/plastic-slices/$subdir/pngs/slices/*/zoing-all.png

Show the per-host graphs of the whole run:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-all.png

Show the per-host daily graphs for all of the days:

gq ~/tmp/plastic-slices/$subdir/pngs/hosts/*-b/zoing-daily*.png

The old way

This is how I used to check in, using grep to scan log files; nowadays I'm using the graphs.

Get a quick summary of the current state of things (based on the last completed run; or change $timestamp to get a different run):

timestamp=$(date -d "now - 1 hour" +%Y%m%d.%H)

for subnet in {103..106}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==1' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n "$login to "
    grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }'
    grep /sec logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done

for subnet in {107..110}
do
  echo -e "--> plastic $subnet\n"
  for login in $(awk 'NR%2==0' ~/slices/plastic-slices/logins/logins-ps$subnet.txt)
  do
    echo -n $(grep "connected with" logs/$login/zoing-$timestamp*.log | awk '{ print $(NF-2); }')
    echo " to $login"
    egrep " 0.0-[^ ].+/sec" logs/$login/zoing-$timestamp*.log || echo no results
    echo ""
  done
done
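
Those two loops pick every other line out of each login file: awk 'NR%2==1' keeps the odd-numbered lines and awk 'NR%2==0' keeps the even-numbered ones. Here's a small sketch with a made-up login file; the host names are invented, and the alternating -a / -b layout is my assumption about how the login files are ordered.

```shell
# Made-up login file with alternating -a and -b hosts (an assumption).
sampleloginfile=$(mktemp)
printf '%s\n' site1-ps103-a site1-ps103-b site2-ps103-a site2-ps103-b > "$sampleloginfile"

awk 'NR%2==1' "$sampleloginfile"   # odd lines:  site1-ps103-a, site2-ps103-a
awk 'NR%2==0' "$sampleloginfile"   # even lines: site1-ps103-b, site2-ps103-b
```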

Use NOX

Run NOX for plastic-101, with the learning switch ('switch') module and LAVI:

subnet=101
port=33$subnet ; (cd /usr/bin && /usr/bin/nox_core --info=/home/jbs/nox/nox-${port}.info -i ptcp:$port switch lavi_switches jsonmessenger=tcpport=11$subnet,sslport=0)

In another window, ask the plastic-101 NOX (via LAVI) what datapaths are connected:

subnet=101 ; nox-console -n localhost -p 11$subnet getnodes