Changes between Version 28 and Version 29 of JBSsandbox/PlasticSlices

Timestamp: 04/22/14 15:05:29 (10 years ago)
Author: Josh Smift
Comment: Put Starting before Ending, and added a section about checking using the graphs on the web.

JBSsandbox/PlasticSlices

-                      v28
+                      v29
 This list is intended to be complete, but if we've forgotten something, you may get an error when you try to use some of those tools. As a corollary, if you do get an error when you try to use one of those tools, check with someone else to see if it works for them, and look for ways in which your environment might be different (and if those differences aren't on this list, add them).
-= Ending and starting a run =
-This describes how to end one Plastic Slices run, and start the next.
+= Starting and ending a run =
+This describes how to start and end a Plastic Slices run.
 Note that the same person doesn't have to end one run and start the next; but it's much easier for the person who started a run to also end that run. Others in the gpo-infra group can log in to the various nodes, and manipulate the slices and slivers, but things like the Zoing config files, cron jobs, and log files, are all under the user who started them up.
-== Ending ==
-Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
-{{{
-svn update
-svn status
-}}}
-Set the list of slices:
-{{{
-slices=$(echo ps{103..110})
-}}}
-Fetch your user and slice credentials:
-{{{
-(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
-}}}
-Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):
-{{{
-logins cat
-shmux -c "zoing deactivate" $logins
-}}}
-Wait for the current run to finish, typically 56 minutes past the hour.
-Check that all sources are shut down ("-a" nodes):
-{{{
-logins grep -- -a
-shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
-}}}
-Reset everything, and make sure that everything is shut down:
-{{{
-logins cat
-shmux -c "zoing reset" $logins
-shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
-}}}
-[#Fetchlogs Fetch logs] one last time, and upload them to the webserver.
-Delete all of the slivers, to start the next run with a clean slate:
-{{{
-declare -A rspecs
-for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
-for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
-for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
-}}}
-Confirm that everything's gone:
-{{{
-for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
-}}}
-Update the wiki page for this run with any final details (e.g. when the run ended).
 == Starting ==
 …
 If you find anomalies, you'll probably need to go back to the original output files to figure out where they came from.
-This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.
+This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.)
 === Get login info ===
 …
 Send mail to gpo-tech letting folks know.
+== Ending ==
+Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
+{{{
+svn update
+svn status
+}}}
+Set the list of slices:
+{{{
+slices=$(echo ps{103..110})
+}}}
+Fetch your user and slice credentials:
+{{{
+(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
+}}}
+Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):
+{{{
+logins cat
+shmux -c "zoing deactivate" $logins
+}}}
+Wait for the current run to finish, typically 56 minutes past the hour.
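The wait can be scripted rather than watched. A minimal sketch, assuming runs really do finish by 56 minutes past the hour as stated above; `seconds_until_minute` is a hypothetical helper, not part of Zoing or the syseng repository:

```shell
# Hypothetical helper: print how many seconds remain until the clock next
# reads <target-minute>:00, given the current minute and second.
# Usage: seconds_until_minute TARGET_MIN NOW_MIN NOW_SEC
seconds_until_minute() (
  # Strip one leading zero so values like "08" are not parsed as octal.
  target=${1#0}; target=${target:-0}
  now_min=${2#0}; now_min=${now_min:-0}
  now_sec=${3#0}; now_sec=${now_sec:-0}
  now=$(( now_min * 60 + now_sec ))
  goal=$(( target * 60 ))
  # Wrap around the hour when the target minute has already passed.
  echo $(( (goal - now + 3600) % 3600 ))
)
```

To sleep until 56 past the hour, something like `sleep "$(seconds_until_minute 56 "$(date +%M)" "$(date +%S)")"` should work; the function body runs in a subshell so its variables don't leak into your session.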
+Check that all sources are shut down ("-a" nodes):
+{{{
+logins grep -- -a
+shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
+}}}
+Reset everything, and make sure that everything is shut down:
+{{{
+logins cat
+shmux -c "zoing reset" $logins
+shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
+}}}
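The `grep -v ... || true` idiom in the status checks prints only the nodes that are not yet fully shut down, so an empty result means all clear. A small sketch of how it behaves, using made-up status lines (the `+active +cron` line is an assumed example of a node that hasn't been reset, not captured Zoing output):

```shell
# Pass through only status lines that differ from the fully shut-down state.
# "|| true" makes the exit status 0 even when grep prints nothing, so an
# all-clear result is not reported as a command failure.
show_not_shut_down() {
  grep -v -- '-active -cron -running -processes' || true
}

# Made-up sample input: one node already shut down, one still active.
printf '%s\n' \
  '-active -cron -running -processes' \
  '+active +cron -running -processes' | show_not_shut_down
```

Only the second sample line is printed, because the first matches the shut-down pattern exactly. The `--` before the pattern is needed because the pattern itself starts with a dash.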
+[#Fetchlogs Fetch logs] one last time, and upload them to the webserver.
+Delete all of the slivers, to start the next run with a clean slate:
+{{{
+declare -A rspecs
+for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
+for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
+for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
+}}}
+Confirm that everything's gone:
+{{{
+for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
+}}}
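The confirmation loop treats any `sliverstatus` output mentioning `code 12` or `code 2` as "no sliver here" and reports everything else. The same test in isolation, as a sketch; `report_unexpected` is a hypothetical helper, and any sample input you feed it is your own invention rather than real omni output:

```shell
# Hypothetical helper: read sliverstatus output on stdin and print a warning
# unless it contains one of the codes the loop above accepts as "gone".
# Usage: omni ... sliverstatus SLICE 2>&1 | report_unexpected SLICE AM
report_unexpected() {
  grep -E -q -i '(code 12|code 2)' || echo "unexpected sliver in $1 at $2"
}
```

Input containing either code prints nothing; any other input prints the warning with the slice and aggregate names filled in. `grep -E` is used here in place of `egrep`, which newer GNU grep releases flag as obsolescent.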
+Update the wiki page for this run with any final details (e.g. when the run ended).
 = To do =
 …
 == Checking in ==
+From the wiki page for a run, browse to the directory with graphs and logs for that run. Look at:
+ * The page with a graph for traffic to each destination host for the whole run, organized by aggregate. This is a good way to identify aggregates that aren't working well in any slice, due to some aggregate-wide problem (like a connectivity issue).
+ * The page with a graph for traffic to each destination host for the whole run, organized by slice. This is a good way to identify slices that aren't working well at any aggregate, due to some slice-wide problem (like a controller issue).
+Some notes about the graphs:
+ * On the TCP graphs, large blocks of green with a flat top are good -- they show good throughput flowing. Jagged tops and gaps of white are a bad sign.
+ * On the UDP graphs, large blocks of white are good -- they show zero packet loss. Any red is a bad sign; red *below* zero indicates a log file with no data, which may be a bad sign, or may be a known issue.
+ * The graphs with many hosts on a single graph are pretty hard to read, but they can sometimes help you spot other things you might want to look at.
+Don't forget to reload the page after pushing new graphs.
+=== The old way ===
+This is how I used to check in: downloading the graphs to my laptop and viewing them with an image viewer ('gq' was an alias for one I liked).
 On my laptop, copy down the graphs:
 …
 }}}
-=== The old way ===
+=== The older way ===
 This is how I used to check in, using grep to scan log files; nowadays I'm using the graphs.