Changes between Version 28 and Version 29 of JBSsandbox/PlasticSlices


Timestamp: 04/22/14 15:05:29
Author: Josh Smift
Comment: Put Starting before Ending, and added a section about checking using the graphs on the web.

  • JBSsandbox/PlasticSlices

    v28 v29  
    3131This list is intended to be complete, but if we've forgotten something, you may get an error when you try to use some of those tools. As a corollary, if you do get an error when you try to use some of those tools, check with someone else to see whether it works for them, and look for ways in which your environment might be different (and if those differences aren't on this list, add them).
    3232
    33 = Ending and starting a run =
    34 
    35 This describes how to end one Plastic Slices run, and start the next.
     33= Starting and ending a run =
     34
     35This describes how to start and end a Plastic Slices run.
    3636
    3737Note that the same person doesn't have to end one run and start the next; but it's much easier for the person who started a run to also end that run. Others in the gpo-infra group can log in to the various nodes, and manipulate the slices and slivers, but things like the Zoing config files, cron jobs, and log files, are all under the user who started them up.
    38 
    39 == Ending ==
    40 
    41 Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
    42 
    43 {{{
    44 svn update
    45 svn status
    46 }}}
    47 
    48 Set the list of slices:
    49 
    50 {{{
    51 slices=$(echo ps{103..110})
    52 }}}
    53 
    54 Fetch your user and slice credentials:
    55 
    56 {{{
    57 (cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
    58 }}}
    59 
    60 Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):
    61 
    62 {{{
    63 logins cat
    64 shmux -c "zoing deactivate" $logins
    65 }}}
    66 
    67 Wait for the current run to finish, typically 56 minutes past the hour.
    68 
    69 Check that all sources are shut down ("-a" nodes):
    70 
    71 {{{
    72 logins grep -- -a
    73 shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
    74 }}}
    75 
    76 Reset everything, and make sure that everything is shut down:
    77 
    78 {{{
    79 logins cat
    80 shmux -c "zoing reset" $logins
    81 shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
    82 }}}
    83 
    84 [#Fetchlogs Fetch logs] one last time, and upload them to the webserver.
    85 
    86 Delete all of the slivers, to start the next run with a clean slate:
    87 
    88 {{{
    89 declare -A rspecs
    90 for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
    91 for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
    92 for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
    93 }}}
    94 
    95 Confirm that everything's gone:
    96 
    97 {{{
    98 for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
    99 }}}
    100 
    101 Update the wiki page for this run with any final details (e.g. when the run ended).
    10238
    10339== Starting ==
     
    253189If you find anomalies, you'll probably need to go back to the original output files to figure out where they came from.
    254190
    255 This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.
     191This will often expose errors of the form "I don't have a sliver at this aggregate at all, for some reason". Fix any of those before continuing. (This is usually just a matter of trying again to create a sliver that failed for whatever reason.)
    256192
    257193=== Get login info ===
     
    397333Send mail to gpo-tech letting folks know.
    398334
     335== Ending ==
     336
     337Make sure your copy of the syseng Subversion repository is up to date and that you don't have uncommitted changes there. Change into your .../syseng directory, and run
     338
     339{{{
     340svn update
     341svn status
     342}}}
     343
     344Set the list of slices:
     345
     346{{{
     347slices=$(echo ps{103..110})
     348}}}
     349
     350Fetch your user and slice credentials:
     351
     352{{{
     353(cd ~/.gcf ; omni getusercred -o ; for slicename in $slices ; do omni getslicecred $slicename -o ; done)
     354}}}
     355
     356Deactivate Zoing (so it won't launch another set of experiments at the top of the hour):
     357
     358{{{
     359logins cat
     360shmux -c "zoing deactivate" $logins
     361}}}
     362
     363Wait for the current run to finish, typically 56 minutes past the hour.
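If you'd rather not watch the clock, the wait can be computed. The following is only a sketch: `strip_zero` and `seconds_until_minute` are hypothetical helper names, not part of Zoing, and it assumes the run really does finish at 56 minutes past the hour, which the text above only describes as typical.

```shell
# Drop a single leading zero ("08" -> "8") so shell arithmetic does not
# treat the value as octal. Hypothetical helper, not part of Zoing.
strip_zero() { case $1 in 0?) echo "${1#0}" ;; *) echo "$1" ;; esac; }

# Print the number of seconds until the next occurrence of a given
# minute past the hour.
# $1: target minute (e.g. 56), $2: current minute, $3: current second.
seconds_until_minute() {
  target=$(strip_zero "$1")
  now_min=$(strip_zero "$2")
  now_sec=$(strip_zero "$3")
  echo $(( ($target * 60 - ($now_min * 60 + $now_sec) + 3600) % 3600 ))
}

seconds_until_minute 56 30 15   # prints 1545 (25 minutes 45 seconds)
```

To actually wait, something like `sleep "$(seconds_until_minute 56 "$(date +%M)" "$(date +%S)")"` should work. Note that after :56 this waits for the next hour's :56; if Zoing is already deactivated by then, there is nothing left to wait for.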
     364
     365Check that all sources are shut down ("-a" nodes):
     366
     367{{{
     368logins grep -- -a
     369shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
     370}}}
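The filter in that command is worth a closer look. The `--` keeps grep from parsing the pattern (which starts with `-`) as options, and `|| true` is needed because `grep -v` exits non-zero when it prints nothing, which is the good case here. A minimal sketch of the behavior, using made-up status lines (the `+active +cron` line is an assumption about what a still-running node might print, not real `zoing status` output):

```shell
# Same filter that shmux runs on each node, wrapped in a hypothetical
# function so it can be tried locally; reads status text on stdin.
filter_status() {
  grep -v -- '-active -cron -running -processes' || true
}

# A fully shut-down node: the line is filtered out, nothing is printed,
# and the exit status is still 0 thanks to "|| true".
printf '%s\n' '-active -cron -running -processes' | filter_status

# Hypothetical line from a node that is still active: it passes through,
# so it shows up in the shmux output and stands out.
printf '%s\n' '+active +cron -running -processes' | filter_status
```

So an empty result from every node means everything is shut down, and any line that does appear points at a node that needs attention.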
     371
     372Reset everything, and make sure that everything is shut down:
     373
     374{{{
     375logins cat
     376shmux -c "zoing reset" $logins
     377shmux -c "zoing status | grep -v -- '-active -cron -running -processes' || true" $logins
     378}}}
     379
     380[#Fetchlogs Fetch logs] one last time, and upload them to the webserver.
     381
     382Delete all of the slivers, to start the next run with a clean slate:
     383
     384{{{
     385declare -A rspecs
     386for slicename in $slices ; do rspecs[$slicename]=$(ls -1 ~/rspecs/request/$slicename/*.rspec) ; done
     387for slicename in $slices ; do echo ${rspecs[$slicename]} ; done
     388for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am deletesliver $slicename & done ; sleep 30s ; done
     389}}}
     390
     391Confirm that everything's gone:
     392
     393{{{
     394for slicename in $slices ; do for rspec in ${rspecs[$slicename]} ; do somni $slicename $rspec ; omni --usercredfile=$HOME/.gcf/$USER-geni-usercred.xml --slicecredfile=$HOME/.gcf/$slicename-cred.xml -a $am sliverstatus $slicename |& egrep -q -i '(code 12|code 2)' || echo "unexpected sliver in $slicename at $am" & done ; sleep 5s ; done | grep unexpected | grep -v omni
     395}}}
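The check above treats a response containing "code 12" or "code 2" as "the sliver is gone", and complains about anything else. Here is a small sketch of just that logic, with `grep -E` swapped in for `egrep` (they behave the same here). The function name, the aggregate names, and the sample text are all hypothetical stand-ins, not real omni output:

```shell
# Read status text on stdin; stay quiet if it contains one of the
# "sliver is gone" codes, otherwise report the slice and aggregate.
# $1: slice name, $2: aggregate name.
report_unexpected() {
  grep -E -q -i '(code 12|code 2)' || echo "unexpected sliver in $1 at $2"
}

# Made-up text containing one of the expected codes: prints nothing.
printf 'made-up error text: code 12\n' | report_unexpected ps103 am-one

# Made-up text without either code: prints
# "unexpected sliver in ps103 at am-two".
printf 'made-up status text: ready\n' | report_unexpected ps103 am-two
```

One thing to keep in mind: the pattern is a substring match, so "code 2" would also match something like "code 20". If that ever matters, the pattern would need tightening.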
     396
     397Update the wiki page for this run with any final details (e.g. when the run ended).
     398
    399399= To do =
    400400
     
    450450== Checking in ==
    451451
     452From the wiki page for a run, browse to the directory with graphs and logs for that run. Look at:
     453
     454 * The page with a graph for traffic to each destination host for the whole run, organized by aggregate. This is a good way to identify aggregates that aren't working well in any slice, due to some aggregate-wide problem (like a connectivity issue).
     455
     456 * The page with a graph for traffic to each destination host for the whole run, organized by slice. This is a good way to identify slices that aren't working well at any aggregate, due to some slice-wide problem (like a controller issue).
     457
     458Some notes about the graphs:
     459
     460 * On the TCP graphs, large blocks of green with a flat top are good -- they show good throughput flowing. Jagged tops and gaps of white are a bad sign.
     461 * On the UDP graphs, large blocks of white are good -- they show zero packet loss. Any red is a bad sign; red *below* zero indicates a log file with no data, which may be a bad sign, or may be a known issue.
     462 * The graphs with many hosts on a single graph are pretty hard to read, but they can sometimes help you spot other things you might want to look at.
     463
     464Don't forget to reload the page after pushing new graphs.
     465
     466=== The old way ===
     467
     468This is how I used to check in, downloading graphs to my laptop to view with an image viewer ('gq' was an alias for one I liked).
     469
    452470On my laptop, copy down the graphs:
    453471
     
    494512}}}
    495513
    496 === The old way ===
     514=== The older way ===
    497515
    498516This is how I used to check in, using grep to scan log files; nowadays I'm using the graphs.