Changes between Version 27 and Version 28 of LAMP/Tutorial


Ignore:
Timestamp:
09/22/10 02:27:55
Author:
fernande@cis.udel.edu
[[Image(psb_bwctl.png)]]

Ah! This looks familiar. We can see our test and its parameters, and we also see a one-week summary of the bandwidth for our test. We have two graph options, the Line Graph and the Scatter Graph. Let's see both (1 Day, line and scatter respectively).

[[Image(bwctl-line.png)]]

[[Image(bwctl-scatter.png)]]

That looks nice. It seems ProtoGENI allocated a 100 Mb/s link for our slice. Let's confirm this:

{{{
# ethtool eth24
Settings for eth24:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: 100Mb/s
...
}}}
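If you want to script this check across several nodes, the negotiated speed can be pulled out of the ethtool output with a one-liner. This is just a sketch of ours, not part of the LAMP tools; the sample output is inlined with a here-document so the snippet is self-contained, but on a live node you would pipe `ethtool eth24` into the awk instead.

```shell
# Extract the "Speed:" field from ethtool-style output. The here-doc
# below stands in for a real `ethtool eth24` run on the node.
speed=$(awk -F': ' '/Speed:/ { print $2 }' <<'EOF'
Settings for eth24:
        Supported ports: [ TP MII ]
        Speed: 100Mb/s
EOF
)
echo "negotiated speed: $speed"
```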

Yes, it seems we're measuring our link throughput pretty accurately. Let's move on to the one-way latency data.

==== Visualizing One-way Latency Data ====

We go back to the ''Registered Services'' page and ''Query'' the PSB_OWAMP service on node1.

[[Image(psb_owamp.png)]]

Oops! We have found another bug :). perfSONARBUOY seems to export data only for the one-way delay tests on the loopback interface. All these bugs should be fixed in RC2 and certainly by the final release (expected around October 10).

However, even running a one-way latency test manually shows a couple of problems in our slice.

{{{
# owping node2
Approximately 13.1 seconds until results available

--- owping statistics from [node1-link1]:59783 to [node2]:59781 ---
..
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 8.02/9.1/9.29 ms, (err=4.91 ms)
one-way jitter = 0.1 ms (P95-P50)
...
--- owping statistics from [node2]:45501 to [node1-link1]:33482 ---
...
one-way delay min/median/max = -7.88/-7.8/2.1 ms, (err=4.91 ms)
one-way jitter = 1.1 ms (P95-P50)
}}}
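As an aside, the jitter figure owping prints is simply the spread between the 95th and 50th percentile one-way delays. A minimal sketch of that computation; the delay values below are made up for illustration, not taken from the run above.

```shell
# Nearest-rank percentile jitter, in the spirit of owping's "P95-P50"
# figure. The sample delays (ms) are invented for illustration.
delays="8.02 8.5 8.9 9.1 9.2 9.29"
jitter=$(printf '%s\n' $delays | sort -n | awk '
  { v[NR] = $1 }
  END {
    p50 = v[int((NR * 50 + 99) / 100)]  # ceil(NR*0.50) -> median
    p95 = v[int((NR * 95 + 99) / 100)]  # ceil(NR*0.95)
    printf "%.2f", p95 - p50
  }')
echo "one-way jitter = ${jitter} ms (P95-P50)"
```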

Ouch, a 5ms error and a max of almost 10ms? We have already seen through our Ping latency tests that the '''round-trip''' latency hovers around 2ms; these tests cannot be trusted! Analyzing our node a little, we can find one of the culprits:

{{{
# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ops.emulab.net  198.60.22.240    2 u   16   64   37    0.153   -4.779   2.687
... (other servers that have not peered) ...
}}}
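Since one-way measurements depend directly on clock synchronization, a slice could run a small sanity check like the following on each node. This is our own sketch, not part of the LAMP tools; the 1 ms threshold is an arbitrary assumption, and the ntpq output is inlined so the snippet is self-contained (a live check would pipe `ntpq -p` instead).

```shell
# Grab the offset column (9th field) of the selected peer (the line
# ntpq marks with '*'), then flag it if |offset| exceeds 1 ms.
# The here-doc stands in for a real `ntpq -p` run.
offset=$(awk '/^\*/ { print $9 }' <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ops.emulab.net  198.60.22.240    2 u   16   64   37    0.153   -4.779   2.687
EOF
)
status=$(awk -v o="$offset" 'BEGIN {
  if (o < 0) o = -o                      # absolute value
  print ((o > 1.0) ? "offset too large for one-way tests" : "clock ok")
}')
echo "NTP offset: ${offset} ms -> ${status}"
```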

We are offset by almost 5ms from our only NTP synchronization source! This will greatly affect precision measurements on our slice. Many factors can contribute to errors in this type of network measurement; only extensive testing on different slices and hardware will show whether they're appropriate for this environment.

Unfortunately, there's nothing more to see here; let's move on to the Host Monitoring data.


==== Visualizing Host Monitoring Data ====

We have saved the best for last (or maybe you like networking like we do). There are two ways of accessing the host monitoring data that we've collected on our nodes. One is to query the SNMP MA, which exports the data in the perfSONAR format. The other is to go to the Ganglia Web interface on our ''host monitoring collector'' node (in this example it runs on the same node as the LAMP Portal). Let's first try the SNMP MA. We go to the now familiar ''Registered Services'' page (or click the Host Monitoring link on the sidebar, which takes us there), and ''Query'' the SNMP Service running on the lamp node.

[[Image(snmpma.png)]]

We have been greeted with a large, red "proceed at your own risk" warning :). The corresponding interface on the pS-Performance Toolkit only queried the network utilization (bytes/sec) eventType on the SNMP MA. We are extending this interface to query all of the host monitoring metrics collected by Ganglia. This is still an early prototype, but it should be functional (we are keen on receiving bug reports!). Let's pick a random metric, say the number of processes running on the CPU, and open its Flash Graph. Note that you can read a description of an eventType by resting the mouse on top of it.

The perfAdmin visualization tool above allowed us to verify that our data is indeed being exported by the SNMP MA using the perfSONAR schema and API. However, the Ganglia Web visualization tool shows all the host monitoring metrics collected, with a comprehensive and robust interface. Thus, specifically for host monitoring, we suggest this tool for visualizing the instrumentation on the slice. We can access Ganglia Web through the URL https://<collector node>/ganglia/.
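For scripted access, Ganglia's gmond daemon also publishes its raw metrics as XML on TCP port 8649, so something like `nc <collector node> 8649` dumps everything it knows. A minimal sketch of pulling one value out of that XML; the sample document is inlined so the snippet is self-contained, and the host/metric names are only illustrative.

```shell
# Extract the VAL attribute of one metric from gmond-style XML.
# The here-doc is a hand-written stand-in for `nc <collector> 8649`.
val=$(awk -F'"' '/<METRIC NAME="proc_run"/ { print $4 }' <<'EOF'
<HOST NAME="node2" IP="10.0.0.2">
<METRIC NAME="proc_run" VAL="1" TYPE="uint32" UNITS=" "/>
</HOST>
EOF
)
echo "proc_run on node2 = $val"
```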

[[Image(ganglia-web.png)]]

On the front page we have a summary of our "cluster", in our case the whole slice. We can select a node from the dropdown box to see all the metrics we're collecting on each node. Let's select ''node2''.

[[Image(ganglia-node2.png)]]

We can see clear periodic spikes in CPU load and network traffic. These spikes most likely correspond to our scheduled Throughput tests. Note that the network traffic graph shows only 5MB/s, even though we were getting 90Mb/s