Ah! This looks familiar. We can see our test and its parameters, as well as a one-week summary of the bandwidth for our test. We have two graph options, a Line Graph and a Scatter Graph. Let's see both (1 Day, line and scatter respectively).

[[Image(bwctl-line.png)]]

[[Image(bwctl-scatter.png)]]

That looks nice. It seems ProtoGENI allocated a 100Mb link for our slice. Let's confirm:

{{{
# ethtool eth24
Settings for eth24:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes:  Not reported
	Advertised auto-negotiation: No
	Speed: 100Mb/s
...
}}}
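As a quick cross-check, recent Linux kernels also expose the negotiated link speed through sysfs. This is a sketch, not part of the original walkthrough; the interface name `eth24` is taken from the ethtool run above and may differ on your node:

```shell
# Hypothetical cross-check: sysfs reports the negotiated link speed in Mb/s.
# The interface name eth24 comes from the ethtool output above.
iface=eth24
if [ -r "/sys/class/net/$iface/speed" ]; then
    cat "/sys/class/net/$iface/speed"    # a 100Mb/s link prints 100
else
    echo "interface $iface not present"
fi
```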

Yes, it seems we're measuring our link throughput quite accurately. Let's move on to the one-way latency data.

==== Visualizing One-way Latency Data ====

We go back to the ''Registered Services'' page and ''Query'' the PSB_OWAMP service on node1.

[[Image(psb_owamp.png)]]

Oops! We have found another bug :). perfSONARBUOY seems to be exporting data only for the one-way delay tests on the loopback interface. All these bugs should be fixed in RC2, and certainly by the final release (expected around October 10).

However, even running a one-way latency test manually shows a couple of problems in our slice.

{{{
# owping node2
Approximately 13.1 seconds until results available

--- owping statistics from [node1-link1]:59783 to [node2]:59781 ---
...
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 8.02/9.1/9.29 ms, (err=4.91 ms)
one-way jitter = 0.1 ms (P95-P50)
...
--- owping statistics from [node2]:45501 to [node1-link1]:33482 ---
...
one-way delay min/median/max = -7.88/-7.8/2.1 ms, (err=4.91 ms)
one-way jitter = 1.1 ms (P95-P50)
}}}

Ouch, a 5ms error and a maximum of almost 10ms? We have already seen from our Ping latency tests that the '''round-trip''' latency hovers around 2ms; these one-way numbers cannot be trusted! Analyzing our node a little, we can find one of the culprits:

{{{
# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ops.emulab.net  198.60.22.240    2 u   16   64   37    0.153   -4.779   2.687
... (other servers that have not peered) ...
}}}

We are offset by almost 5ms from our only NTP synchronization source! This will greatly affect precision measurements on our slice. Many factors can contribute to errors in this type of network measurement; only extensive testing on different slices and hardware will show whether they're appropriate for this environment.
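The owping numbers above are actually self-consistent with a clock-offset explanation. If we assume the forward and reverse paths are symmetric, the relative clock offset and the true one-way delay fall out of the two measured medians (9.1 ms and -7.8 ms from the run above); node1's NTP offset of roughly -4.8 ms would account for part of the result, with node2's own clock error making up the rest. A quick sketch of the arithmetic:

```shell
# Estimate clock offset from asymmetric one-way delays, assuming symmetric paths.
# fwd/rev are the median one-way delays (ms) from the owping run above.
fwd=9.1
rev=-7.8
awk -v f="$fwd" -v r="$rev" 'BEGIN {
    printf "clock offset ~ %.2f ms\n", (f - r) / 2   # skew between the two clocks
    printf "true one-way ~ %.2f ms\n", (f + r) / 2   # delay with the skew removed
    printf "implied rtt  ~ %.2f ms\n", f + r         # consistent with the ~2ms ping RTT
}'
```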

Unfortunately there is nothing more we can do here, so let's move on to the Host Monitoring data.


| 492 | ==== Visualizing Host Monitoring Data ==== |
| 493 | |
| 494 | We have saved the best for last (or maybe you like networking like we do). There are two ways of accessing the host monitoring data that we've collected on our nodes. One is by querying the SNMP MA that exports the data with the perfSONAR format. The other is to go to the Ganglia Web interface on our ''host monitoring collector'' node (in this example it runs on the same node as the LAMP Portal). Let's first try the SNMP MA. We go to the now familiar ''Registered Services'' page (or click on the Host Monitoring link on the side bar, which takes us there), and ''Query'' the SNMP Service running on the lamp node. |
| 495 | |
| 496 | [[Image(snmpma.png)]] |
| 497 | |
We have been greeted with a large, red "proceed at your own risk" warning :). The corresponding interface on the pS-Performance Toolkit only queried the network utilization (bytes/sec) eventType on the SNMP MA. We are extending this interface to query all of the host monitoring metrics collected by Ganglia. This is still an early prototype, but it should be functional (we are keen on receiving bug reports!). Let's pick a random metric, say the number of processes running on the CPU, and open its Flash Graph. Note that you can read a description of an eventType by resting the mouse on top of it.

The perfAdmin visualization tool above allowed us to verify that our data is indeed being exported by the SNMP MA using the perfSONAR schema and API. However, the Ganglia web visualization tool shows all the host monitoring metrics it collects through a comprehensive and robust interface. Thus, specifically for host monitoring, we suggest this tool for visualizing the instrumentation on the slice. We can access Ganglia Web through the URL https://<collector node>/ganglia/.
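The same data is also reachable without the web interface: by default, each gmond daemon answers TCP connections on port 8649 with an XML dump of the whole metric tree (on the slice that would be `nc <collector node> 8649`, assuming the default gmond configuration). As a sketch, here we parse a saved sample of that XML to list metric name/value pairs; the host and metric values shown are made up for illustration:

```shell
# gmond serves its full metric tree as XML on TCP 8649 by default.
# Live:  nc <collector node> 8649
# Here we parse a saved sample dump instead (values below are illustrative).
cat > /tmp/gmond-sample.xml <<'EOF'
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
  <CLUSTER NAME="lamp-slice" OWNER="unspecified">
    <HOST NAME="node2" IP="10.0.0.2">
      <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS=""/>
      <METRIC NAME="bytes_in" VAL="623415.2" TYPE="float" UNITS="bytes/sec"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
EOF
# Pull out each metric's name and value.
grep -o 'METRIC NAME="[^"]*" VAL="[^"]*"' /tmp/gmond-sample.xml
```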

[[Image(ganglia-web.png)]]

On the front page we have a summary of our "cluster", in our case the whole slice. We can select a node from the dropdown box to see all the metrics we're collecting on each node. Let's select ''node2''.

[[Image(ganglia-node2.png)]]

We can see clear periodic spikes in CPU load and network traffic. These spikes most likely correspond to our scheduled Throughput tests. Note that the network traffic graph shows only about 5MB/s, even though we were getting 90Mb/s in our throughput tests.
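Part of that gap is simply units: Ganglia graphs bytes per second, while throughput tools report bits per second, so the same traffic reads 8x smaller on the Ganglia graph; averaging over the idle gaps between scheduled tests pulls the plotted value down further. A quick conversion of the 90Mb/s figure above:

```shell
# Convert a bits-per-second throughput figure to the bytes-per-second
# scale that Ganglia graphs use (1 byte = 8 bits).
awk 'BEGIN { mbps = 90; printf "%.2f MB/s\n", mbps / 8 }'
```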