Opened 11 years ago
Closed 10 years ago
#176 closed (fixed)
Poor UDP performance in ExoGENI racks
Reported by: | lnevers@bbn.com | Owned by: | somebody
---|---|---|---
Priority: | minor | Milestone: |
Component: | AM | Version: | SPIRAL5
Keywords: | | Cc: |
Dependencies: | | |
Description
In a scenario where VMs are requested within one rack without specifying the link "capacity", the link is allocated as best effort. Testing found that VMs requested via the ExoSM consistently had half the throughput of VMs requested via the local SM when no capacity was specified. Here are throughput measurements collected in the GPO and RENCI racks:
==> Measurements collected in the RENCI rack - VM m1.small to VM m1.small
Results for sliver requested from ExoSM:
- 1 TCP client: 1.52 Gbits/sec
- 5 TCP clients: 1.84 Gbits/sec
- 10 TCP clients: 2.10 Gbits/sec
- 1 UDP client: Failed (requested 10 Mbits/sec, iperf server shows around 4 Mbits/sec)

Results for sliver requested from RENCI SM:
- 1 TCP client: 3.49 Gbits/sec
- 5 TCP clients: 4.89 Gbits/sec
- 10 TCP clients: 4.88 Gbits/sec
- 1 UDP client: 10.0 Mbits/sec; 101 Mbits/sec; 1.04 Gbits/sec
==> Measurements collected in GPO rack - VM m1.small to VM m1.small
Results for sliver requested from ExoSM:
- 1 TCP client: 2.77 Gbits/sec
- 5 TCP clients: 5.28 Gbits/sec
- 10 TCP clients: 5.39 Gbits/sec
- 1 UDP client: 3.87 Mbits/sec (requested 10 Mbits/sec)

Results for sliver requested from GPO SM:
- 1 TCP client: 6.92 Gbits/sec
- 5 TCP clients: 9.39 Gbits/sec
- 10 TCP clients: 9.39 Gbits/sec
- 1 UDP client: Failed (requested 10 Mbits/sec, iperf server shows around 4 Mbits/sec)
To avoid the lower performance seen in VMs allocated by the ExoSM, experimenters can specify a capacity for the link connecting the VMs within the rack. It was verified that setting the link capacity to 10 Gbps resulted in throughput around 5 Gbps.
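As a hedged illustration of this workaround, a request rspec can pin the link capacity explicitly rather than leaving it best effort. The client ids below are hypothetical, and the exact schema and units should be verified against the rspec version in use (GENI request rspecs conventionally express `capacity` in Kbps):

```xml
<!-- Hypothetical GENI v3 request rspec fragment: two in-rack VMs joined by
     a link with an explicit capacity. capacity is assumed to be in Kbps
     here (10000000 Kbps = 10 Gbps); check units against your schema. -->
<link client_id="lan0">
  <interface_ref client_id="vm0:if0"/>
  <interface_ref client_id="vm1:if0"/>
  <property source_id="vm0:if0" dest_id="vm1:if0" capacity="10000000"/>
  <property source_id="vm1:if0" dest_id="vm0:if0" capacity="10000000"/>
</link>
```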
Change History (12)
comment:1 Changed 11 years ago by
comment:2 Changed 10 years ago by
This ticket is partially addressed. Nodes in the same rack requested via ExoSM perform the same as nodes requested via the local SM for TCP only.
Re-ran test in GPO Rack, used the same rspec with 2 m1.large VM to create one sliver via ExoSM and one sliver via BBN SM. Ran the same iperf scenario on both slivers and here are the results:
Results for sliver requested from ExoSM:
- 1 TCP client: 6.70 Gbits/sec
- 5 TCP clients: 6.60 Gbits/sec
- 10 TCP clients: 6.59 Gbits/sec
- 1 UDP client: (see question below)

Results for sliver requested from GPO SM:
- 1 TCP client: 6.10 Gbits/sec
- 5 TCP clients: 6.40 Gbits/sec
- 10 TCP clients: 6.46 Gbits/sec
- 1 UDP client: (see question below)
Results are now consistent across the number of clients and across the SMs for TCP.
Question:
I am not able to get ANY UDP traffic exchanged between the nodes. I tried a range of requested rates, some as low as 1 Mbit/sec, and none succeeded. What is the expected throughput for UDP traffic between two nodes in the same rack with the default link capacity?
comment:3 follow-up: 4 Changed 10 years ago by
Found a rate that worked for UDP between two nodes in one rack: 683 bits/sec.
comment:4 Changed 10 years ago by
Replying to lnevers@bbn.com:
Found a rate that worked for UDP between two nodes in one rack: 683 bits/sec.
Also took a look at rack-to-rack UDP rates and found that a GPO-to-UFL UDP iperf can only reach around 3 Mbits/sec, with many packets arriving out of order:
```
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec  28.6 MBytes  4.00 Mbits/sec
[  3] Sent 20410 datagrams
[  3] Server Report:
[  3]  0.0-62.1 sec  23.3 MBytes  3.15 Mbits/sec  2.771 ms  3757/20409 (18%)
[  3]  0.0-62.1 sec  601 datagrams received out-of-order
```
Is this rate expected?
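For reference, the 18% figure in the server report is simply lost datagrams over sent datagrams; a quick shell check (numbers copied from the report above) reproduces it:

```shell
# Recompute iperf's UDP loss percentage from the server report:
# 3757 datagrams lost out of 20409 sent.
lost=3757
sent=20409
echo "$(( 100 * lost / sent ))%"   # integer percentage -> 18%
```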
comment:5 follow-up: 6 Changed 10 years ago by
I do not believe UDP issues are fixable until we enable legacy VLAN support across the switches. I blame some combination of our switch firmware and floodlight that we use.
comment:6 Changed 10 years ago by
Replying to ibaldin@renci.org:
I do not believe UDP issues are fixable until we enable legacy VLAN support across the switches. I blame some combination of our switch firmware and floodlight that we use.
Hmm, so experimenters have to live with 683 bits/sec between nodes within the same rack?
comment:7 follow-up: 8 Changed 10 years ago by
Experimenters have to live with the design decisions imposed on us.
comment:8 Changed 10 years ago by
Replying to ibaldin@renci.org:
Experimenters have to live with the design decisions imposed on us.
Ilia, I am only sending 1 Mbit/sec of UDP traffic and getting 683 bits/sec between the two hosts in the same rack. Is floodlight really having such a large negative impact?
comment:9 Changed 10 years ago by
I blame some combination of our switch firmware and floodlight that we use.
comment:10 Changed 10 years ago by
Summary: | VM requests from ExoSM within rack perform differently than VM requested from a site local SM → Poor UDP performance in ExoGENI racks
Expanding ticket definition to capture scenarios in which poor UDP performance has been observed. Also updating summary to capture the current scope of this ticket.
The ExoGENI sites GPO, UFL, and FIU show low UDP performance in the following scenarios:
Topology Description | Test | Throughput observed
---|---|---
VM to VM (1 rack) | EG-CT-1 | 700 Kbps - 3 Mbps
VM to Bare Metal (1 rack) | EG-CT-2 | 2 Mbps - 3 Mbps
Bare Metal to VM (1 rack) | EG-CT-2 | 2 Mbps - 3 Mbps
Bare Metal to Bare Metal (1 rack) | no test case | 3 Mbps
VM to remote VM (2 racks) | EG-CT-3 | 2 Mbps - 3 Mbps
comment:11 Changed 10 years ago by
These UDP results were collected with the image suggested in today's ExoGENI call. Image details:
- CentOS 6.3 v1.0.10:
- http://geni-images.renci.org/images/standard/centos/centos6.3-v1.0.10.xml, fde66f7d94557d30ebf00c86be2ff9581c9b951c
- uploaded to registry by pruth@renci.org on 2014-03-13 16:39:42.0.
Topology Description | Test | Throughput observed
---|---|---
FIU VM to FIU VM | EG-CT-1 | 2.91 Mbits/sec
FIU VM to FIU Bare Metal | EG-CT-2 | 2.93 Mbits/sec
FIU Bare Metal to FIU VM | EG-CT-2 | 2.91 Mbits/sec
GPO VM to FIU VM | EG-CT-3 | 3.02 Mbits/sec
FIU VM to GPO VM | EG-CT-3 | 3.09 Mbits/sec
How does this compare to the results you captured in ORCA performance testing?
comment:12 Changed 10 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
The ExoGENI sites GPO and UFL were used to re-collect UDP performance results for the following scenarios:
Topology Description | Test | Current Throughput | Throughput observed
---|---|---|---
VM to VM (1 rack) | EG-CT-1 | 810 Mbits/sec | 700 Kbps - 3 Mbps
VM to Bare Metal (1 rack) | EG-CT-2 | 810 Mbits/sec | 2 Mbps - 3 Mbps
Bare Metal to VM (1 rack) | EG-CT-2 | 735 Mbits/sec | 2 Mbps - 3 Mbps
VM to remote VM (2 racks) | EG-CT-3 | 801 Mbits/sec | 2 Mbps - 3 Mbps
The UDP performance problem is resolved. Closing ticket.
Ran some additional tests in the BBN rack earlier this week, after verifying that worker node usage was minimal. Re-ran the test with the suggested sliver_type m1.xlarge and saw similar behavior:
===> Measurements collected in GPO rack - VM m1.xlarge to VM m1.xlarge
There was one instance where the ExoSM and local SM nodes performed similarly, but in all other test runs (10) the nodes reserved from the local SM performed better.