Opened 7 years ago

Closed 6 years ago

#176 closed (fixed)

Poor UDP performance in ExoGENI racks

Reported by: lnevers@bbn.com
Owned by: somebody
Priority: minor
Milestone: (none)
Component: AM
Version: SPIRAL5
Keywords: (none)
Cc: (none)
Dependencies: (none)

Description

In a scenario where VMs are requested within one rack without specifying the link "capacity", the link is allocated as best effort. Testing found that VMs requested via the ExoSM consistently had half the throughput of VMs requested via the local SM when no capacity is specified. Here are throughput measurements collected in the GPO and RENCI racks:

==> Measurements collected in the RENCI rack - VM m1.small to VM m1.small

Results for sliver requested from ExoSM:
    1 TCP client: 1.52 Gbits/sec
    5 TCP clients: 1.84 Gbits/sec
    10 TCP clients: 2.10 Gbits/sec
    1 UDP client: Failed (requested 10 Mbits/sec, iperf server shows around  4 Mbits/sec)

Results for sliver requested from RENCI SM:
    1 TCP client: 3.49 Gbits/sec
    5 TCP clients: 4.89 Gbits/sec
    10 TCP clients:  4.88 Gbits/sec
    1 UDP client: 10.0 Mbits/sec; 101 Mbits/sec; 1.04 Gbits/sec; 

==> Measurements collected in GPO rack - VM m1.small to VM m1.small

Results for sliver requested from ExoSM:
    1 TCP client: 2.77 Gbits/sec
    5 TCP clients: 5.28 Gbits/sec
    10 TCP clients: 5.39 Gbits/sec
    1 UDP client: 3.87 Mbits/sec  (requested 10 Mbits/sec)

Results for sliver requested from GPO SM:
    1 TCP client: 6.92 Gbits/sec
    5 TCP clients: 9.39 Gbits/sec
    10 TCP clients: 9.39 Gbits/sec
    1 UDP client: Failed (requested 10 Mbits/sec, iperf server shows around 4 Mbits/sec)

To avoid the lower performance seen in VMs allocated by the ExoSM, experimenters can specify a capacity for the link connecting the VMs within the rack. It was verified that setting the link capacity to 10 Gbps resulted in throughput around 5 Gbps.
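As a sketch of that workaround, a request RSpec can pin the link capacity explicitly. This fragment follows the GENI RSpec v3 conventions (capacity expressed in Kbps, so 10 Gbps = 10000000); the node and interface client_ids are hypothetical and the exact schema details should be checked against the GENI RSpec v3 documentation:

```xml
<link client_id="lan0">
  <interface_ref client_id="vm0:if0"/>
  <interface_ref client_id="vm1:if0"/>
  <!-- hypothetical interface ids; capacity is in Kbps (10000000 Kbps = 10 Gbps) -->
  <property source_id="vm0:if0" dest_id="vm1:if0" capacity="10000000"/>
  <property source_id="vm1:if0" dest_id="vm0:if0" capacity="10000000"/>
</link>
```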

Change History (12)

comment:1 Changed 7 years ago by lnevers@bbn.com

Ran some additional tests in the BBN rack earlier this week, after verifying that worker node usage was minimal. Re-ran the test with the suggested sliver_type m1.xlarge and saw similar behavior:

===> Measurements collected in GPO rack - VM m1.xlarge to VM m1.xlarge

Results for sliver requested from ExoSM:
      1 TCP client:  3.99 Gbits/sec
      5 TCP clients: 5.42 Gbits/sec
     10 TCP clients: 5.69 Gbits/sec 
      1 UDP client: 4.20 Mbits/sec

Results for sliver requested from GPO SM:
      1 TCP client:  12.3 Gbits/sec
      5 TCP clients: 11.2 Gbits/sec
     10 TCP clients: 10.6 Gbits/sec
      1 UDP client:  10.0 Mbits/sec

There was one instance where the ExoSM and local SM nodes performed similarly, but in all other test runs (10) the nodes reserved from the local SM performed better.

comment:2 Changed 6 years ago by lnevers@bbn.com

This ticket is partially addressed: nodes in the same rack requested via the ExoSM now perform the same as nodes requested via the local SM, but for TCP only.

Re-ran the test in the GPO rack, using the same rspec with two m1.large VMs to create one sliver via the ExoSM and one via the BBN SM. Ran the same iperf scenario on both slivers; here are the results:

Results for sliver requested from ExoSM:
      1 TCP client: 6.70 Gbits/sec
     5 TCP clients: 6.60 Gbits/sec
    10 TCP clients: 6.59 Gbits/sec
      1 UDP client: (see question below)

Results for sliver requested from GPO SM:
      1 TCP client: 6.10 Gbits/sec
     5 TCP clients: 6.40 Gbits/sec
    10 TCP clients: 6.46 Gbits/sec
      1 UDP client: (see question below)

Results are now consistent across the number of clients and across the SMs for TCP.

Question:

I am not able to get ANY UDP traffic exchanged between the nodes. I tried a range of requested rates, some as low as 1 Mbit/sec, and none succeeded. What is the expected throughput for UDP traffic between two nodes in the same rack with the default link capacity?

comment:3 Changed 6 years ago by lnevers@bbn.com

Found a rate that worked for UDP between two nodes in one rack: 683 bits/sec.

comment:4 in reply to:  3 Changed 6 years ago by lnevers@bbn.com

Replying to lnevers@bbn.com:

Found a rate that worked for UDP between two nodes in one rack: 683 bits/sec.

Also took a look at rack-to-rack UDP rates and found that a GPO-to-UFL UDP iperf can only reach around 3 Mbits/sec, with many packets arriving out of order:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec  28.6 MBytes  4.00 Mbits/sec
[  3] Sent 20410 datagrams
[  3] Server Report:
[  3]  0.0-62.1 sec  23.3 MBytes  3.15 Mbits/sec  2.771 ms 3757/20409 (18%)
[  3]  0.0-62.1 sec  601 datagrams received out-of-order
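For reference, the numbers in the server report above are internally consistent. A small script (assuming iperf's usual conventions: MBytes means 2^20 bytes, Mbits/sec means 10^6 bits/sec) reproduces the reported rates and the 18% loss figure:

```python
MBYTE = 1 << 20  # iperf reports transfer volume in MBytes = 2^20 bytes

def mbits_per_sec(mbytes, seconds):
    """Convert an iperf transfer volume and interval to Mbits/sec (10^6 bits/sec)."""
    return mbytes * MBYTE * 8 / seconds / 1e6

# Client side: 28.6 MBytes sent in 60.0 s
print(round(mbits_per_sec(28.6, 60.0), 2))  # -> 4.0, matching "4.00 Mbits/sec"

# Server side: 23.3 MBytes received in 62.1 s
print(round(mbits_per_sec(23.3, 62.1), 2))  # -> 3.15, matching "3.15 Mbits/sec"

# Datagram loss: 3757 lost out of 20409 sent
print(round(3757 / 20409 * 100))            # -> 18, matching "(18%)"
```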

Is this rate expected?

comment:5 Changed 6 years ago by ibaldin@renci.org

I do not believe UDP issues are fixable until we enable legacy VLAN support across the switches. I blame some combination of our switch firmware and floodlight that we use.

comment:6 in reply to:  5 Changed 6 years ago by lnevers@bbn.com

Replying to ibaldin@renci.org:

I do not believe UDP issues are fixable until we enable legacy VLAN support across the switches. I blame some combination of our switch firmware and floodlight that we use.

Hmm, so experimenters have to live with 683 bits/sec between nodes within the same rack?

comment:7 Changed 6 years ago by ibaldin@renci.org

Experimenters have to live with the design decisions imposed on us.

comment:8 in reply to:  7 Changed 6 years ago by lnevers@bbn.com

Replying to ibaldin@renci.org:

Experimenters have to live with the design decisions imposed on us.

Ilia, I am only sending 1 Mbits/sec of UDP traffic to get the 683 bits/sec between the two hosts in the same rack. Is floodlight really having such a large negative impact?

comment:9 Changed 6 years ago by ibaldin@renci.org

I blame some combination of our switch firmware and floodlight that we use.

comment:10 Changed 6 years ago by lnevers@bbn.com

Summary: changed from "VM requests from ExoSM within rack perform differently than VMs requested from a site-local SM" to "Poor UDP performance in ExoGENI racks"

Expanding ticket definition to capture scenarios in which poor UDP performance has been observed. Also updating summary to capture the current scope of this ticket.

The ExoGENI sites GPO, UFL, and FIU show low UDP performance in the following scenarios:

Topology Description               Test          Throughput observed
VM to VM (1 rack)                  EG-CT-1       700 Kbps - 3 Mbps
VM to Bare Metal (1 rack)          EG-CT-2       2 Mbps - 3 Mbps
Bare Metal to VM (1 rack)          EG-CT-2       2 Mbps - 3 Mbps
Bare Metal to Bare Metal (1 rack)  no test case  3 Mbps
VM to remote VM (2 racks)          EG-CT-3       2 Mbps - 3 Mbps

comment:11 Changed 6 years ago by lnevers@bbn.com

These UDP results were collected with the image suggested in today's ExoGENI call. Image details:

  • CentOS 6.3 v1.0.10
  • http://geni-images.renci.org/images/standard/centos/centos6.3-v1.0.10.xml, fde66f7d94557d30ebf00c86be2ff9581c9b951c
  • uploaded to registry by pruth@renci.org on 2014-03-13 16:39:42.0

Topology Description      Test     Throughput observed
FIU VM to FIU VM          EG-CT-1  2.91 Mbits/sec
FIU VM to FIU Bare Metal  EG-CT-2  2.93 Mbits/sec
FIU Bare Metal to FIU VM  EG-CT-2  2.91 Mbits/sec
GPO VM to FIU VM          EG-CT-3  3.02 Mbits/sec
FIU VM to GPO VM          EG-CT-3  3.09 Mbits/sec

How does this compare to the results you captured in ORCA performance testing?

comment:12 Changed 6 years ago by lnevers@bbn.com

Resolution: fixed
Status: changed from new to closed

The ExoGENI sites GPO and UFL were used to re-collect UDP performance results for the following scenarios:

Topology Description       Test     Current Throughput  Previously Observed
VM to VM (1 rack)          EG-CT-1  810 Mbits/sec       700 Kbps - 3 Mbps
VM to Bare Metal (1 rack)  EG-CT-2  810 Mbits/sec       2 Mbps - 3 Mbps
Bare Metal to VM (1 rack)  EG-CT-2  735 Mbits/sec       2 Mbps - 3 Mbps
VM to remote VM (2 racks)  EG-CT-3  801 Mbits/sec       2 Mbps - 3 Mbps

The UDP performance problem is resolved. Closing ticket.

Note: See TracTickets for help on using tickets.