* Fwd: UDP/IPv6 performance issue
       [not found] <CAGA2gK6rf8E7cCO=RwbdFAt2ucfxUbCjqhVoga9cnHrEKxcp9g@mail.gmail.com>
@ 2013-12-10 16:19 ` ajay seshadri
  2013-12-10 17:05   ` Rick Jones
  2013-12-10 17:12   ` Hannes Frederic Sowa
  0 siblings, 2 replies; 9+ messages in thread
From: ajay seshadri @ 2013-12-10 16:19 UTC (permalink / raw)
  To: netdev

Hi,

I have been testing network performance with my application and with
third-party tools like netperf on systems that have 10G NICs. It's a
simple back-to-back setup with no switches in between.

I see about 15 to 20% performance degradation for UDP/IPv6 compared
to UDP/IPv4 with 1500-byte packets.

Profiling the IPv6 traffic with "perf top", I identified the following
hot functions:
fib6_force_start_gc()
csum_partial_copy_generic()
udp_v6_flush_pending_frames()
dst_mtu()

csum_partial_copy_generic() shows up because my card doesn't support
checksum offloading for IPv6 packets. In fact, turning off rx/tx
checksum offloading for IPv4 showed the same function in the "perf
top" profile but did not cause any performance degradation.

Now I am CPU bound with 1500-byte packets and I am not using GSO (for
either IPv4 or IPv6). I tried tweaking the route cache garbage
collection timer values and setting socket options to disable PMTU
discovery and set the MTU on the socket, but it made no difference.

I am wondering whether this is a known performance issue or whether I
can tune the system so that UDP/IPv6 matches UDP/IPv4 performance. As
I am CPU bound, the functions I identified are using up CPU cycles
that I could probably save.

Any help is appreciated.

Thanks,
Ajay


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 16:19 ` Fwd: UDP/IPv6 performance issue ajay seshadri
@ 2013-12-10 17:05   ` Rick Jones
  2013-12-10 17:24     ` Rick Jones
  2013-12-10 17:12   ` Hannes Frederic Sowa
  1 sibling, 1 reply; 9+ messages in thread
From: Rick Jones @ 2013-12-10 17:05 UTC (permalink / raw)
  To: ajay seshadri, netdev

If you want to compare the "fundamental" path length difference between 
IPv4 and IPv6, without any concerns about stateless offloads like CKO or 
GRO et al, you could use something like a single-byte netperf TCP_RR test.

netperf -c -C -H <remote> -t TCP_RR

and then compare service demands between the two cases.   You can add a 
"-i 30,3" to have netperf run several iterations to get a better idea of 
how close it is to the "real" mean result.
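
For example, assuming your netperf was built with IPv6 support, the
two runs might look something like:

netperf -4 -c -C -i 30,3 -H <remote IPv4 addr> -t TCP_RR
netperf -6 -c -C -i 30,3 -H <remote IPv6 addr> -t TCP_RR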

happy benchmarking,

rick jones


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 16:19 ` Fwd: UDP/IPv6 performance issue ajay seshadri
  2013-12-10 17:05   ` Rick Jones
@ 2013-12-10 17:12   ` Hannes Frederic Sowa
  2013-12-10 19:46     ` ajay seshadri
  1 sibling, 1 reply; 9+ messages in thread
From: Hannes Frederic Sowa @ 2013-12-10 17:12 UTC (permalink / raw)
  To: ajay seshadri; +Cc: netdev

Hello!

On Tue, Dec 10, 2013 at 11:19:29AM -0500, ajay seshadri wrote:
> I have been testing network performance with my application and with
> third-party tools like netperf on systems that have 10G NICs. It's a
> simple back-to-back setup with no switches in between.
> 
> I see about 15 to 20% performance degradation for UDP/IPv6 compared
> to UDP/IPv4 with 1500-byte packets.
> 
> Profiling the IPv6 traffic with "perf top", I identified the following
> hot functions:
> fib6_force_start_gc()

The IPv6 routing code is not as well optimized as the IPv4 code, but it
is strange to see fib6_force_start_gc() that high in perf top.

I guess you are sending the frames to a distinct destination each time?
A cached entry is created in the fib on each send, and as soon as the
maximum of 4096 entries is reached a gc is forced. The limit is tunable
via /proc/sys/net/ipv6/route/max_size.
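
As a quick check (untested sketch, run as root), you could see whether
you are anywhere near the limit and raise it as an experiment; the
gc-related knobs live in the same directory:

cat /proc/sys/net/ipv6/route/max_size
echo 16384 > /proc/sys/net/ipv6/route/max_size    # experiment only
ls /proc/sys/net/ipv6/route/                      # gc_* tunables etc.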

> csum_partial_copy_generic()
> udp_v6_flush_pending_frames()
> dst_mtu()
> 
> csum_partial_copy_generic() shows up because my card doesn't support
> checksum offloading for IPv6 packets. In fact, turning off rx/tx
> checksum offloading for IPv4 showed the same function in the "perf
> top" profile but did not cause any performance degradation.
> 
> Now I am CPU bound with 1500-byte packets and I am not using GSO (for
> either IPv4 or IPv6). I tried tweaking the route cache garbage
> collection timer values and setting socket options to disable PMTU
> discovery and set the MTU on the socket, but it made no difference.

A cached entry will be inserted nonetheless. If you don't hit the
max_size route entry limit, I guess there could be a bug that triggers
needless gc invocations.
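
If you want to confirm that, something along these lines (assuming
kprobes and perf are available on your kernel) would count the actual
invocations during a test run:

perf probe --add fib6_force_start_gc
perf stat -e probe:fib6_force_start_gc -a sleep 10
perf probe --del fib6_force_start_gc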

> I am wondering whether this is a known performance issue or whether I
> can tune the system so that UDP/IPv6 matches UDP/IPv4 performance. As
> I am CPU bound, the functions I identified are using up CPU cycles
> that I could probably save.

Could you send me your send pattern so maybe I could try to reproduce it?

Greetings,

  Hannes


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 17:05   ` Rick Jones
@ 2013-12-10 17:24     ` Rick Jones
  2013-12-10 19:32       ` ajay seshadri
  0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2013-12-10 17:24 UTC (permalink / raw)
  To: ajay seshadri, netdev

On 12/10/2013 09:05 AM, Rick Jones wrote:
> If you want to compare the "fundamental" path length difference between
> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>
> netperf -c -C -H <remote> -t TCP_RR

Or UDP_RR, since that is your usage case...

rick

>
> and then compare service demands between the two cases.   You can add a
> "-i 30,3" to have netperf run several iterations to get a better idea of
> how close it is to the "real" mean result.
>
> happy benchmarking,
>
> rick jones


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 17:24     ` Rick Jones
@ 2013-12-10 19:32       ` ajay seshadri
  2013-12-10 21:30         ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: ajay seshadri @ 2013-12-10 19:32 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Tue, Dec 10, 2013 at 12:24 PM, Rick Jones <rick.jones2@hp.com> wrote:
>> If you want to compare the "fundamental" path length difference between
>> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
>> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>>
>> netperf -c -C -H <remote> -t TCP_RR
>
> Or UDP_RR, since that is your usage case...

I am not sure what you mean by '"fundamental" path length difference';
can you please elaborate? For now I use:
./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450
I am trying to find the maximum throughput I can get at 1500 MTU for
UDP packets (IPv4 / IPv6) without using offloading.
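
For the comparison I run the same test against the IPv4 and the IPv6
address of the peer, roughly like this (the -4/-6 flags assume an
IPv6-enabled netperf build):

./netperf -4 -t UDP_STREAM -H <remote IPv4 addr> -l 60 -- -m 1450
./netperf -6 -t UDP_STREAM -H <remote IPv6 addr> -l 60 -- -m 1450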


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 17:12   ` Hannes Frederic Sowa
@ 2013-12-10 19:46     ` ajay seshadri
  2013-12-10 22:11       ` ajay seshadri
  0 siblings, 1 reply; 9+ messages in thread
From: ajay seshadri @ 2013-12-10 19:46 UTC (permalink / raw)
  To: ajay seshadri, netdev

Hi,

On Tue, Dec 10, 2013 at 12:12 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> The IPv6 routing code is not as well optimized as the IPv4 code, but it
> is strange to see fib6_force_start_gc() that high in perf top.
>
> I guess you are sending the frames to a distinct destination each time?
> A cached entry is created in the fib on each send, and as soon as the
> maximum of 4096 entries is reached a gc is forced. The limit is tunable
> via /proc/sys/net/ipv6/route/max_size.

My sender is connected to just one other system. My management
interfaces use IPv4 addresses; only the data path has IPv6 addresses.
So the IPv6 route cache always has only one entry for the destination,
which rules out exceeding the 4096-entry limit. I was also surprised
to see fib6_force_start_gc().
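
For what it's worth, I can double-check that while the test is running
with something like (assuming iproute2 lists the RTF_CACHE clones here):

ip -6 route show cache
ip -6 route show cache | wc -l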

> A cached entry will be inserted nonetheless. If you don't hit the
> max_size route entry limit, I guess there could be a bug that triggers
> needless gc invocations.

I am leaning towards a needless gc invocation, though at this point I
am not sure why.

> Could you send me your send pattern so maybe I could try to reproduce it?

For netperf I use:
./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450
I hope that answers your question; I am not trying to download any
file of a specific type.

As a side note, ipv6_get_saddr_eval() used to show up right at the top
of the "perf top" profile, using the most CPU cycles (especially when I
had multiple global IPv6 addresses). I was able to get rid of it by
binding the sender's socket to the corresponding source address. If the
sender-side socket is bound to in6addr_any, or is not bound explicitly
as is usually the case for UDP, I take a performance hit for every
extra global address I configure on the interface. I am wondering
whether the source address lookup code is just not optimized enough.
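
In my application I simply bind(2) the sending socket to the source
address. With netperf, the global -L option (if the installed version
supports it for the data connection) should achieve roughly the same
thing:

./netperf -6 -L <local global IPv6 addr> -H <remote IPv6 addr> \
    -t UDP_STREAM -l 60 -- -m 1450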

Thanks,
Ajay


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 19:32       ` ajay seshadri
@ 2013-12-10 21:30         ` Rick Jones
  2013-12-10 22:25           ` ajay seshadri
  0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2013-12-10 21:30 UTC (permalink / raw)
  To: ajay seshadri; +Cc: netdev

On 12/10/2013 11:32 AM, ajay seshadri wrote:
> On Tue, Dec 10, 2013 at 12:24 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>> If you want to compare the "fundamental" path length difference between
>>> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
>>> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>>>
>>> netperf -c -C -H <remote> -t TCP_RR
>>
>> Or UDP_RR, since that is your usage case...
>
> I am not sure what you mean by '"fundamental" path length difference';
> can you please elaborate?

I mean how many instructions/cycles it takes to send/receive a single 
packet, where all the costs are the per-packet costs and the per-byte 
costs are kept at a minimum.  And also where the stateless offloads 
won't matter.  That way whether a stateless offload is enabled for one 
protocol or another is essentially a don't care.

When I talk about a per-byte cost that is usually something like 
computing the checksum, or copying data to/from the kernel.

A per-packet cost would be going up/down the protocol stack.  TSO, GSO, 
UFO, and their receive side analogues would be things that reduced 
per-packet costs, but only when one is sending a lot of data at one time.

So, single-byte _RR since it is sending only one byte at a time will 
effectively "bypass" the offloads.  I use it as something of a proxy for 
those things that aren't blasting great quantities of data.

> For now I use:
> ./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450
> I am trying to find the maximum throughput I can get at 1500 MTU for
> UDP packets (IPv4 / IPv6) without using offloading.

I presume you are looking at the receive throughput and not the
reported send-side throughput, right?

happy benchmarking,

rick jones


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 19:46     ` ajay seshadri
@ 2013-12-10 22:11       ` ajay seshadri
  0 siblings, 0 replies; 9+ messages in thread
From: ajay seshadri @ 2013-12-10 22:11 UTC (permalink / raw)
  To: netdev, Hannes Frederic Sowa

On Tue, Dec 10, 2013 at 2:46 PM, ajay seshadri <seshajay@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 12:12 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> A cached entry will be inserted nonetheless. If you don't hit the
>> max_size route entry limit, I guess there could be a bug that triggers
>> needless gc invocations.
>
> I am leaning towards a needless gc invocation, though at this point I
> am not sure why.


Additionally, fib6_force_start_gc() doesn't show up in TCP/IPv6 tests
when I disable TSO and am CPU bound. With TSO turned off, the TCP/IPv4
and TCP/IPv6 throughputs are identical.

Thanks,
Ajay


* Re: Fwd: UDP/IPv6 performance issue
  2013-12-10 21:30         ` Rick Jones
@ 2013-12-10 22:25           ` ajay seshadri
  0 siblings, 0 replies; 9+ messages in thread
From: ajay seshadri @ 2013-12-10 22:25 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

Hi Rick,

On Tue, Dec 10, 2013 at 4:30 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I mean how many instructions/cycles it takes to send/receive a single
> packet, where all the costs are the per-packet costs and the per-byte costs
> are kept at a minimum.  And also where the stateless offloads won't matter.
> That way whether a stateless offload is enabled for one protocol or another
> is essentially a don't care.

Given below are the results for the tests you recommended:
IPv6:

MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to
1111::80 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size        Request Resp.  Elapsed Trans.    CPU    CPU     S.dem   S.dem
Send     Recv      Size    Size   Time    Rate      local  remote  local   remote
bytes    bytes     bytes   bytes  secs.   per sec   % S    % S     us/Tr   us/Tr

10485760 10485760  1       1      10.00   16977.61  5.12   5.46    24.138  25.721
10485760 10485760

IPv4:

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
AF_INET to 192.168.31.80 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size        Request Resp.  Elapsed Trans.    CPU    CPU     S.dem   S.dem
Send     Recv      Size    Size   Time    Rate      local  remote  local   remote
bytes    bytes     bytes   bytes  secs.   per sec   % S    % S     us/Tr   us/Tr

10485760 10485760  1       1      10.00   20414.38  5.24   4.70    20.522  18.417
10485760 10485760

The transaction rate for IPv6 is about 17% lower (16977.61 vs 20414.38
transactions/sec), and the local service demand is correspondingly
higher (24.138 vs 20.522 us/Tr), which is roughly in line with the
15-20% gap I see with UDP_STREAM.


> So, single-byte _RR since it is sending only one byte at a time will
> effectively "bypass" the offloads.  I use it as something of a proxy for
> those things that aren't blasting great quantities of data.

With no segmentation offload, I was treating the CPU as a known
bottleneck in both cases and trying to do an apples-to-apples
comparison between UDP/IPv4 and UDP/IPv6 performance. Since we take a
15-20% performance hit for IPv6, I was trying to understand why the
IPv6 route cache gc functions showed up in the profile, which was a
bit surprising.


> I presume you are looking at the receive throughput and not the
> reported send-side throughput, right?

I am looking at the bytes received. In fact the two are identical, as
the network is no longer the bottleneck and we are pegged by the CPU
on the sender for 1500-byte packets.

Thanks,
Ajay


