* Fwd: UDP/IPv6 performance issue
From: ajay seshadri @ 2013-12-10 16:19 UTC
To: netdev

Hi,

I have been testing network performance using my application and other
third-party tools like netperf on systems that have 10G NICs. It is a
simple back-to-back setup with no switches in between.

I see about 15 to 20% performance degradation for UDP/IPv6 compared to
UDP/IPv4 for packets of size 1500.

Running "perf top" against the IPv6 traffic, I identified the following
hot functions:

fib6_force_start_gc()
csum_partial_copy_generic()
udp_v6_flush_pending_frames()
dst_mtu()

csum_partial_copy_generic() shows up because my card does not support
checksum offloading for IPv6 packets. In fact, turning off rx/tx checksum
offloading for IPv4 showed the same function in the "perf top" profile,
but did not cause any performance degradation.

I am now CPU bound at packets of size 1500 and I am not using GSO (for
either IPv4 or IPv6). I tried tweaking the route cache garbage collection
timer values, and I tried setting socket options to disable PMTU
discovery and to set the MTU on the socket, but neither made any
difference.

Is this a known performance issue, or can I fine-tune the system so that
UDP/IPv6 matches UDP/IPv4 performance? As I am CPU bound, the functions I
identified are using up CPU cycles that I could probably save.

Any help is appreciated.

Thanks,
Ajay
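A note on the offload settings discussed in this thread: ethtool reports
and toggles them per interface. A rough sketch, assuming the 10G interface
is named eth0 (substitute the real interface name):

  ethtool -k eth0 | egrep 'checksum|segmentation'   # show current offload state
  ethtool -K eth0 tso off gso off                   # disable segmentation offloads
  ethtool -K eth0 rx off tx off                     # disable rx/tx checksum offloads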
* Re: Fwd: UDP/IPv6 performance issue
From: Rick Jones @ 2013-12-10 17:05 UTC
To: ajay seshadri, netdev

If you want to compare the "fundamental" path length difference between
IPv4 and IPv6, without any concerns about stateless offloads like CKO or
GRO et al, you could use something like a single-byte netperf TCP_RR test.

netperf -c -C -H <remote> -t TCP_RR

and then compare service demands between the two cases. You can add a
"-i 30,3" to have netperf run several iterations to get a better idea of
how close it is to the "real" mean result.

happy benchmarking,

rick jones
* Re: Fwd: UDP/IPv6 performance issue
From: Rick Jones @ 2013-12-10 17:24 UTC
To: ajay seshadri, netdev

On 12/10/2013 09:05 AM, Rick Jones wrote:
> If you want to compare the "fundamental" path length difference between
> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>
> netperf -c -C -H <remote> -t TCP_RR

Or UDP_RR, since that is your usage case...

rick

> and then compare service demands between the two cases. You can add a
> "-i 30,3" to have netperf run several iterations to get a better idea of
> how close it is to the "real" mean result.
>
> happy benchmarking,
>
> rick jones
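Putting the two suggestions together, one way to run the comparison looks
like this (the placeholders stand for the peer's IPv4 and IPv6 addresses):

  netperf -c -C -i 30,3 -H <remote-ipv4> -t UDP_RR
  netperf -c -C -i 30,3 -H <remote-ipv6> -t UDP_RR

and then compare the S.dem (service demand, us/Tr) columns of the two runs.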
* Re: Fwd: UDP/IPv6 performance issue
From: ajay seshadri @ 2013-12-10 19:32 UTC
To: Rick Jones
Cc: netdev

On Tue, Dec 10, 2013 at 12:24 PM, Rick Jones <rick.jones2@hp.com> wrote:
>> If you want to compare the "fundamental" path length difference between
>> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
>> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>>
>> netperf -c -C -H <remote> -t TCP_RR
>
> Or UDP_RR, since that is your usage case...

I am not sure what you mean by '"fundamental" path length difference';
could you please elaborate?

For now I use:

./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450

I am trying to find the maximum throughput I can get at 1500 MTU for
UDP packets (IPv4 / IPv6) without using offloading.
* Re: Fwd: UDP/IPv6 performance issue
From: Rick Jones @ 2013-12-10 21:30 UTC
To: ajay seshadri
Cc: netdev

On 12/10/2013 11:32 AM, ajay seshadri wrote:
> On Tue, Dec 10, 2013 at 12:24 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>> If you want to compare the "fundamental" path length difference between
>>> IPv4 and IPv6, without any concerns about stateless offloads like CKO or
>>> GRO et al, you could use something like a single-byte netperf TCP_RR test.
>>>
>>> netperf -c -C -H <remote> -t TCP_RR
>>
>> Or UDP_RR, since that is your usage case...
>
> I am not sure what you mean by '"fundamental" path length difference';
> could you please elaborate?

I mean how many instructions/cycles it takes to send/receive a single
packet, where all the costs are the per-packet costs and the per-byte
costs are kept at a minimum, and where the stateless offloads won't
matter. That way, whether a stateless offload is enabled for one protocol
or another is essentially a don't care.

When I talk about a per-byte cost, that is usually something like
computing the checksum or copying data to/from the kernel. A per-packet
cost would be going up/down the protocol stack. TSO, GSO, UFO, and their
receive-side analogues reduce per-packet costs, but only when one is
sending a lot of data at one time.

So a single-byte _RR test, since it sends only one byte at a time, will
effectively "bypass" the offloads. I use it as something of a proxy for
those things that aren't blasting great quantities of data.

> For now I use:
>
> ./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450
>
> I am trying to find the maximum throughput I can get at 1500 MTU for
> UDP packets (IPv4 / IPv6) without using offloading.

I presume you are looking at the receive throughput and not the reported
send-side throughput, right?

happy benchmarking,

rick jones
* Re: Fwd: UDP/IPv6 performance issue
From: ajay seshadri @ 2013-12-10 22:25 UTC
To: Rick Jones
Cc: netdev

Hi Rick,

On Tue, Dec 10, 2013 at 4:30 PM, Rick Jones <rick.jones2@hp.com> wrote:
> I mean how many instructions/cycles it takes to send/receive a single
> packet, where all the costs are the per-packet costs and the per-byte
> costs are kept at a minimum, and where the stateless offloads won't
> matter. That way, whether a stateless offload is enabled for one protocol
> or another is essentially a don't care.

Given below are the results for the tests you recommended.

IPv6:

MIGRATED UDP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 1111::80 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size     Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send     Recv   Size    Size   Time    Rate      local  remote local   remote
bytes    bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

10485760 10485760  1     1     10.00   16977.61  5.12   5.46   24.138  25.721
10485760 10485760

IPv4:

MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.80 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size     Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
Send     Recv   Size    Size   Time    Rate      local  remote local   remote
bytes    bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr

10485760 10485760  1     1     10.00   20414.38  5.24   4.70   20.522  18.417
10485760 10485760

The transaction rate for IPv6 is lower.

> So a single-byte _RR test, since it sends only one byte at a time, will
> effectively "bypass" the offloads. I use it as something of a proxy for
> those things that aren't blasting great quantities of data.

With no segmentation offload, I was treating the CPU as a known bottleneck
in both cases and trying to do an apples-to-apples comparison between UDP
IPv4 and IPv6 performance. As we take a 15-20% performance hit for IPv6,
I was trying to understand why the IPv6 route cache gc functions showed up
in the profile, which was a bit surprising.

> I presume you are looking at the receive throughput and not the reported
> send-side throughput, right?

I am looking at the bytes received. In fact the two are identical, as the
network is no longer the bottleneck and we are pegged by the CPU on the
sender for 1500-byte packets.

Thanks,
Ajay
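For comparison with the UDP_STREAM numbers discussed earlier in the thread,
a quick back-of-the-envelope on these results:

  transaction rate:      16977.61 / 20414.38 ~ 0.83  (IPv6 roughly 17% fewer transactions/s)
  local service demand:  24.138 / 20.522     ~ 1.18  (roughly 18% more sender CPU per transaction)

which is in line with the 15-20% degradation reported for the UDP_STREAM case.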
* Re: Fwd: UDP/IPv6 performance issue
From: Hannes Frederic Sowa @ 2013-12-10 17:12 UTC
To: ajay seshadri
Cc: netdev

Hello!

On Tue, Dec 10, 2013 at 11:19:29AM -0500, ajay seshadri wrote:
> I have been testing network performance using my application and other
> third-party tools like netperf on systems that have 10G NICs. It is a
> simple back-to-back setup with no switches in between.
>
> I see about 15 to 20% performance degradation for UDP/IPv6 compared to
> UDP/IPv4 for packets of size 1500.
>
> Running "perf top" against the IPv6 traffic, I identified the following
> hot functions:
>
> fib6_force_start_gc()

The IPv6 routing code is not as well optimized as the IPv4 one, but it is
strange to see fib6_force_start_gc() that high in perf top.

I guess you are sending the frames to distinct destinations each time? A
cached entry is created in the fib on each send, and as soon as the
maximum of 4096 entries is reached a gc run is forced. This setting is
tunable in /proc/sys/net/ipv6/route/max_size.

> csum_partial_copy_generic()
> udp_v6_flush_pending_frames()
> dst_mtu()
>
> csum_partial_copy_generic() shows up because my card does not support
> checksum offloading for IPv6 packets. In fact, turning off rx/tx checksum
> offloading for IPv4 showed the same function in the "perf top" profile,
> but did not cause any performance degradation.
>
> I am now CPU bound at packets of size 1500 and I am not using GSO (for
> either IPv4 or IPv6). I tried tweaking the route cache garbage collection
> timer values, and I tried setting socket options to disable PMTU
> discovery and to set the MTU on the socket, but neither made any
> difference.

A cached entry will be inserted nonetheless. If you don't hit the max_size
route entries limit, I guess there could be a bug which triggers needless
gc invocations.

> Is this a known performance issue, or can I fine-tune the system so that
> UDP/IPv6 matches UDP/IPv4 performance? As I am CPU bound, the functions I
> identified are using up CPU cycles that I could probably save.

Could you send me your send pattern so maybe I could try to reproduce it?

Greetings,

  Hannes
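The tunable Hannes mentions can be read and raised from the shell; a
minimal sketch (the value 16384 below is just an arbitrary example):

  cat /proc/sys/net/ipv6/route/max_size       # current limit on cached IPv6 route entries
  sysctl -w net.ipv6.route.max_size=16384     # raise it, if the limit turns out to be the trigger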
* Re: Fwd: UDP/IPv6 performance issue
From: ajay seshadri @ 2013-12-10 19:46 UTC
To: ajay seshadri, netdev

Hi,

On Tue, Dec 10, 2013 at 12:12 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> The IPv6 routing code is not as well optimized as the IPv4 one, but it is
> strange to see fib6_force_start_gc() that high in perf top.
>
> I guess you are sending the frames to distinct destinations each time? A
> cached entry is created in the fib on each send, and as soon as the
> maximum of 4096 entries is reached a gc run is forced. This setting is
> tunable in /proc/sys/net/ipv6/route/max_size.

My sender is connected to just one other system. My management interfaces
use IPv4 addresses; only the data path has IPv6 addresses. So the IPv6
route cache always has only one entry for the destination, which rules out
exceeding the capacity (4096). I was also surprised to see
fib6_force_start_gc().

> A cached entry will be inserted nonetheless. If you don't hit the max_size
> route entries limit, I guess there could be a bug which triggers needless
> gc invocations.

I am leaning towards needless invocation of gc. At this point I am not
sure why.

> Could you send me your send pattern so maybe I could try to reproduce it?

For netperf I use:

./netperf -t UDP_STREAM -H <remote> -l 60 -- -m 1450

if that answers your question. I am not trying to download any file of a
specific type.

As a side note, ipv6_get_saddr_eval() used to show up right at the top of
the "perf top" profile, using the most CPU cycles (especially when I had
multiple global IPv6 addresses configured). I was able to get rid of it by
binding the socket to the corresponding source address on the sender. If
the sender-side socket is bound to in6addr_any, or we don't bind it
explicitly as in most cases for UDP, then for every extra global address I
configure on the interface I take a performance hit. I am wondering if the
source address lookup code is not optimized enough.

Thanks,
Ajay
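A sketch of how the source-address effect described here could be checked
with netperf alone: list the global addresses the selection code has to
evaluate, then pin the sender to one of them. This assumes a netperf build
that supports the -L global option for the local endpoint; <local-ipv6> is
a placeholder for one of the sender's own global addresses:

  ip -6 addr show scope global
  ./netperf -t UDP_STREAM -H 1111::80 -L <local-ipv6> -l 60 -- -m 1450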
* Re: Fwd: UDP/IPv6 performance issue
From: ajay seshadri @ 2013-12-10 22:11 UTC
To: netdev, Hannes Frederic Sowa

On Tue, Dec 10, 2013 at 2:46 PM, ajay seshadri <seshajay@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 12:12 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> A cached entry will be inserted nonetheless. If you don't hit the
>> max_size route entries limit, I guess there could be a bug which
>> triggers needless gc invocations.
>
> I am leaning towards needless invocation of gc. At this point I am not
> sure why.

Additionally, fib6_force_start_gc() doesn't show up in TCP/IPv6 tests when
I disable TSO and am CPU bound. With TSO turned off, TCP/IPv4 and TCP/IPv6
throughputs are identical.

Thanks,
Ajay