From mboxrd@z Thu Jan  1 00:00:00 1970
From: Benjamin Poirier <bpoirier@suse.de>
Subject: [PATCH v2] net: Do not enable tx-nocache-copy by default
Date: Tue, 7 Jan 2014 10:11:10 -0500
Message-ID: <1389107470-18213-1-git-send-email-bpoirier@suse.de>
References: <20140106.200002.1747627391067832069.davem@davemloft.net>
In-Reply-To: <20140106.200002.1747627391067832069.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Tom Herbert

There are many cases where this feature does not improve performance or
even reduces it.

For example, here are the results from tests that I've run using 3.12.6
on one Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The
results are from the Xeon, but they're similar on the i7. All numbers
report the mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow
no-cache copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.

nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002
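What this feature changes is only the flavor of the copy on the transmit
path: with tx-nocache-copy on, user data is copied into skbs with
non-temporal stores (__copy_from_user_nocache(), i.e. movnti on x86),
which bypass the cache hierarchy, instead of ordinary cached stores.
Below is a userspace sketch that mimics the two flavors with SSE2
streaming stores; the buffer size, iteration count and helper names
(copy_nocache(), copy_cached(), bench()) are illustrative assumptions,
not kernel code.

/* nocache_copy_bench.c - compare a cached copy with a non-temporal
 * (streaming) copy, the mechanism behind tx-nocache-copy.
 * Build: gcc -O2 -msse2 -o nocache_copy_bench nocache_copy_bench.c -lrt
 */
#include <emmintrin.h>		/* SSE2: _mm_stream_si128() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1 << 20)	/* 1 MiB per copy (assumption) */
#define ITERS    2000

/* Copy with non-temporal stores; requires dst 16-byte aligned and len a
 * multiple of 16. */
static void copy_nocache(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();	/* order the streaming stores */
}

static void copy_cached(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static double bench(void (*copy)(void *, const void *, size_t),
		    void *dst, const void *src)
{
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++)
		copy(dst, src, BUF_SIZE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	void *src, *dst;

	if (posix_memalign(&src, 16, BUF_SIZE) ||
	    posix_memalign(&dst, 16, BUF_SIZE))
		return 1;
	memset(src, 0x5a, BUF_SIZE);
	printf("cached : %.3f s\n", bench(copy_cached, dst, src));
	printf("nocache: %.3f s\n", bench(copy_nocache, dst, src));
	free(src);
	free(dst);
	return 0;
}

Streaming stores avoid evicting useful cache lines with transmit data
the CPU may never read again, but any later access to the copied data
(software checksumming, for example) has to fetch it back from memory,
which is consistent with this not being a clear win even with checksum
offload.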
As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
---
 net/core/dev.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4fc1722..0e82e77 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5831,13 +5831,8 @@ int register_netdevice(struct net_device *dev)
 	dev->features |= NETIF_F_SOFT_FEATURES;
 	dev->wanted_features = dev->features & dev->hw_features;
 
-	/* Turn on no cache copy if HW is doing checksum */
 	if (!(dev->flags & IFF_LOOPBACK)) {
 		dev->hw_features |= NETIF_F_NOCACHE_COPY;
-		if (dev->features & NETIF_F_ALL_CSUM) {
-			dev->wanted_features |= NETIF_F_NOCACHE_COPY;
-			dev->features |= NETIF_F_NOCACHE_COPY;
-		}
 	}
 
 	/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
-- 
1.8.4
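Note that the patch keeps NETIF_F_NOCACHE_COPY in dev->hw_features, so
the feature stays user-toggleable and merely stops being requested by
default; `ethtool -K ethX tx-nocache-copy on` restores the old behavior
on workloads where it helps. For reference, a sketch of doing the same
through the SIOCETHTOOL ioctl (the program structure, interface-name
handling and error paths are assumptions; the structures are the
ETHTOOL_GSTRINGS/ETHTOOL_SFEATURES uapi available since 2.6.39):

/* set_nocache_copy.c - request tx-nocache-copy via SIOCETHTOOL,
 * the ioctl equivalent of `ethtool -K <ifname> tx-nocache-copy on`.
 * Build: gcc -O2 -o set_nocache_copy set_nocache_copy.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	const char *ifname = argc > 1 ? argv[1] : "eth0"; /* assumption */
	struct ifreq ifr;
	int fd, bit = -1;
	__u32 i, n, blocks;
	struct {
		struct ethtool_sset_info hdr;
		__u32 buf[1];	/* receives the string count */
	} sset;
	struct ethtool_gstrings *strs;
	struct ethtool_sfeatures *sf;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	/* 1) How many feature strings does the kernel export? */
	memset(&sset, 0, sizeof(sset));
	sset.hdr.cmd = ETHTOOL_GSSET_INFO;
	sset.hdr.sset_mask = 1ULL << ETH_SS_FEATURES;
	ifr.ifr_data = (void *)&sset;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GSSET_INFO");
		return 1;
	}
	n = sset.buf[0];

	/* 2) Fetch the strings, find the bit for "tx-nocache-copy". */
	strs = calloc(1, sizeof(*strs) + n * ETH_GSTRING_LEN);
	strs->cmd = ETHTOOL_GSTRINGS;
	strs->string_set = ETH_SS_FEATURES;
	strs->len = n;
	ifr.ifr_data = (void *)strs;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GSTRINGS");
		return 1;
	}
	for (i = 0; i < n; i++)
		if (!strcmp((char *)strs->data + i * ETH_GSTRING_LEN,
			    "tx-nocache-copy"))
			bit = i;
	if (bit < 0) {
		fprintf(stderr, "tx-nocache-copy not found\n");
		return 1;
	}

	/* 3) Request just that one bit. */
	blocks = (n + 31) / 32;
	sf = calloc(1, sizeof(*sf) + blocks * sizeof(sf->features[0]));
	sf->cmd = ETHTOOL_SFEATURES;
	sf->size = blocks;
	sf->features[bit / 32].valid = 1U << (bit % 32);
	sf->features[bit / 32].requested = 1U << (bit % 32);
	ifr.ifr_data = (void *)sf;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_SFEATURES");
		return 1;
	}
	printf("requested tx-nocache-copy on %s\n", ifname);
	close(fd);
	return 0;
}

With the patch applied, `ethtool -k ethX` should report
"tx-nocache-copy: off" but not "[fixed]", i.e. the feature is disabled
by default yet still settable.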