* [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-25 23:34 UTC
To: netdev
Cc: therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

Given Jesper's performance numbers, it's not the way to go.

Instead, go with a signalling scheme via new boolean skb->xmit_more.

This has several advantages:

1) Nearly trivial driver support, just protect the tail pointer
   update with the skb->xmit_more check.

2) No extra indirect calls in the non-deferral cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
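To make point 1 concrete: a driver's xmit path only needs to defer its doorbell/tail-register write; descriptor setup is unchanged. A minimal sketch of the pattern, in which everything except skb->xmit_more is an illustrative name rather than code from any real driver:

    struct foo_ring {                   /* illustrative stand-in for a driver ring */
            u16 next_to_use;
            void __iomem *tail_reg;
    };

    /* Post the descriptor exactly as before, but only write the tail
     * (doorbell) register when no further packets are queued behind
     * this one.  foo_select_ring() and foo_fill_tx_descriptor() are
     * hypothetical helpers.
     */
    static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
                                      struct net_device *dev)
    {
            struct foo_ring *ring = foo_select_ring(dev, skb);

            foo_fill_tx_descriptor(ring, skb);      /* unchanged */
            ring->next_to_use++;

            if (!skb->xmit_more)                    /* the one-line change */
                    writel(ring->next_to_use, ring->tail_reg);

            return NETDEV_TX_OK;
    }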
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26  6:28 UTC
To: David Miller
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.

I'll do benchmarking based on this new API proposal today.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26 10:13 UTC
Cc: brouer, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Tue, 26 Aug 2014 08:28:15 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> > Given Jesper's performance numbers, it's not the way to go.
> >
> > Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> I'll do benchmarking based on this new API proposal today.

While establishing an accurate baseline for my measurements, I'm
starting to see too much variation in my trafgen measurements, meaning
that we unfortunately cannot use it to measure variations on the
nanosec scale.

I'm measuring the packets per sec via "ifpps", and calculating an
average over the measurements, via the following oneliner:

 $ ifpps -clod eth5 -t 1000 | awk 'BEGIN{txsum=0; rxsum=0; n=0} /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps\n", $3, $11, n, rxsum/n, txsum/n }'

Below are measurements done on the *same* kernel:
 - M1: instant tx:1572766 pps n:215 average: tx:1573360 pps (reboot#1)
 - M2: instant tx:1561930 pps n:173 average: tx:1557064 pps (reboot#2)
 - M3: instant tx:1562088 pps n:300 average: tx:1559150 pps (reboot#2)
 - M4: instant tx:1564404 pps n:120 average: tx:1564948 pps (reboot#3)

M1->M2: +6.65ns
M1->M3: +5.79ns
M1->M4: +3.42ns
M3->M4: -2.38ns

I cannot explain the variations, but some possible causes are:
 1) how well the SKB is cache-hot cached via kmem_cache
 2) other interrupts on CPU#0 could disturb us
 3) interactions with the scheduler
 4) interactions with transparent hugepages
 5) CPU "turbostat" interactions

M1 tx:1573360 pps translates into 636ns per packet, and a 1% change
would translate into 6.36ns.  Perhaps we just cannot accurately measure
a 1% improvement.

Trying to increase the sched priority of trafgen (supported via option
--prio-high) resulted in even worse performance, and the kernel starts
to complain "BUG: soft lockup - CPU#0 stuck for 22s!".  With
--prio-high the "instant" readings start to fluctuate a lot, see:

 - instant rx:0 tx:1529260 pps n:191 average: rx:0 tx:1528885 pps
 - instant rx:0 tx:1512640 pps n:192 average: rx:0 tx:1528800 pps
 - instant rx:0 tx:1480050 pps n:193 average: rx:0 tx:1528548 pps
 - instant rx:0 tx:1526474 pps n:194 average: rx:0 tx:1528537 pps

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
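The nanosecond deltas quoted above (e.g. M1->M2: +6.65ns) are differences in per-packet time, i.e. 10^9/pps. A stand-alone C helper showing the arithmetic; this is illustrative and not part of the thread's tooling:

    #include <stdio.h>

    /* Per-packet time in nanoseconds for a given packets-per-second rate. */
    static double pps_to_ns(double pps)
    {
            return 1e9 / pps;
    }

    int main(void)
    {
            /* M1 vs M2 from above: 10^9/1557064 - 10^9/1573360 */
            printf("%.2f ns\n", pps_to_ns(1557064) - pps_to_ns(1573360));
            /* prints ~6.65 ns, matching the M1->M2 figure */
            return 0;
    }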
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26 12:52 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Tue, 26 Aug 2014 12:13:47 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
[...]
> While establishing an accurate baseline for my measurements, I'm
> starting to see too much variation in my trafgen measurements, meaning
> that we unfortunately cannot use it to measure variations on the
> nanosec scale.

Thus, we need to find a better, more accurate measurement tool than
trafgen/af_packet.

I changed my PPS monitor "ifpps-oneliner" to calculate the nanosec
variation between the instant reading and the average; for TX it also
records the "max" and "min" variation values seen.

This should give us a better (instant) picture of how accurate the
measurement is.

 ifpps -clod eth5 -t 1000 | \
  awk 'BEGIN{txsum=0; rxsum=0; n=0; txvar=0; txvar_min=0; txvar_max=0; rxvar=0;} \
   /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; \
    txvar=0; if (txsum/n>10 && $11>0) { \
     txvar=((1/(txsum/n)*10^9)-(1/$11*10^9)); \
     if (n>10 && txvar < txvar_min) {txvar_min=txvar}; \
     if (n>10 && txvar > txvar_max) {txvar_max=txvar}; \
    }; \
    rxvar=0; if (rxsum/n>10 && $3>0 ) { rxvar=((1/(rxsum/n)*10^9)-(1/$3*10^9))}; \
    printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps (instant variation TX %.3f ns (min:%.3f max:%.3f) RX %.3f ns)\n", $3, $11, n, rxsum/n, txsum/n, txvar, txvar_min, txvar_max, rxvar; \
    if (txvar > 2) {printf "WARNING instant variation high\n" } }'

Nanosec variation with trafgen:
-------------------------------

As can be seen, the min and max nanosec variation with trafgen is
higher than we would like:

Results: trafgen
 (sudo ethtool -C eth5 rx-usecs 1)
 instant rx:0 tx:1566064 pps n:152 average: rx:0 tx:1564534 pps
  (instant variation TX 0.624 ns (min:-6.336 max:1.766) RX 0.000 ns)

Results: trafgen
 (sudo ethtool -C eth5 rx-usecs 30)
 instant rx:0 tx:1576452 pps n:121 average: rx:0 tx:1575652 pps
  (instant variation TX 0.322 ns (min:-4.479 max:0.714) RX 0.000 ns)

Switching to pktgen
-------------------

I suspect a more accurate measurement tool will be "pktgen", because we
can cut out most of the things that can cause these variations (like
kmem_cache and cache-hot variations, and most sched variations).

The main problem with ixgbe is that, in this overload scenario, the
performance is limited by the TX ring size and cleanup intervals, as
described in:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
 https://www.kernel.org/doc/Documentation/networking/pktgen.txt

The results below try to determine which ixgbe ethtool setting gives
the most stable PPS readings.  Notice the TX "min" and "max" nanosec
variations seen over the period.  Sampling over approx 120 sec.

The best setting seems to be:
 sudo ethtool -C eth5 rx-usecs 30
 sudo ethtool -G eth5 tx 512 #(default size)

Pktgen tests are single CPU performance numbers; the script is based on:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/example01.sh
with CLONE_SKB="100000" (and a single flow, const port number 9/discard)

Setting:
 sudo ethtool -G eth5 tx 512     #(Default setting)
 sudo ethtool -C eth5 rx-usecs 1 #(Default setting)
Result pktgen:
 * instant rx:1 tx:3933892 pps n:120 average: rx:1 tx:3934182 pps
   (instant variation TX -0.019 ns (min:-0.047 max:0.016) RX 0.000 ns)

The variation is very small, but the performance is limited by the TX
ring buffer being full most of the time, TX cleanup being too slow.

Setting: (inc TX ring size)
 sudo ethtool -G eth5 tx 1024
 sudo ethtool -C eth5 rx-usecs 1 #(default setting)
Result pktgen:
 * instant rx:1 tx:5745632 pps n:118 average: rx:1 tx:5748818 pps
   (instant variation TX -0.096 ns (min:-0.293 max:0.897) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 20
Result pktgen:
 * instant rx:1 tx:5765168 pps n:120 average: rx:0 tx:5782242 pps
   (instant variation TX -0.512 ns (min:-1.008 max:1.599) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 30
Result pktgen:
 * instant rx:1 tx:5920856 pps n:114 average: rx:1 tx:5918350 pps
   (instant variation TX 0.071 ns (min:-0.177 max:0.135) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 40
Result pktgen:
 * instant rx:1 tx:5958408 pps n:120 average: rx:0 tx:5947908 pps
   (instant variation TX 0.296 ns (min:-1.410 max:0.595) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 50
Result pktgen:
 * instant rx:1 tx:5966964 pps n:120 average: rx:1 tx:5967306 pps
   (instant variation TX -0.010 ns (min:-1.330 max:0.169) RX 0.000 ns)

Setting:
 sudo ethtool -C eth5 rx-usecs 30
 sudo ethtool -G eth5 tx 1024
Result pktgen:
 instant rx:0 tx:5846252 pps n:120 average: rx:1 tx:5852464 pps
  (instant variation TX -0.182 ns (min:-0.467 max:2.249) RX 0.000 ns)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Alexander Duyck @ 2014-08-26 16:43 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On 08/26/2014 05:52 AM, Jesper Dangaard Brouer wrote:
[...]
> The results below try to determine which ixgbe ethtool setting gives
> the most stable PPS readings.  Notice the TX "min" and "max" nanosec
> variations seen over the period.  Sampling over approx 120 sec.
[...]
> Setting:
>  sudo ethtool -C eth5 rx-usecs 30
>  sudo ethtool -G eth5 tx 1024
> Result pktgen:
>  instant rx:0 tx:5846252 pps n:120 average: rx:1 tx:5852464 pps
>   (instant variation TX -0.182 ns (min:-0.467 max:2.249) RX 0.000 ns)

My advice would be to disable all C states and P states (including
turbo) if possible, and try using idle=poll.  Any processor frequency
and/or C state transitions will totally wreak havoc with trying to get
reliable results out of any performance test.

Thanks,

Alex
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27  7:48 UTC
To: Alexander Duyck
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, Jeremy Eder

On Tue, 26 Aug 2014 09:43:48 -0700 Alexander Duyck <alexander.h.duyck@intel.com> wrote:
> On 08/26/2014 05:52 AM, Jesper Dangaard Brouer wrote:
[...]
> My advice would be to disable all C states and P states (including
> turbo) if possible, and try using idle=poll.  Any processor frequency
> and/or C state transitions will totally wreak havoc with trying to get
> reliable results out of any performance test.

Thanks for the advice.  Reading Jeremy Eder's blog post:
 http://www.breakage.org/2012/11/14/processor-max_cstate-intel_idle-max_cstate-and-devcpu_dma_latency/

it seems the best method for disabling these C and P states, and
keeping all CPUs in C0/C1 state, is:

 # tuned-adm profile latency-performance

Accuracy of "ethtool -C eth5 rx-usecs 30" got slightly better (see
below): latency variations below 1ns, min:-0.209 ns and max:0.114 ns.
That is very good, and should be good enough for us to measure the
upcoming code changes.  Just increasing the TX ring size still shows
variations.

I'm also going to disable Hyper-Threading in BIOS, even though I'm only
using one CPU in these tests (I worry that some process running on a
sibling could disturb the accuracy).

Setting:
 * sudo ethtool -C eth5 rx-usecs 30
Result:
 * instant rx:1 tx:5603644 pps n:120 average: rx:0 tx:5603317 pps
   (instant variation TX 0.010 ns (min:-0.132 max:0.114) RX 0.000 ns)
 * instant rx:1 tx:5599820 pps n:300 average: rx:1 tx:5602382 pps
   (instant variation TX -0.082 ns (min:-0.209 max:0.114) RX 0.000 ns)

Setting:
 * sudo ethtool -G eth5 tx 1024
Result:
 * instant rx:1 tx:5398208 pps n:300 average: rx:1 tx:5404893 pps
   (instant variation TX -0.229 ns (min:-0.257 max:1.666) RX 0.000 ns)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27  8:37 UTC
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, Jeremy Eder

On Wed, 27 Aug 2014 09:48:59 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Reading Jeremy Eder's blog post:
>  http://www.breakage.org/2012/11/14/processor-max_cstate-intel_idle-max_cstate-and-devcpu_dma_latency/
>
> it seems the best method for disabling these C and P states, and
> keeping all CPUs in C0/C1 state, is:
>
>  # tuned-adm profile latency-performance
[...]
> I'm also going to disable Hyper-Threading in BIOS, even though I'm only
> using one CPU in these tests (I worry that some process running on a
> sibling could disturb the accuracy).

Disabling Hyper-Threading in BIOS helped the accuracy; now the latency
variation is below 0.1ns, min:-0.031 and max:0.054, over 120 sec.

Still using:
 # tuned-adm profile latency-performance

> Setting:
>  * sudo ethtool -C eth5 rx-usecs 30
> Result:
>  * instant rx:1 tx:5603644 pps n:120 average: rx:0 tx:5603317 pps
>    (instant variation TX 0.010 ns (min:-0.132 max:0.114) RX 0.000 ns)
>  * instant rx:1 tx:5599820 pps n:300 average: rx:1 tx:5602382 pps
>    (instant variation TX -0.082 ns (min:-0.209 max:0.114) RX 0.000 ns)

With no HT:
 * instant rx:1 tx:5597616 pps n:120 average: rx:1 tx:5596283 pps
   (instant variation TX 0.043 ns (min:-0.031 max:0.054) RX 0.000 ns)

With this setup, I think we are ready to measure the effect of our API
changes with sufficient accuracy.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jamal Hadi Salim @ 2014-08-26 14:40 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On 08/26/14 06:13, Jesper Dangaard Brouer wrote:
[...]
> Below are measurements done on the *same* kernel:
>  - M1: instant tx:1572766 pps n:215 average: tx:1573360 pps (reboot#1)
>  - M2: instant tx:1561930 pps n:173 average: tx:1557064 pps (reboot#2)
>  - M3: instant tx:1562088 pps n:300 average: tx:1559150 pps (reboot#2)
>  - M4: instant tx:1564404 pps n:120 average: tx:1564948 pps (reboot#3)
>
> M1->M2: +6.65ns
> M1->M3: +5.79ns
> M1->M4: +3.42ns
> M3->M4: -2.38ns
>
> I cannot explain the variations, but some possible causes are:
>  1) how well the SKB is cache-hot cached via kmem_cache
>  2) other interrupts on CPU#0 could disturb us
>  3) interactions with the scheduler
>  4) interactions with transparent hugepages
>  5) CPU "turbostat" interactions

The clock source used in your system may be an issue as well.  I am not
sure if that matters these days, but it used to; I think back then you
could pick ACPI, jiffies, etc. to use as a qdisc clock source.

cheers,
jamal
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Rusty Russell @ 2014-09-01  0:37 UTC
To: Jesper Dangaard Brouer
Cc: brouer, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, dborkman

Jesper Dangaard Brouer <brouer@redhat.com> writes:
[...]
> I'm measuring the packets per sec via "ifpps", and calculating an
> average over the measurements, via the following oneliner:
>
>  $ ifpps -clod eth5 -t 1000 | awk 'BEGIN{txsum=0; rxsum=0; n=0} /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps\n", $3, $11, n, rxsum/n, txsum/n }'

FYI, this is what I use for this kind of thing:

 https://github.com/rustyrussell/stats

From the README:

 This filter finds identical lines and collapses all the numbers
 into a range, average and standard deviation; it can also print
 out all the numbers in CSV form for import into spreadsheets, etc.

Cheers,
Rusty.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27 12:19 UTC
To: David Miller
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.
>
> 2) No extra indirect calls in the non-deferral cases.

Even though it is obvious that this new skb->xmit_more API will not
hurt performance, especially given skb->xmit_more is always 0 in this
kernel, I've still run my pktgen performance tests.

Compared to baseline[1]: (averaged 5609929 pps) (details below signature)
 * (1/5609929*10^9)-(1/5603728*10^9) = -0.197ns

As expected, this API does not hurt performance (as -0.197ns is below
our accuracy levels).

[1] http://thread.gmane.org/gmane.linux.network/327254/focus=327838

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Results: on branch bulking02
----------------------------
Kernel with skb->xmit_more API.  Kernel at:
 * commit a3d1214688d ("neigh: document gc_thresh2")

With no HT:
 * ethtool -C eth5 rx-usecs 30
 * tuned-adm profile latency-performance

Results (pktgen):
 * instant rx:1 tx:5604056 pps n:250 average: rx:1 tx:5604013 pps
   (instant variation TX 0.001 ns (min:-0.114 max:0.104) RX 0.000 ns)
 * instant rx:1 tx:5603388 pps n:87 average: rx:1 tx:5603872 pps
   (instant variation TX -0.015 ns (min:-0.031 max:0.060) RX 0.000 ns)
 * instant rx:1 tx:5604332 pps n:429 average: rx:1 tx:5603300 pps
   (instant variation TX 0.033 ns (min:-0.094 max:0.881) RX 0.000 ns)
 * Average: (5604013+5603872+5603300)/3 = 5603728

Compared to baseline: (averaged 5609929 pps)
 * (1/5609929*10^9)-(1/5603728*10^9) = -0.197ns
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:43 UTC
To: brouer
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 27 Aug 2014 14:19:18 +0200

[...]
> As expected, this API does not hurt performance (as -0.197ns is below
> our accuracy levels).

Thanks for validating, Jesper :-)
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-08-27 12:31 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mo, 2014-08-25 at 16:34 -0700, David Miller wrote:
> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.

One thing to keep in mind is that there must be an skb available to
trigger the flush; maybe this will hurt us one day.

Thinking more about it: should we go with a coccinelle script and
replace/extend ndo_start_xmit with an additional argument?

We can also add a new function pointer and call that instead of
ndo_start_xmit.  I think only the callq *%rax hurts performance.

Bye,
Hannes
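For concreteness, the signature change being floated above would look roughly like this; a hypothetical sketch of the idea only, not an API that was ever merged:

    #include <linux/netdevice.h>

    /* Hypothetical: pass the "more packets coming" hint as an argument
     * instead of an skb bit.  XMIT_FLAG_MORE and this typedef are
     * illustrative names.
     */
    #define XMIT_FLAG_MORE  0x1     /* further skbs queued behind this one */

    typedef netdev_tx_t (*ndo_start_xmit_flags_t)(struct sk_buff *skb,
                                                  struct net_device *dev,
                                                  unsigned int flags);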
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Eric Dumazet @ 2014-08-27 13:23 UTC
To: Hannes Frederic Sowa
Cc: David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Wed, 2014-08-27 at 14:31 +0200, Hannes Frederic Sowa wrote:

> One thing to keep in mind is that there must be an skb available to
> trigger the flush; maybe this will hurt us one day.
>
> Thinking more about it: should we go with a coccinelle script and
> replace/extend ndo_start_xmit with an additional argument?

This will be a pain for backports and things like that.

skb->xmit_more is a bit annoying, because it consumes one bit in all
skbs, while it could be one byte per cpu, as ndo_start_xmit() is called
while BH are disabled.

It also forces a cache line dirtying, which will hurt qdiscs like HTB
where skbs can be cooked/enqueued by remote cpus.

Also, most drivers will never benefit from this new infra.  Mellanox
mlx4 uses the Blue Frame trick to avoid the iowrite() cost.

I am afraid Jesper's tests are not complete enough to truly give a
picture of the extra costs, because he made sure no false sharing was
possible (a single cpu does everything and keeps the skb in its cache).
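The per-cpu alternative Eric mentions could be shaped roughly as follows; a sketch of the idea only, relying on the stated fact that ndo_start_xmit() runs with BH disabled (all names are illustrative):

    #include <linux/percpu.h>
    #include <linux/types.h>

    /* Hypothetical per-cpu flag instead of an skb bit; with BH disabled
     * across the xmit call, the value cannot change under the driver.
     */
    static DEFINE_PER_CPU(u8, xmit_more_hint);

    static inline void xmit_more_set(bool more)
    {
            __this_cpu_write(xmit_more_hint, more);
    }

    static inline bool xmit_more_get(void)
    {
            return __this_cpu_read(xmit_more_hint);
    }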
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27 13:56 UTC
To: Eric Dumazet
Cc: Hannes Frederic Sowa, David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Wed, 27 Aug 2014 06:23:58 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I am afraid Jesper's tests are not complete enough to truly give a
> picture of the extra costs, because he made sure no false sharing was
> possible (a single cpu does everything and keeps the skb in its cache).

I fully agree.  My tests are very artificial benchmarks, covering only
the optimal case.

I do worry a bit whether writing skb->xmit_more in more cache-cold
scenarios could hurt us (as that cacheline seems to be read-mostly),
but this is not a problem right now because it will always be cache
hot, as we always clear the entire SKB in alloc.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Eric Dumazet @ 2014-08-27 14:09 UTC
To: Jesper Dangaard Brouer
Cc: Hannes Frederic Sowa, David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Wed, 2014-08-27 at 15:56 +0200, Jesper Dangaard Brouer wrote:

> I do worry a bit whether writing skb->xmit_more in more cache-cold
> scenarios could hurt us (as that cacheline seems to be read-mostly),
> but this is not a problem right now because it will always be cache
> hot, as we always clear the entire SKB in alloc.

Yes, but many of us use things like the TCP stack, with thousands of
packets sitting in a possibly complex qdisc hierarchy before hitting
the device ;)

In these situations, the skb is dirtied at TX completion time, but not
in ndo_start_xmit() (where it is only read).

In the (qdisc->dequeue() -> ndo_start_xmit()) stage, only the first
cache line of the skb is typically dirtied.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:48 UTC
To: brouer
Cc: eric.dumazet, hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 27 Aug 2014 15:56:51 +0200

> I do worry a bit whether writing skb->xmit_more in more cache-cold
> scenarios could hurt us (as that cacheline seems to be read-mostly),
> but this is not a problem right now because it will always be cache
> hot, as we always clear the entire SKB in alloc.

And as I mentioned to Eric, xmit_more sits right next to the queue
number, which is also dirtied in the TX path.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:46 UTC
To: eric.dumazet
Cc: hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 27 Aug 2014 06:23:58 -0700

> On Wed, 2014-08-27 at 14:31 +0200, Hannes Frederic Sowa wrote:
>
> > One thing to keep in mind is that there must be an skb available to
> > trigger the flush; maybe this will hurt us one day.
> >
> > Thinking more about it: should we go with a coccinelle script and
> > replace/extend ndo_start_xmit with an additional argument?
>
> This will be a pain for backports and things like that.

Agreed.

> skb->xmit_more is a bit annoying, because it consumes one bit in all
> skbs, while it could be one byte per cpu, as ndo_start_xmit() is called
> while BH are disabled.
>
> It also forces a cache line dirtying, which will hurt qdiscs like HTB
> where skbs can be cooked/enqueued by remote cpus.

The xmit_more value sits in the same cache line as the queue number,
which also gets dirtied in the TX fast path.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:45 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 27 Aug 2014 14:31:12 +0200

> One thing to keep in mind is that there must be an skb available to
> trigger the flush; maybe this will hurt us one day.
>
> Thinking more about it: should we go with a coccinelle script and
> replace/extend ndo_start_xmit with an additional argument?
>
> We can also add a new function pointer and call that instead of
> ndo_start_xmit.  I think only the callq *%rax hurts performance.

I don't think we will have any problems here; the caller will always be
the entity which analyzes the upcoming set of SKBs to submit and tags
them properly.

I really do not want to add a new OP, and I even more so do not want to
adjust the ndo_start_xmit() signature.  Its effect is far-reaching, and
for absolutely no gain as far as I can see.

Thanks.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-08-28  1:42 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, John Fastabend

On Mi, 2014-08-27 at 13:45 -0700, David Miller wrote:
[...]
> I don't think we will have any problems here; the caller will always be
> the entity which analyzes the upcoming set of SKBs to submit and tags
> them properly.
>
> I really do not want to add a new OP, and I even more so do not want to
> adjust the ndo_start_xmit() signature.  Its effect is far-reaching, and
> for absolutely no gain as far as I can see.

Just brainstorming a bit:

I wonder if we might still need a separate call for tx_flush, e.g. for
af_packet if one wants to allow user space control of batching,
MSG_MORE with a tx hangcheck (also in case user space has control over
it), or to implement a TCP_CORK-like option in af_packet.

Even in the case where we someday might allow concurrent dequeueing
from a qdisc, we might want to use xmit_more opportunistically (and
user space might use xmit_more opportunistically if dequeueing from
lockless data structures).

I just want to ensure we have thought about possible reuses of this
driver API change. ;)  To that end, some proposals (a rough sketch of
option 1 follows after this message):

1) Export a tx_flush function nonetheless, but the normal/fast path
   just depends on skb->xmit_more (so the driver does a static call to
   its flushing function, no indirection here).  We would keep the ndo
   op for flushing around, but ndo_start_xmit must still honor
   xmit_more by itself.  af_packet sockets could then flush the queues
   separately if wanted.  Also, if the ndo op != NULL we know that the
   driver supports batching.

2) Add a new op like ndo_start_xmit_flags with an additional flags
   parameter, and document that NULL skbs can happen; in that case the
   tx queue should only get flushed.  By checking for the new method we
   also know that the driver supports batching.

3) Provide a special skb (head only) with no data to ndo_start_xmit
   which only flushes the queue, and prepare drivers not to enqueue any
   new data for it.

I am concerned that if user space can control batching to the NIC, we
can have side effects if user space did not flush the NIC's queue
beforehand and we dequeue more packets from the qdisc into the same NIC
tx queue.  Allowing user space control over batching might make the
conversion of drivers more painful, as drivers might have to check for
"dirtiness" before they try to enqueue the packet internally.

(To track the queueing state of the NIC we would also have to track
state globally per queue, which might itself become a performance
problem in terms of cache thrashing; I am unsure about that, at least
it does not seem to be the case with BQL.  My idea was a tx-queue dirty
bit somewhere, or checking BQL state before submitting the skb.  But
maybe one would have to extend BQL to keep track of unflushed
vs. flushed skbs to make this safe.)

If queueing failed because of a missing tx_flush, we either have to
requeue the skb to the qdisc or do a fast retry, but I don't think that
NICs will catch up that fast, no?  Or can we just blame user space
applications as wrong if they use xmit_more irresponsibly?

I liked the separate ndo ops approach because it seemed very extensible
to me. :/

I know that these concerns are a bit messy, but I wanted to share them
nonetheless.  Maybe they are all void in the end. ;)  Also, maybe these
concerns can be solved by the addition of future AF_PACKET V4 ndo ops
and we can just postpone this discussion.

Bye,
Hannes
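A rough shape of option 1 from the list above; ndo_tx_flush and the surrounding struct are hypothetical, simplified for illustration:

    #include <linux/netdevice.h>

    /* Hypothetical flush op: drivers that batch would provide it, and a
     * non-NULL pointer doubles as the "supports batching" capability
     * test mentioned above.
     */
    struct batching_ops {
            void (*ndo_tx_flush)(struct net_device *dev, u16 queue);
    };

    /* e.g. called by af_packet after it stops filling a batch */
    static inline void maybe_tx_flush(struct net_device *dev, u16 queue,
                                      const struct batching_ops *ops)
    {
            if (ops->ndo_tx_flush)          /* driver supports batching */
                    ops->ndo_tx_flush(dev, queue);
    }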
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-30  3:22 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 28 Aug 2014 03:42:54 +0200

> I wonder if we might still need a separate call for tx_flush, e.g. for
> af_packet if one wants to allow user space control of batching,
> MSG_MORE with a tx hangcheck (also in case user space has control over
> it), or to implement a TCP_CORK-like option in af_packet.

I disagree with allowing the user to hold a device TX queue hostage
across system calls, therefore the user should provide the entire batch
in such a case.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-30 10:23 UTC
To: David Miller
Cc: hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, john.r.fastabend, brouer

On Fri, 29 Aug 2014 20:22:10 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 28 Aug 2014 03:42:54 +0200
>
> > I wonder if we might still need a separate call for tx_flush, e.g. for
> > af_packet if one wants to allow user space control of batching,
> > MSG_MORE with a tx hangcheck (also in case user space has control over
> > it), or to implement a TCP_CORK-like option in af_packet.
>
> I disagree with allowing the user to hold a device TX queue hostage
> across system calls, therefore the user should provide the entire batch
> in such a case.

I believe we should only do tailptr batching/delaying when a batch of
packets is readily available.  I don't like the concept of
speculatively delaying the tailptr update in the hope that a new packet
will arrive shortly.

We still might need a call for tx_flush, if some error case occurs
while we are looping over our batch of packets.  Most of these TX error
situations should hopefully be detected in the driver, and cause the
driver to write the tailptr before returning, like in commit 2367a17390
("ixgbe: flush when in xmit_more mode and under descriptor pressure").

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
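The ixgbe commit cited above implements roughly the following pattern in the driver's xmit path; a sketch with illustrative ring helpers and names, not actual ixgbe code:

    #define FOO_TX_WAKE_THRESHOLD 32    /* illustrative headroom limit */

    struct foo_ring {                   /* illustrative stand-in for a driver ring */
            u16 next_to_use;
            void __iomem *tail_reg;
    };

    /* Honour xmit_more only while the ring has headroom; under
     * descriptor pressure write the tail immediately so deferred
     * packets cannot be stranded behind a stopped queue.
     * foo_desc_unused() is a hypothetical helper returning the
     * number of free descriptors.
     */
    static void foo_maybe_kick_tail(struct foo_ring *ring,
                                    const struct sk_buff *skb)
    {
            bool pressure = foo_desc_unused(ring) < FOO_TX_WAKE_THRESHOLD;

            if (!skb->xmit_more || pressure)
                    writel(ring->next_to_use, ring->tail_reg);
    }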
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-09-01 20:05 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

On Fr, 2014-08-29 at 20:22 -0700, David Miller wrote:
[...]
> I disagree with allowing the user to hold a device TX queue hostage
> across system calls, therefore the user should provide the entire batch
> in such a case.

Ok, granted.  In regard to syscall latency this also is a stupid idea.
mmapped tx approaches won't even pass these functions, so we don't care
here.

But as soon as we try to make qdiscs absolutely lockless, we have no
guard against concurrently dequeueing skbs from one qdisc, and then
suddenly no single qdisc-dequeueing entity could notify the driver that
the end of the batch was reached.  I think this could become a problem,
depending on how much of the locking is removed?

Bye,
Hannes
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-09-01 21:56 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 01 Sep 2014 22:05:42 +0200

> But as soon as we try to make qdiscs absolutely lockless, we have no
> guard against concurrently dequeueing skbs from one qdisc, and then
> suddenly no single qdisc-dequeueing entity could notify the driver that
> the end of the batch was reached.  I think this could become a problem,
> depending on how much of the locking is removed?

I am certain that batching will require taking the device transmit lock
over the ->ndo_start_xmit() invocations, and therefore the deferral
decisions must atomically be made inside of that context.
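In sketch form, the caller side being described: the deferral decision is made while the txq lock is held, by checking whether another packet immediately follows. Batch construction and error/requeue handling are elided, and everything other than the kernel helpers and skb->xmit_more is illustrative:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Mark all but the last skb of a pre-built batch while holding the
     * txq lock, so the "more coming" decision is atomic with respect
     * to the transmit path.
     */
    static void xmit_batch(struct sk_buff_head *batch,
                           struct net_device *dev,
                           struct netdev_queue *txq)
    {
            struct sk_buff *skb;

            __netif_tx_lock(txq, smp_processor_id());
            while ((skb = __skb_dequeue(batch)) != NULL) {
                    skb->xmit_more = !skb_queue_empty(batch);
                    dev->netdev_ops->ndo_start_xmit(skb, dev);
                    /* return-value handling elided in this sketch */
            }
            __netif_tx_unlock(txq);
    }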
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-09-01 22:31 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

On Mo, 2014-09-01 at 14:56 -0700, David Miller wrote:

> I am certain that batching will require taking the device transmit lock
> over the ->ndo_start_xmit() invocations, and therefore the deferral
> decisions must atomically be made inside of that context.

I'll stop throwing doubt into this discussion now. ;)

I am afraid the TX lock is per NIC queue, and it won't work out that
easily because skbs for different queues can reside in one qdisc.

Bye,
Hannes
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-09-01 22:35 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 02 Sep 2014 00:31:09 +0200

> I am afraid the TX lock is per NIC queue, and it won't work out that
> easily because skbs for different queues can reside in one qdisc.

Batching will help mostly when netif_queue_start() triggers, when we
are in a steady state of stopping and re-starting the txq.

What we really care about is the qdisc <--> txq one-to-one
relationship, which is by far the most common case.

In the multiple-txq-to-qdisc case, we can still benefit because packets
for one txq will still tend to bunch up and give deferral
opportunities.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Cong Wang @ 2014-08-27 18:28 UTC
To: David Miller
Cc: netdev, Tom Herbert, Jamal Hadi Salim, Hannes Frederic Sowa, Eric Dumazet, jeffrey.t.kirsher, rusty, Daniel Borkmann, brouer

On Mon, Aug 25, 2014 at 4:34 PM, David Miller <davem@davemloft.net> wrote:
>
> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.
>
> 2) No extra indirect calls in the non-deferral cases.

First of all, I missed your discussion at kernel summit.  Second of
all, I am not familiar with hardware NIC drivers.

But to me, it looks like you are trying to hold some more packets in a
TX queue until the driver decides to flush them all in one shot.  If
that is true, doesn't this mean the latency of the first packet pending
in this queue will increase, and network traffic will be more bursty
for the receiver?

Also, even if this is a cool feature in a hardware driver, doesn't it
make sense to allow users to disable it with ethtool?

It looks like this is still WIP, since no one really sets ->xmit_more
to 1 in the code.

Thanks.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Tom Herbert @ 2014-08-27 19:31 UTC
To: Cong Wang
Cc: David Miller, netdev, Jamal Hadi Salim, Hannes Frederic Sowa, Eric Dumazet, Jeff Kirsher, Rusty Russell, Daniel Borkmann, brouer

On Wed, Aug 27, 2014 at 11:28 AM, Cong Wang <cwang@twopensource.com> wrote:
[...]
> But to me, it looks like you are trying to hold some more packets in a
> TX queue until the driver decides to flush them all in one shot.  If
> that is true, doesn't this mean the latency of the first packet pending
> in this queue will increase, and network traffic will be more bursty
> for the receiver?

I suspect this won't be a big issue.  The dequeue is still work
conserving, and the BQL limit already ensures that the HW queue doesn't
drain completely when packets are pending in the qdisc -- I doubt this
will increase BQL limits, but that should be verified.  We might see
some latency increase for a batch sent on an idle link (possible with
GSO); if this is a concern we could arrange to flush when sending
packets on idle links.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:53 UTC
To: therbert
Cc: cwang, netdev, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Tom Herbert <therbert@google.com>
Date: Wed, 27 Aug 2014 12:31:15 -0700

[...]
> I suspect this won't be a big issue.  The dequeue is still work
> conserving, and the BQL limit already ensures that the HW queue doesn't
> drain completely when packets are pending in the qdisc ...  We might
> see some latency increase for a batch sent on an idle link (possible
> with GSO); if this is a concern we could arrange to flush when sending
> packets on idle links.

That's also correct.  The issue to handle specially is the initial send
on a TX queue which is empty or close to being empty.

Probably we want some kind of exponential backoff type scheme, so
assuming we have an empty TX queue we'd trigger TX on the first packet,
then the 3rd, then the 7th, assuming we had that many to send at once.
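That backoff policy could be expressed as simply as the following; a sketch of the idea only, nothing like this exists in the tree:

    #include <linux/types.h>

    /* Kick the hardware after packets 1, 3, 7, 15, ... of a batch that
     * started on an empty TX queue: flush when the 1-based count of
     * packets sent so far is one less than a power of two, so an idle
     * link starts transmitting immediately while later doorbells
     * amortise over growing sub-batches.
     */
    static bool should_flush(unsigned int sent)
    {
            return (sent & (sent + 1)) == 0;    /* 1, 3, 7, 15, ... */
    }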
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:51 UTC
To: cwang
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Cong Wang <cwang@twopensource.com>
Date: Wed, 27 Aug 2014 11:28:25 -0700

> But to me, it looks like you are trying to hold some more packets in a
> TX queue until the driver decides to flush them all in one shot.  If
> that is true, doesn't this mean the latency of the first packet pending
> in this queue will increase, and network traffic will be more bursty
> for the receiver?

We intend to tune this so that it doesn't introduce latency.

The situation where we have the largest opportunity to perform this
batching is when the device queue has been stopped and is started back
up.  Usually at this moment the TX queue is 75% full and we now have
room for a couple of packets.

Therefore, delaying the triggering of the TX for this new set of
packets will have no effect on latency, because the device is still
busy transmitting the rest of the TX queue.