* [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-25 23:34 UTC
To: netdev
Cc: therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

Given Jesper's performance numbers, it's not the way to go.

Instead, go with a signalling scheme via new boolean skb->xmit_more.

This has several advantages:

1) Nearly trivial driver support, just protect the tail pointer
   update with the skb->xmit_more check.

2) No extra indirect calls in the non-deferral cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
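To make point 1 concrete: a driver's xmit path only needs to defer its doorbell/tail-register write; descriptor setup is unchanged. A minimal sketch of the pattern, in which everything except skb->xmit_more is an illustrative name rather than code from any real driver:

    struct foo_ring {                   /* illustrative stand-in for a driver ring */
            u16 next_to_use;
            void __iomem *tail_reg;
    };

    /* Post the descriptor exactly as before, but only write the tail
     * (doorbell) register when no further packets are queued behind
     * this one.  foo_select_ring() and foo_fill_tx_descriptor() are
     * hypothetical helpers.
     */
    static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
                                      struct net_device *dev)
    {
            struct foo_ring *ring = foo_select_ring(dev, skb);

            foo_fill_tx_descriptor(ring, skb);      /* unchanged */
            ring->next_to_use++;

            if (!skb->xmit_more)                    /* the one-line change */
                    writel(ring->next_to_use, ring->tail_reg);

            return NETDEV_TX_OK;
    }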
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26  6:28 UTC
To: David Miller
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.

I'll do benchmarking based on this new API proposal today.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26 10:13 UTC
Cc: brouer, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Tue, 26 Aug 2014 08:28:15 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> > Given Jesper's performance numbers, it's not the way to go.
> >
> > Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> I'll do benchmarking based on this new API proposal today.

While establishing an accurate baseline for my measurements, I'm
starting to see too much variation in my trafgen measurements, meaning
that we unfortunately cannot use it to measure variations on the
nanosec scale.

I'm measuring the packets per sec via "ifpps", and calculating an
average over the measurements, via the following oneliner:

 $ ifpps -clod eth5 -t 1000 | awk 'BEGIN{txsum=0; rxsum=0; n=0} /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps\n", $3, $11, n, rxsum/n, txsum/n }'

Below are measurements done on the *same* kernel:
 - M1: instant tx:1572766 pps n:215 average: tx:1573360 pps (reboot#1)
 - M2: instant tx:1561930 pps n:173 average: tx:1557064 pps (reboot#2)
 - M3: instant tx:1562088 pps n:300 average: tx:1559150 pps (reboot#2)
 - M4: instant tx:1564404 pps n:120 average: tx:1564948 pps (reboot#3)

M1->M2: +6.65ns
M1->M3: +5.79ns
M1->M4: +3.42ns
M3->M4: -2.38ns

I cannot explain the variations, but some possible causes are:
 1) how well the SKB is cache-hot cached via kmem_cache
 2) other interrupts on CPU#0 could disturb us
 3) interactions with the scheduler
 4) interactions with transparent hugepages
 5) CPU "turbostat" interactions

M1 tx:1573360 pps translates into 636ns per packet, and a 1% change
would translate into 6.36ns.  Perhaps we just cannot accurately measure
a 1% improvement.

Trying to increase the sched priority of trafgen (supported via option
--prio-high) resulted in even worse performance, and the kernel starts
to complain "BUG: soft lockup - CPU#0 stuck for 22s!".  With
--prio-high the "instant" readings start to fluctuate a lot, see:

 - instant rx:0 tx:1529260 pps n:191 average: rx:0 tx:1528885 pps
 - instant rx:0 tx:1512640 pps n:192 average: rx:0 tx:1528800 pps
 - instant rx:0 tx:1480050 pps n:193 average: rx:0 tx:1528548 pps
 - instant rx:0 tx:1526474 pps n:194 average: rx:0 tx:1528537 pps

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
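The nanosecond deltas quoted above (e.g. M1->M2: +6.65ns) are differences in per-packet time, i.e. 10^9/pps. A stand-alone C helper showing the arithmetic; this is illustrative and not part of the thread's tooling:

    #include <stdio.h>

    /* Per-packet time in nanoseconds for a given packets-per-second rate. */
    static double pps_to_ns(double pps)
    {
            return 1e9 / pps;
    }

    int main(void)
    {
            /* M1 vs M2 from above: 10^9/1557064 - 10^9/1573360 */
            printf("%.2f ns\n", pps_to_ns(1557064) - pps_to_ns(1573360));
            /* prints ~6.65 ns, matching the M1->M2 figure */
            return 0;
    }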
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-26 12:52 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Tue, 26 Aug 2014 12:13:47 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
[...]
> While establishing an accurate baseline for my measurements, I'm
> starting to see too much variation in my trafgen measurements, meaning
> that we unfortunately cannot use it to measure variations on the
> nanosec scale.

Thus, we need to find a better, more accurate measurement tool than
trafgen/af_packet.

I changed my PPS monitor "ifpps-oneliner" to calculate the nanosec
variation between the instant reading and the average; for TX it also
records the "max" and "min" variation values seen.

This should give us a better (instant) picture of how accurate the
measurement is.

 ifpps -clod eth5 -t 1000 | \
  awk 'BEGIN{txsum=0; rxsum=0; n=0; txvar=0; txvar_min=0; txvar_max=0; rxvar=0;} \
   /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; \
    txvar=0; if (txsum/n>10 && $11>0) { \
     txvar=((1/(txsum/n)*10^9)-(1/$11*10^9)); \
     if (n>10 && txvar < txvar_min) {txvar_min=txvar}; \
     if (n>10 && txvar > txvar_max) {txvar_max=txvar}; \
    }; \
    rxvar=0; if (rxsum/n>10 && $3>0 ) { rxvar=((1/(rxsum/n)*10^9)-(1/$3*10^9))}; \
    printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps (instant variation TX %.3f ns (min:%.3f max:%.3f) RX %.3f ns)\n", $3, $11, n, rxsum/n, txsum/n, txvar, txvar_min, txvar_max, rxvar; \
    if (txvar > 2) {printf "WARNING instant variation high\n" } }'

Nanosec variation with trafgen:
-------------------------------

As can be seen, the min and max nanosec variation with trafgen is
higher than we would like:

Results: trafgen
 (sudo ethtool -C eth5 rx-usecs 1)
 instant rx:0 tx:1566064 pps n:152 average: rx:0 tx:1564534 pps
  (instant variation TX 0.624 ns (min:-6.336 max:1.766) RX 0.000 ns)

Results: trafgen
 (sudo ethtool -C eth5 rx-usecs 30)
 instant rx:0 tx:1576452 pps n:121 average: rx:0 tx:1575652 pps
  (instant variation TX 0.322 ns (min:-4.479 max:0.714) RX 0.000 ns)

Switching to pktgen
-------------------

I suspect a more accurate measurement tool will be "pktgen", because we
can cut out most of the things that can cause these variations (like
kmem_cache and cache-hot variations, and most sched variations).

The main problem with ixgbe is that, in this overload scenario, the
performance is limited by the TX ring size and cleanup intervals, as
described in:
 http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
 https://www.kernel.org/doc/Documentation/networking/pktgen.txt

The results below try to determine which ixgbe ethtool setting gives
the most stable PPS readings.  Notice the TX "min" and "max" nanosec
variations seen over the period.  Sampling over approx 120 sec.

The best setting seems to be:
 sudo ethtool -C eth5 rx-usecs 30
 sudo ethtool -G eth5 tx 512 #(default size)

Pktgen tests are single CPU performance numbers; the script is based on:
 https://github.com/netoptimizer/network-testing/blob/master/pktgen/example01.sh
with CLONE_SKB="100000" (and a single flow, const port number 9/discard)

Setting:
 sudo ethtool -G eth5 tx 512     #(Default setting)
 sudo ethtool -C eth5 rx-usecs 1 #(Default setting)
Result pktgen:
 * instant rx:1 tx:3933892 pps n:120 average: rx:1 tx:3934182 pps
   (instant variation TX -0.019 ns (min:-0.047 max:0.016) RX 0.000 ns)

The variation is very small, but the performance is limited by the TX
ring buffer being full most of the time, TX cleanup being too slow.

Setting: (inc TX ring size)
 sudo ethtool -G eth5 tx 1024
 sudo ethtool -C eth5 rx-usecs 1 #(default setting)
Result pktgen:
 * instant rx:1 tx:5745632 pps n:118 average: rx:1 tx:5748818 pps
   (instant variation TX -0.096 ns (min:-0.293 max:0.897) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 20
Result pktgen:
 * instant rx:1 tx:5765168 pps n:120 average: rx:0 tx:5782242 pps
   (instant variation TX -0.512 ns (min:-1.008 max:1.599) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 30
Result pktgen:
 * instant rx:1 tx:5920856 pps n:114 average: rx:1 tx:5918350 pps
   (instant variation TX 0.071 ns (min:-0.177 max:0.135) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 40
Result pktgen:
 * instant rx:1 tx:5958408 pps n:120 average: rx:0 tx:5947908 pps
   (instant variation TX 0.296 ns (min:-1.410 max:0.595) RX 0.000 ns)

Setting:
 sudo ethtool -G eth5 tx 512
 sudo ethtool -C eth5 rx-usecs 50
Result pktgen:
 * instant rx:1 tx:5966964 pps n:120 average: rx:1 tx:5967306 pps
   (instant variation TX -0.010 ns (min:-1.330 max:0.169) RX 0.000 ns)

Setting:
 sudo ethtool -C eth5 rx-usecs 30
 sudo ethtool -G eth5 tx 1024
Result pktgen:
 instant rx:0 tx:5846252 pps n:120 average: rx:1 tx:5852464 pps
  (instant variation TX -0.182 ns (min:-0.467 max:2.249) RX 0.000 ns)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Alexander Duyck @ 2014-08-26 16:43 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On 08/26/2014 05:52 AM, Jesper Dangaard Brouer wrote:
[...]
> The results below try to determine which ixgbe ethtool setting gives
> the most stable PPS readings.  Notice the TX "min" and "max" nanosec
> variations seen over the period.  Sampling over approx 120 sec.
[...]
> Setting:
>  sudo ethtool -C eth5 rx-usecs 30
>  sudo ethtool -G eth5 tx 1024
> Result pktgen:
>  instant rx:0 tx:5846252 pps n:120 average: rx:1 tx:5852464 pps
>   (instant variation TX -0.182 ns (min:-0.467 max:2.249) RX 0.000 ns)

My advice would be to disable all C states and P states (including
turbo) if possible, and try using idle=poll.  Any processor frequency
and/or C state transitions will totally wreak havoc with trying to get
reliable results out of any performance test.

Thanks,

Alex
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27  7:48 UTC
To: Alexander Duyck
Cc: David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, Jeremy Eder

On Tue, 26 Aug 2014 09:43:48 -0700 Alexander Duyck <alexander.h.duyck@intel.com> wrote:
> On 08/26/2014 05:52 AM, Jesper Dangaard Brouer wrote:
[...]
> My advice would be to disable all C states and P states (including
> turbo) if possible, and try using idle=poll.  Any processor frequency
> and/or C state transitions will totally wreak havoc with trying to get
> reliable results out of any performance test.

Thanks for the advice.  Reading Jeremy Eder's blog post:
 http://www.breakage.org/2012/11/14/processor-max_cstate-intel_idle-max_cstate-and-devcpu_dma_latency/

it seems the best method for disabling these C and P states, and
keeping all CPUs in C0/C1 state, is:

 # tuned-adm profile latency-performance

Accuracy of "ethtool -C eth5 rx-usecs 30" got slightly better (see
below): latency variations below 1ns, min:-0.209 ns and max:0.114 ns.
That is very good, and should be good enough for us to measure the
upcoming code changes.  Just increasing the TX ring size still shows
variations.

I'm also going to disable Hyper-Threading in BIOS, even though I'm only
using one CPU in these tests (I worry that some process running on a
sibling could disturb the accuracy).

Setting:
 * sudo ethtool -C eth5 rx-usecs 30
Result:
 * instant rx:1 tx:5603644 pps n:120 average: rx:0 tx:5603317 pps
   (instant variation TX 0.010 ns (min:-0.132 max:0.114) RX 0.000 ns)
 * instant rx:1 tx:5599820 pps n:300 average: rx:1 tx:5602382 pps
   (instant variation TX -0.082 ns (min:-0.209 max:0.114) RX 0.000 ns)

Setting:
 * sudo ethtool -G eth5 tx 1024
Result:
 * instant rx:1 tx:5398208 pps n:300 average: rx:1 tx:5404893 pps
   (instant variation TX -0.229 ns (min:-0.257 max:1.666) RX 0.000 ns)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27  8:37 UTC
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, Jeremy Eder

On Wed, 27 Aug 2014 09:48:59 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Reading Jeremy Eder's blog post:
>  http://www.breakage.org/2012/11/14/processor-max_cstate-intel_idle-max_cstate-and-devcpu_dma_latency/
>
> it seems the best method for disabling these C and P states, and
> keeping all CPUs in C0/C1 state, is:
>
>  # tuned-adm profile latency-performance
[...]
> I'm also going to disable Hyper-Threading in BIOS, even though I'm only
> using one CPU in these tests (I worry that some process running on a
> sibling could disturb the accuracy).

Disabling Hyper-Threading in BIOS helped the accuracy; now the latency
variation is below 0.1ns, min:-0.031 and max:0.054, over 120 sec.

Still using:
 # tuned-adm profile latency-performance

> Setting:
>  * sudo ethtool -C eth5 rx-usecs 30
> Result:
>  * instant rx:1 tx:5603644 pps n:120 average: rx:0 tx:5603317 pps
>    (instant variation TX 0.010 ns (min:-0.132 max:0.114) RX 0.000 ns)
>  * instant rx:1 tx:5599820 pps n:300 average: rx:1 tx:5602382 pps
>    (instant variation TX -0.082 ns (min:-0.209 max:0.114) RX 0.000 ns)

With no HT:
 * instant rx:1 tx:5597616 pps n:120 average: rx:1 tx:5596283 pps
   (instant variation TX 0.043 ns (min:-0.031 max:0.054) RX 0.000 ns)

With this setup, I think we are ready to measure the effect of our API
changes with sufficient accuracy.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jamal Hadi Salim @ 2014-08-26 14:40 UTC
To: Jesper Dangaard Brouer
Cc: David Miller, netdev, therbert, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

On 08/26/14 06:13, Jesper Dangaard Brouer wrote:
[...]
> Below are measurements done on the *same* kernel:
>  - M1: instant tx:1572766 pps n:215 average: tx:1573360 pps (reboot#1)
>  - M2: instant tx:1561930 pps n:173 average: tx:1557064 pps (reboot#2)
>  - M3: instant tx:1562088 pps n:300 average: tx:1559150 pps (reboot#2)
>  - M4: instant tx:1564404 pps n:120 average: tx:1564948 pps (reboot#3)
>
> M1->M2: +6.65ns
> M1->M3: +5.79ns
> M1->M4: +3.42ns
> M3->M4: -2.38ns
>
> I cannot explain the variations, but some possible causes are:
>  1) how well the SKB is cache-hot cached via kmem_cache
>  2) other interrupts on CPU#0 could disturb us
>  3) interactions with the scheduler
>  4) interactions with transparent hugepages
>  5) CPU "turbostat" interactions

The clock source used in your system may be an issue as well.  I am not
sure if that matters these days, but it used to; I think back then you
could pick ACPI, jiffies, etc. to use as a qdisc clock source.

cheers,
jamal
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Rusty Russell @ 2014-09-01  0:37 UTC
To: Jesper Dangaard Brouer
Cc: brouer, David Miller, netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, dborkman

Jesper Dangaard Brouer <brouer@redhat.com> writes:
[...]
> I'm measuring the packets per sec via "ifpps", and calculating an
> average over the measurements, via the following oneliner:
>
>  $ ifpps -clod eth5 -t 1000 | awk 'BEGIN{txsum=0; rxsum=0; n=0} /[[:digit:]]/ {txsum+=$11;rxsum+=$3;n++; printf "instant rx:%u tx:%u pps n:%u average: rx:%d tx:%d pps\n", $3, $11, n, rxsum/n, txsum/n }'

FYI, this is what I use for this kind of thing:

 https://github.com/rustyrussell/stats

From the README:

 This filter finds identical lines and collapses all the numbers
 into a range, average and standard deviation; it can also print
 out all the numbers in CSV form for import into spreadsheets, etc.

Cheers,
Rusty.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27 12:19 UTC
To: David Miller
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mon, 25 Aug 2014 16:34:58 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.
>
> 2) No extra indirect calls in the non-deferral cases.

Even though it is obvious that this new skb->xmit_more API will not
hurt performance, especially given skb->xmit_more is always 0 in this
kernel, I've still run my pktgen performance tests.

Compared to baseline[1]: (averaged 5609929 pps) (details below signature)
 * (1/5609929*10^9)-(1/5603728*10^9) = -0.197ns

As expected, this API does not hurt performance (as -0.197ns is below
our accuracy levels).

[1] http://thread.gmane.org/gmane.linux.network/327254/focus=327838

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Results: on branch bulking02
----------------------------
Kernel with skb->xmit_more API.  Kernel at:
 * commit a3d1214688d ("neigh: document gc_thresh2")

With no HT:
 * ethtool -C eth5 rx-usecs 30
 * tuned-adm profile latency-performance

Results (pktgen):
 * instant rx:1 tx:5604056 pps n:250 average: rx:1 tx:5604013 pps
   (instant variation TX 0.001 ns (min:-0.114 max:0.104) RX 0.000 ns)
 * instant rx:1 tx:5603388 pps n:87 average: rx:1 tx:5603872 pps
   (instant variation TX -0.015 ns (min:-0.031 max:0.060) RX 0.000 ns)
 * instant rx:1 tx:5604332 pps n:429 average: rx:1 tx:5603300 pps
   (instant variation TX 0.033 ns (min:-0.094 max:0.881) RX 0.000 ns)
 * Average: (5604013+5603872+5603300)/3 = 5603728

Compared to baseline: (averaged 5609929 pps)
 * (1/5609929*10^9)-(1/5603728*10^9) = -0.197ns
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:43 UTC
To: brouer
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 27 Aug 2014 14:19:18 +0200

[...]
> As expected, this API does not hurt performance (as -0.197ns is below
> our accuracy levels).

Thanks for validating, Jesper :-)
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-08-27 12:31 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Mo, 2014-08-25 at 16:34 -0700, David Miller wrote:
> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.

One thing to keep in mind is that there must be an skb available to
trigger the flush; maybe this will hurt us one day.

Thinking more about it: should we go with a coccinelle script and
replace/extend ndo_start_xmit with an additional argument?

We can also add a new function pointer and call that instead of
ndo_start_xmit.  I think only the callq *%rax hurts performance.

Bye,
Hannes
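For concreteness, the signature change being floated above would look roughly like this; a hypothetical sketch of the idea only, not an API that was ever merged:

    #include <linux/netdevice.h>

    /* Hypothetical: pass the "more packets coming" hint as an argument
     * instead of an skb bit.  XMIT_FLAG_MORE and this typedef are
     * illustrative names.
     */
    #define XMIT_FLAG_MORE  0x1     /* further skbs queued behind this one */

    typedef netdev_tx_t (*ndo_start_xmit_flags_t)(struct sk_buff *skb,
                                                  struct net_device *dev,
                                                  unsigned int flags);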
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Eric Dumazet @ 2014-08-27 13:23 UTC
To: Hannes Frederic Sowa
Cc: David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Wed, 2014-08-27 at 14:31 +0200, Hannes Frederic Sowa wrote:

> One thing to keep in mind is that there must be an skb available to
> trigger the flush; maybe this will hurt us one day.
>
> Thinking more about it: should we go with a coccinelle script and
> replace/extend ndo_start_xmit with an additional argument?

This will be a pain for backports and things like that.

skb->xmit_more is a bit annoying, because it consumes one bit in all
skbs, while it could be one byte per cpu, as ndo_start_xmit() is called
while BH are disabled.

It also forces a cache line dirtying, which will hurt qdiscs like HTB
where skbs can be cooked/enqueued by remote cpus.

Also, most drivers will never benefit from this new infra.  Mellanox
mlx4 uses the Blue Frame trick to avoid the iowrite() cost.

I am afraid Jesper's tests are not complete enough to truly give a
picture of the extra costs, because he made sure no false sharing was
possible (a single cpu does everything and keeps the skb in its cache).
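The per-cpu alternative Eric mentions could be shaped roughly as follows; a sketch of the idea only, relying on the stated fact that ndo_start_xmit() runs with BH disabled (all names are illustrative):

    #include <linux/percpu.h>
    #include <linux/types.h>

    /* Hypothetical per-cpu flag instead of an skb bit; with BH disabled
     * across the xmit call, the value cannot change under the driver.
     */
    static DEFINE_PER_CPU(u8, xmit_more_hint);

    static inline void xmit_more_set(bool more)
    {
            __this_cpu_write(xmit_more_hint, more);
    }

    static inline bool xmit_more_get(void)
    {
            return __this_cpu_read(xmit_more_hint);
    }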
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-27 13:56 UTC
To: Eric Dumazet
Cc: Hannes Frederic Sowa, David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

On Wed, 27 Aug 2014 06:23:58 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I am afraid Jesper's tests are not complete enough to truly give a
> picture of the extra costs, because he made sure no false sharing was
> possible (a single cpu does everything and keeps the skb in its cache).

I fully agree.  My tests are very artificial benchmarks, covering only
the optimal case.

I do worry a bit whether writing skb->xmit_more in more cache-cold
scenarios could hurt us (as that cacheline seems to be read-mostly),
but this is not a problem right now because it will always be cache
hot, as we always clear the entire SKB in alloc.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Eric Dumazet @ 2014-08-27 14:09 UTC
To: Jesper Dangaard Brouer
Cc: Hannes Frederic Sowa, David Miller, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman

On Wed, 2014-08-27 at 15:56 +0200, Jesper Dangaard Brouer wrote:

> I do worry a bit whether writing skb->xmit_more in more cache-cold
> scenarios could hurt us (as that cacheline seems to be read-mostly),
> but this is not a problem right now because it will always be cache
> hot, as we always clear the entire SKB in alloc.

Yes, but many of us use things like the TCP stack, with thousands of
packets sitting in a possibly complex qdisc hierarchy before hitting
the device ;)

In these situations, the skb is dirtied at TX completion time, but not
in ndo_start_xmit() (where it is only read).

In the (qdisc->dequeue() -> ndo_start_xmit()) stage, only the first
cache line of the skb is typically dirtied.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:48 UTC
To: brouer
Cc: eric.dumazet, hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 27 Aug 2014 15:56:51 +0200

> I do worry a bit whether writing skb->xmit_more in more cache-cold
> scenarios could hurt us (as that cacheline seems to be read-mostly),
> but this is not a problem right now because it will always be cache
> hot, as we always clear the entire SKB in alloc.

And as I mentioned to Eric, xmit_more sits right next to the queue
number, which is also dirtied in the TX path.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:46 UTC
To: eric.dumazet
Cc: hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 27 Aug 2014 06:23:58 -0700

> On Wed, 2014-08-27 at 14:31 +0200, Hannes Frederic Sowa wrote:
>
> > One thing to keep in mind is that there must be an skb available to
> > trigger the flush; maybe this will hurt us one day.
> >
> > Thinking more about it: should we go with a coccinelle script and
> > replace/extend ndo_start_xmit with an additional argument?
>
> This will be a pain for backports and things like that.

Agreed.

> skb->xmit_more is a bit annoying, because it consumes one bit in all
> skbs, while it could be one byte per cpu, as ndo_start_xmit() is called
> while BH are disabled.
>
> It also forces a cache line dirtying, which will hurt qdiscs like HTB
> where skbs can be cooked/enqueued by remote cpus.

The xmit_more value sits in the same cache line as the queue number,
which also gets dirtied in the TX fast path.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:45 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 27 Aug 2014 14:31:12 +0200

> One thing to keep in mind is that there must be an skb available to
> trigger the flush; maybe this will hurt us one day.
>
> Thinking more about it: should we go with a coccinelle script and
> replace/extend ndo_start_xmit with an additional argument?
>
> We can also add a new function pointer and call that instead of
> ndo_start_xmit.  I think only the callq *%rax hurts performance.

I don't think we will have any problems here; the caller will always be
the entity which analyzes the upcoming set of SKBs to submit and tags
them properly.

I really do not want to add a new OP, and I even more so do not want to
adjust the ndo_start_xmit() signature.  Its effect is far-reaching, and
for absolutely no gain as far as I can see.

Thanks.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-08-28  1:42 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, John Fastabend

On Mi, 2014-08-27 at 13:45 -0700, David Miller wrote:
[...]
> I don't think we will have any problems here; the caller will always be
> the entity which analyzes the upcoming set of SKBs to submit and tags
> them properly.
>
> I really do not want to add a new OP, and I even more so do not want to
> adjust the ndo_start_xmit() signature.  Its effect is far-reaching, and
> for absolutely no gain as far as I can see.

Just brainstorming a bit:

I wonder if we might still need a separate call for tx_flush, e.g. for
af_packet if one wants to allow user space control of batching,
MSG_MORE with a tx hangcheck (also in case user space has control over
it), or to implement a TCP_CORK-like option in af_packet.

Even in the case where we someday might allow concurrent dequeueing
from a qdisc, we might want to use xmit_more opportunistically (and
user space might use xmit_more opportunistically if dequeueing from
lockless data structures).

I just want to ensure we have thought about possible reuses of this
driver API change. ;)  To that end, some proposals (a rough sketch of
option 1 follows after this message):

1) Export a tx_flush function nonetheless, but the normal/fast path
   just depends on skb->xmit_more (so the driver does a static call to
   its flushing function, no indirection here).  We would keep the ndo
   op for flushing around, but ndo_start_xmit must still honor
   xmit_more by itself.  af_packet sockets could then flush the queues
   separately if wanted.  Also, if the ndo op != NULL we know that the
   driver supports batching.

2) Add a new op like ndo_start_xmit_flags with an additional flags
   parameter, and document that NULL skbs can happen; in that case the
   tx queue should only get flushed.  By checking for the new method we
   also know that the driver supports batching.

3) Provide a special skb (head only) with no data to ndo_start_xmit
   which only flushes the queue, and prepare drivers not to enqueue any
   new data for it.

I am concerned that if user space can control batching to the NIC, we
can have side effects if user space did not flush the NIC's queue
beforehand and we dequeue more packets from the qdisc into the same NIC
tx queue.  Allowing user space control over batching might make the
conversion of drivers more painful, as drivers might have to check for
"dirtiness" before they try to enqueue the packet internally.

(To track the queueing state of the NIC we would also have to track
state globally per queue, which might itself become a performance
problem in terms of cache thrashing; I am unsure about that, at least
it does not seem to be the case with BQL.  My idea was a tx-queue dirty
bit somewhere, or checking BQL state before submitting the skb.  But
maybe one would have to extend BQL to keep track of unflushed
vs. flushed skbs to make this safe.)

If queueing failed because of a missing tx_flush, we either have to
requeue the skb to the qdisc or do a fast retry, but I don't think that
NICs will catch up that fast, no?  Or can we just blame user space
applications as wrong if they use xmit_more irresponsibly?

I liked the separate ndo ops approach because it seemed very extensible
to me. :/

I know that these concerns are a bit messy, but I wanted to share them
nonetheless.  Maybe they are all void in the end. ;)  Also, maybe these
concerns can be solved by the addition of future AF_PACKET V4 ndo ops
and we can just postpone this discussion.

Bye,
Hannes
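A rough shape of option 1 from the list above; ndo_tx_flush and the surrounding struct are hypothetical, simplified for illustration:

    #include <linux/netdevice.h>

    /* Hypothetical flush op: drivers that batch would provide it, and a
     * non-NULL pointer doubles as the "supports batching" capability
     * test mentioned above.
     */
    struct batching_ops {
            void (*ndo_tx_flush)(struct net_device *dev, u16 queue);
    };

    /* e.g. called by af_packet after it stops filling a batch */
    static inline void maybe_tx_flush(struct net_device *dev, u16 queue,
                                      const struct batching_ops *ops)
    {
            if (ops->ndo_tx_flush)          /* driver supports batching */
                    ops->ndo_tx_flush(dev, queue);
    }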
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-30  3:22 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 28 Aug 2014 03:42:54 +0200

> I wonder if we might still need a separate call for tx_flush, e.g. for
> af_packet if one wants to allow user space control of batching,
> MSG_MORE with a tx hangcheck (also in case user space has control over
> it), or to implement a TCP_CORK-like option in af_packet.

I disagree with allowing the user to hold a device TX queue hostage
across system calls, therefore the user should provide the entire batch
in such a case.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Jesper Dangaard Brouer @ 2014-08-30 10:23 UTC
To: David Miller
Cc: hannes, netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, john.r.fastabend, brouer

On Fri, 29 Aug 2014 20:22:10 -0700 (PDT) David Miller <davem@davemloft.net> wrote:

> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 28 Aug 2014 03:42:54 +0200
>
> > I wonder if we might still need a separate call for tx_flush, e.g. for
> > af_packet if one wants to allow user space control of batching,
> > MSG_MORE with a tx hangcheck (also in case user space has control over
> > it), or to implement a TCP_CORK-like option in af_packet.
>
> I disagree with allowing the user to hold a device TX queue hostage
> across system calls, therefore the user should provide the entire batch
> in such a case.

I believe we should only do tailptr batching/delaying when a batch of
packets is readily available.  I don't like the concept of
speculatively delaying the tailptr update in the hope that a new packet
will arrive shortly.

We still might need a call for tx_flush, if some error case occurs
while we are looping over our batch of packets.  Most of these TX error
situations should hopefully be detected in the driver, and cause the
driver to write the tailptr before returning, like in commit 2367a17390
("ixgbe: flush when in xmit_more mode and under descriptor pressure").

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
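The ixgbe commit cited above implements roughly the following pattern in the driver's xmit path; a sketch with illustrative ring helpers and names, not actual ixgbe code:

    #define FOO_TX_WAKE_THRESHOLD 32    /* illustrative headroom limit */

    struct foo_ring {                   /* illustrative stand-in for a driver ring */
            u16 next_to_use;
            void __iomem *tail_reg;
    };

    /* Honour xmit_more only while the ring has headroom; under
     * descriptor pressure write the tail immediately so deferred
     * packets cannot be stranded behind a stopped queue.
     * foo_desc_unused() is a hypothetical helper returning the
     * number of free descriptors.
     */
    static void foo_maybe_kick_tail(struct foo_ring *ring,
                                    const struct sk_buff *skb)
    {
            bool pressure = foo_desc_unused(ring) < FOO_TX_WAKE_THRESHOLD;

            if (!skb->xmit_more || pressure)
                    writel(ring->next_to_use, ring->tail_reg);
    }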
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-09-01 20:05 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

On Fr, 2014-08-29 at 20:22 -0700, David Miller wrote:
[...]
> I disagree with allowing the user to hold a device TX queue hostage
> across system calls, therefore the user should provide the entire batch
> in such a case.

Ok, granted.  In regard to syscall latency this also is a stupid idea.
mmapped tx approaches won't even pass these functions, so we don't care
here.

But as soon as we try to make qdiscs absolutely lockless, we have no
guard against concurrently dequeueing skbs from one qdisc, and then
suddenly no single qdisc-dequeueing entity could notify the driver that
the end of the batch was reached.  I think this could become a problem,
depending on how much of the locking is removed?

Bye,
Hannes
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-09-01 21:56 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 01 Sep 2014 22:05:42 +0200

> But as soon as we try to make qdiscs absolutely lockless, we have no
> guard against concurrently dequeueing skbs from one qdisc, and then
> suddenly no single qdisc-dequeueing entity could notify the driver that
> the end of the batch was reached.  I think this could become a problem,
> depending on how much of the locking is removed?

I am certain that batching will require taking the device transmit lock
over the ->ndo_start_xmit() invocations, and therefore the deferral
decisions must atomically be made inside of that context.
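In sketch form, the caller side being described: the deferral decision is made while the txq lock is held, by checking whether another packet immediately follows. Batch construction and error/requeue handling are elided, and everything other than the kernel helpers and skb->xmit_more is illustrative:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Mark all but the last skb of a pre-built batch while holding the
     * txq lock, so the "more coming" decision is atomic with respect
     * to the transmit path.
     */
    static void xmit_batch(struct sk_buff_head *batch,
                           struct net_device *dev,
                           struct netdev_queue *txq)
    {
            struct sk_buff *skb;

            __netif_tx_lock(txq, smp_processor_id());
            while ((skb = __skb_dequeue(batch)) != NULL) {
                    skb->xmit_more = !skb_queue_empty(batch);
                    dev->netdev_ops->ndo_start_xmit(skb, dev);
                    /* return-value handling elided in this sketch */
            }
            __netif_tx_unlock(txq);
    }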
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Hannes Frederic Sowa @ 2014-09-01 22:31 UTC
To: David Miller
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

On Mo, 2014-09-01 at 14:56 -0700, David Miller wrote:

> I am certain that batching will require taking the device transmit lock
> over the ->ndo_start_xmit() invocations, and therefore the deferral
> decisions must atomically be made inside of that context.

I'll stop throwing doubt into this discussion now. ;)

I am afraid the TX lock is per NIC queue, and it won't work out that
easily because skbs for different queues can reside in one qdisc.

Bye,
Hannes
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-09-01 22:35 UTC
To: hannes
Cc: netdev, therbert, jhs, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer, john.r.fastabend

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 02 Sep 2014 00:31:09 +0200

> I am afraid the TX lock is per NIC queue, and it won't work out that
> easily because skbs for different queues can reside in one qdisc.

Batching will help mostly when netif_queue_start() triggers, when we
are in a steady state of stopping and re-starting the txq.

What we really care about is the qdisc <--> txq one-to-one
relationship, which is by far the most common case.

In the multiple-txq-to-qdisc case, we can still benefit because packets
for one txq will still tend to bunch up and give deferral
opportunities.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Cong Wang @ 2014-08-27 18:28 UTC
To: David Miller
Cc: netdev, Tom Herbert, Jamal Hadi Salim, Hannes Frederic Sowa, Eric Dumazet, jeffrey.t.kirsher, rusty, Daniel Borkmann, brouer

On Mon, Aug 25, 2014 at 4:34 PM, David Miller <davem@davemloft.net> wrote:
>
> Given Jesper's performance numbers, it's not the way to go.
>
> Instead, go with a signalling scheme via new boolean skb->xmit_more.
>
> This has several advantages:
>
> 1) Nearly trivial driver support, just protect the tail pointer
>    update with the skb->xmit_more check.
>
> 2) No extra indirect calls in the non-deferral cases.

First of all, I missed your discussion at kernel summit.  Second of
all, I am not familiar with hardware NIC drivers.

But to me, it looks like you are trying to hold some more packets in a
TX queue until the driver decides to flush them all in one shot.  If
that is true, doesn't this mean the latency of the first packet pending
in this queue will increase, and network traffic will be more bursty
for the receiver?

Also, even if this is a cool feature in a hardware driver, doesn't it
make sense to allow users to disable it with ethtool?

It looks like this is still WIP, since no one really sets ->xmit_more
to 1 in the code.

Thanks.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: Tom Herbert @ 2014-08-27 19:31 UTC
To: Cong Wang
Cc: David Miller, netdev, Jamal Hadi Salim, Hannes Frederic Sowa, Eric Dumazet, Jeff Kirsher, Rusty Russell, Daniel Borkmann, brouer

On Wed, Aug 27, 2014 at 11:28 AM, Cong Wang <cwang@twopensource.com> wrote:
[...]
> But to me, it looks like you are trying to hold some more packets in a
> TX queue until the driver decides to flush them all in one shot.  If
> that is true, doesn't this mean the latency of the first packet pending
> in this queue will increase, and network traffic will be more bursty
> for the receiver?

I suspect this won't be a big issue.  The dequeue is still work
conserving, and the BQL limit already ensures that the HW queue doesn't
drain completely when packets are pending in the qdisc -- I doubt this
will increase BQL limits, but that should be verified.  We might see
some latency increase for a batch sent on an idle link (possible with
GSO); if this is a concern we could arrange to flush when sending
packets on idle links.
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:53 UTC
To: therbert
Cc: cwang, netdev, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Tom Herbert <therbert@google.com>
Date: Wed, 27 Aug 2014 12:31:15 -0700

[...]
> I suspect this won't be a big issue.  The dequeue is still work
> conserving, and the BQL limit already ensures that the HW queue doesn't
> drain completely when packets are pending in the qdisc ...  We might
> see some latency increase for a batch sent on an idle link (possible
> with GSO); if this is a concern we could arrange to flush when sending
> packets on idle links.

That's also correct.  The issue to handle specially is the initial send
on a TX queue which is empty or close to being empty.

Probably we want some kind of exponential backoff type scheme, so
assuming we have an empty TX queue we'd trigger TX on the first packet,
then the 3rd, then the 7th, assuming we had that many to send at once.
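That backoff policy could be expressed as simply as the following; a sketch of the idea only, nothing like this exists in the tree:

    #include <linux/types.h>

    /* Kick the hardware after packets 1, 3, 7, 15, ... of a batch that
     * started on an empty TX queue: flush when the 1-based count of
     * packets sent so far is one less than a power of two, so an idle
     * link starts transmitting immediately while later doorbells
     * amortise over growing sub-batches.
     */
    static bool should_flush(unsigned int sent)
    {
            return (sent & (sent + 1)) == 0;    /* 1, 3, 7, 15, ... */
    }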
* Re: [PATCH 0/2] Get rid of ndo_xmit_flush
From: David Miller @ 2014-08-27 20:51 UTC
To: cwang
Cc: netdev, therbert, jhs, hannes, edumazet, jeffrey.t.kirsher, rusty, dborkman, brouer

From: Cong Wang <cwang@twopensource.com>
Date: Wed, 27 Aug 2014 11:28:25 -0700

> But to me, it looks like you are trying to hold some more packets in a
> TX queue until the driver decides to flush them all in one shot.  If
> that is true, doesn't this mean the latency of the first packet pending
> in this queue will increase, and network traffic will be more bursty
> for the receiver?

We intend to tune this so that it doesn't introduce latency.

The situation where we have the largest opportunity to perform this
batching is when the device queue has been stopped and is started back
up.  Usually at this moment the TX queue is 75% full and we now have
room for a couple of packets.

Therefore, delaying the triggering of the TX for this new set of
packets will have no effect on latency, because the device is still
busy transmitting the rest of the TX queue.