* [PATCH 0/3] net: Byte queue limit patch series
@ 2011-04-26  4:38 Tom Herbert
  2011-04-26  5:56 ` Bill Fink
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Tom Herbert @ 2011-04-26  4:38 UTC (permalink / raw)
  To: davem, netdev

This patch series implements byte queue limits (bql) for NIC TX queues.

Byte queue limits are a mechanism to limit the size of the transmit
hardware queue on a NIC by number of bytes. The goal of these byte
limits is to reduce latency caused by excessive queuing in hardware
without sacrificing throughput.

Hardware queuing limits are typically specified in terms of a number of
hardware descriptors, each of which has a variable size. The variability
of the size of individual queued items can have a very wide range. For
instance with the e1000 NIC the size could range from 64 bytes to 4K
(with TSO enabled). This variability makes it next to impossible to
choose a single queue limit that prevents starvation and provides the lowest
possible latency.

The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions to
the hardware. The latency between two transmissions can be variable in a
system. It is dependent on interrupt frequency, NAPI polling latencies,
scheduling of the queuing discipline, lock contention, etc. Therefore we
propose that byte queue limits should be dynamic and change in
accordance with the networking stack latencies a system encounters.
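
To make this concrete, the adjustment logic amounts to something like the
following (a simplified illustration in pseudo-kernel C, not the exact code
in patch 1; the struct and function names here are only illustrative):

struct dql_sketch {
        unsigned int limit;          /* current byte limit for the queue */
        unsigned int num_queued;     /* total bytes handed to the NIC */
        unsigned int num_completed;  /* total bytes the NIC has completed */
};

/* TX path: account bytes handed to the hardware. */
static void dql_sketch_queued(struct dql_sketch *dql, unsigned int count)
{
        dql->num_queued += count;
}

/* TX path: may we give the hardware more bytes right now? */
static bool dql_sketch_avail(const struct dql_sketch *dql)
{
        return dql->num_queued - dql->num_completed < dql->limit;
}

/* TX completion path: called with the bytes freed by this interrupt. */
static void dql_sketch_completed(struct dql_sketch *dql, unsigned int count,
                                 bool starved)
{
        dql->num_completed += count;

        if (starved) {
                /* The NIC ran dry while the stack still had data to send:
                 * the limit was too small for this interval, so grow it. */
                dql->limit += count;
        } else if (dql->num_queued - dql->num_completed < dql->limit / 2) {
                /* Completions keep arriving with plenty of headroom left:
                 * decay the limit back toward the minimum needed to avoid
                 * starvation. */
                dql->limit -= dql->limit / 16;
        }
}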

Patches to implement this:
Patch 1: Dynamic queue limits (dql) library.  This provides the general
queuing algorithm.
Patch 2: netdev changes that use dql to support byte queue limits.
Patch 3: Support in forcedeth driver for byte queue limits.
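
On the driver side, the hooks from patch 2, as a driver like the one in
patch 3 would use them, boil down to roughly the following (a sketch only;
the foo_* names are placeholders and the netdev_sent_queue() /
netdev_completed_queue() helpers are illustrative, see patch 2 for the
actual interface):

/* In the driver's ->ndo_start_xmit(): account bytes given to the NIC. */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        /* ... set up TX descriptors for skb ... */
        netdev_sent_queue(dev, skb->len);
        /* ... ring the doorbell ... */
        return NETDEV_TX_OK;
}

/* In the TX completion (NAPI) path: report what the NIC finished. */
static void foo_tx_done(struct net_device *dev)
{
        unsigned int pkts = 0, bytes = 0;

        /* ... walk completed descriptors, free skbs, sum pkts and bytes ... */
        netdev_completed_queue(dev, pkts, bytes);
}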

The effects of BQL are demonstrated in the benchmark results below.
These were made running 200 streams of netperf RR tests:

140000 rr size
BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

14000 rr size
BQL: 25-55K bytes in queue, 8500 tps
No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu

1400 rr size
BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

140 rr size
BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

1 rr size
BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu

The amount of queuing in the NIC is reduced by up to 90%, and I haven't
yet seen a consistent negative impact in terms of throughput or
CPU utilization.



* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-26  4:38 [PATCH 0/3] net: Byte queue limit patch series Tom Herbert
@ 2011-04-26  5:56 ` Bill Fink
  2011-04-26  6:17   ` Eric Dumazet
  2011-04-26  6:14 ` Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Bill Fink @ 2011-04-26  5:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev

On Mon, 25 Apr 2011, Tom Herbert wrote:

> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides the lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with the networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

	tps	+0.23 %

> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu

	tps	-0.27 %

> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

	tps	+0.98 %

> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

	tps	-0.81 %

> 1 rr size
> BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu

	tps	-0.33 %

> The amount of queuing in the NIC is reduced by up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

I don't quite follow your conclusion from your data.
While there was a sweet spot for the 1400 rr size, other
smaller rr took a hit.  Now all the tps changes were
within 1 %, so perhaps that isn't considered significant
(I'm not qualified to make that call).  But if that's
the case, then the effective latency change seen by the
user isn't significant either, although the amount of
queuing in the NIC is admittedly significantly reduced
for a rr size of 1400 or larger.

					-Bill


* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-26  4:38 [PATCH 0/3] net: Byte queue limit patch series Tom Herbert
  2011-04-26  5:56 ` Bill Fink
@ 2011-04-26  6:14 ` Eric Dumazet
  2011-04-26 16:57 ` Rick Jones
  2011-04-29 18:54 ` David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2011-04-26  6:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev

On Monday, 25 April 2011 at 21:38 -0700, Tom Herbert wrote:
> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides the lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with the networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu
> 
> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu
> 
> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu
> 
> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu
> 
> 1 rr size
> BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu
> 
> The amount of queuing in the NIC is reduced by up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

Hi Tom

That's a focus on throughput, adding some extra latency (because of new
fields to access/dirty in the tx path and tx completion path), especially on
setups where many cpus are sending data on one device. I suspect this is
the price to pay to fight bufferbloat.

We can try to make this not so expensive.

Maybe try to separate the DQL structure into two parts, one used on the TX
path (inside the already dirtied cache line in the netdev_queue structure
(_xmit_lock, xmit_lock_owner, trans_start)), and the other one used in the TX
completion path?
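
One possible way to cut it, maybe (only a sketch of the layout, with
illustrative field names, not a patch):

struct dql {
        /* Fields written by the xmit path; they can live next to the cache
         * line the TX path already dirties in netdev_queue
         * (_xmit_lock, xmit_lock_owner, trans_start). */
        unsigned int    num_queued ____cacheline_aligned_in_smp;
        unsigned int    adj_limit;

        /* Fields written only by the TX completion path. */
        unsigned int    num_completed ____cacheline_aligned_in_smp;
        unsigned int    limit;
        unsigned int    prev_num_queued;
};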


This new limit scheme also favors streams using super packets. Your
workload uses 200 identical clients; it would be nice to mix DNS traffic
(small UDP frames) in with them, and check how they behave when the queue is
full, while it was almost never full before...





* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-26  5:56 ` Bill Fink
@ 2011-04-26  6:17   ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2011-04-26  6:17 UTC (permalink / raw)
  To: Bill Fink; +Cc: Tom Herbert, davem, netdev

On Tuesday, 26 April 2011 at 01:56 -0400, Bill Fink wrote:

> I don't quite follow your conclusion from your data.
> While there was a sweet spot for the 1400 rr size, other
> smaller rr took a hit.  Now all the tps changes were
> within 1 %, so perhaps that isn't considered significant
> (I'm not qualified to make that call).  But if that's
> the case, then the effective latency change seen by the
> user isn't significant either, although the amount of
> queuing in the NIC is admittedly significantly reduced
> for a rr size of 1400 or larger.

Tom's point was to show that we can reduce latency (because the size of
the netdevice queue is smaller) without changing tps ;)





* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-26  4:38 [PATCH 0/3] net: Byte queue limit patch series Tom Herbert
  2011-04-26  5:56 ` Bill Fink
  2011-04-26  6:14 ` Eric Dumazet
@ 2011-04-26 16:57 ` Rick Jones
  2011-04-29 18:54 ` David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: Rick Jones @ 2011-04-26 16:57 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev

On Mon, 2011-04-25 at 21:38 -0700, Tom Herbert wrote:
> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides the lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with the networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

That is both the request and the response being set to 140000 yes?

> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu
> 
> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu
> 
> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

What, no 14?-)

> 1 rr size
> BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu
> 
> The amount of queuing in the NIC is reduced by up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

Presumably this will also have a positive (imo) effect on the maximum
size to which a bulk transfer's window will grow under auto tuning yes?

How about a "burst mode" TCP_RR test?

happy benchmarking,

rick jones



* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-26  4:38 [PATCH 0/3] net: Byte queue limit patch series Tom Herbert
                   ` (2 preceding siblings ...)
  2011-04-26 16:57 ` Rick Jones
@ 2011-04-29 18:54 ` David Miller
  2011-05-02  2:41   ` Tom Herbert
  3 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2011-04-29 18:54 UTC (permalink / raw)
  To: therbert; +Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Mon, 25 Apr 2011 21:38:06 -0700 (PDT)

> This patch series implements byte queue limits (bql) for NIC TX queues.

I like the idea, I don't like how much drivers have to be involved
in this.

TX queue handling is already too involved for driver writers, so
having them need to get all of these new hooks right is a non-starter.

I don't even think it's necessary, to be honest; I think you can hide
the bulk of it.

alloc_etherdev_mq() can initialize the per-txq bql instances.

Add new interface, netdev_free_tx_skb(txq, skb) which can do the
completion accounting.  Actually the 'txq' argument is probably
superfluous as it can be obtained from the skb itself.

dev_hard_start_xmit() can do the "sent" processing.

Then the only thing you're doing is replacing dev_kfree_skb*()
in the TX path of the driver with the new netdev_free_tx_skb()
thing.
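
I.e. the driver's TX cleanup loop would end up looking something like this
(the foo_* names and the ring bookkeeping are placeholders for whatever
driver you pick; netdev_free_tx_skb() is the helper proposed above):

static void foo_tx_clean(struct foo_priv *np)
{
        while (np->dirty_tx != np->cur_tx && foo_tx_desc_done(np, np->dirty_tx)) {
                struct sk_buff *skb = np->tx_skbs[np->dirty_tx];

                /* Frees the skb and does the byte/packet completion
                 * accounting behind the scenes. */
                netdev_free_tx_skb(skb);        /* was: dev_kfree_skb(skb) */
                np->tx_skbs[np->dirty_tx] = NULL;
                np->dirty_tx = (np->dirty_tx + 1) % FOO_TX_RING_SIZE;
        }
}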

Maybe you can add a "netdev_tx_queue_reset()" hook for asynchronous
device resets.  Otherwise the bql reset can occur in generic code
right at ->ndo_open().

Finally, you manage the bql limit logic in the existing generic netdev
TX start/stop interfaces.  If the user asks for "start" but the bql
is overlimit, simply ignore the request.  The driver will just signal
another "start" when the next TX packet completes.
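
In other words, the generic wake path can consult the bql state itself,
something like this (netdev_txq_bql_overlimit() is a made-up name for
whatever that check ends up being):

static inline void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
        /* Still over the byte limit: stay stopped, the next TX completion
         * will attempt the wake again. */
        if (netdev_txq_bql_overlimit(dev_queue))
                return;

        if (test_and_clear_bit(__QUEUE_STATE_XOFF, &dev_queue->state))
                __netif_schedule(dev_queue->qdisc);
}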

Similarly, when the qdisc is queuing up a packet to
dev_hard_start_xmit() you can, for example, preemptively do a "stop"
on the queue if you find bql is overlimit.

Thanks Tom.



* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-04-29 18:54 ` David Miller
@ 2011-05-02  2:41   ` Tom Herbert
  2011-05-02  3:49     ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Herbert @ 2011-05-02  2:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Fri, Apr 29, 2011 at 11:54 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Mon, 25 Apr 2011 21:38:06 -0700 (PDT)
>
>> This patch series implements byte queue limits (bql) for NIC TX queues.
>
> I like the idea, I don't like how much drivers have to be involved
> in this.
>
> TX queue handling is already too involved for driver writers, so
> having them need to get all of these new hooks right is a non-starter.
>
> I don't even think it's necessary, to be honest; I think you can hide
> the bulk of it.
>
> alloc_etherdev_mq() can initialize the per-txq bql instances.
>
> Add new interface, netdev_free_tx_skb(txq, skb) which can do the
> completion accounting.  Actually the 'txq' argument is probably
> superfluous as it can be obtained from the skb itself.
>
Okay, but I think the call at the end of TX completion processing is
still probably needed.  The algorithm is trying to determine the
number of bytes that completed at each interrupt.

> dev_hard_start_xmit() can do the "sent" processing.
>
> Then the only thing you're doing is replacing dev_kfree_skb*()
> in the TX path of the driver with the new netdev_free_tx_skb()
> thing.
>
> Maybe you can add a "netdev_tx_queue_reset()" hook for asynchronous
> device resets.  Otherwise the bql reset can occur in generic code
> right at ->ndo_open().
>
Reset probably isn't necessary anyway if all the skb's are properly
completed with netdev_free_tx_skb.

> Finally, you manage the bql limit logic in the existing generic netdev
> TX start/stop interfaces.  If the user asks for "start" but the bql
> is overlimit, simply ignore the request.  The driver will just signal
> another "start" when the next TX packet completes.
>
> Similarly, when the qdisc is queuing up a packet to
> dev_hard_start_xmit() you can, for example, preemptively do a "stop"
> on the queue if you find bql is overlimit.
>
Unfortunately, there is still additional complexity if we don't
piggyback on the logic in the driver to stop the queue.  I believe
this would require another queue state for a queue being
stopped for bql, which looks pretty cumbersome, so wrapping this
in a qdisc might be a better possibility.

Tom

> Thanks Tom.
>
>


* Re: [PATCH 0/3] net: Byte queue limit patch series
  2011-05-02  2:41   ` Tom Herbert
@ 2011-05-02  3:49     ` David Miller
  0 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2011-05-02  3:49 UTC (permalink / raw)
  To: therbert; +Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Sun, 1 May 2011 19:41:34 -0700

> On Fri, Apr 29, 2011 at 11:54 AM, David Miller <davem@davemloft.net> wrote:
>> Add new interface, netdev_free_tx_skb(txq, skb) which can do the
>> completion accounting.  Actually the 'txq' argument is probably
>> superfluous as it can be obtained from the skb itself.
>>
> Okay, but I think the call at the end of TX completion processing is
> still probably needed.  The algorithm is trying to determine the
> number of bytes that completed at each interrupt.

Ok, then call it something generic like netdev_tx_complete() which
can serve other purposes in the future and not be bql specific.

>> Finally, you manage the bql limit logic in the existing generic netdev
>> TX start/stop interfaces.  If the user asks for "start" but the bql
>> is overlimit, simply ignore the request.  The driver will just signal
>> another "start" when the next TX packet completes.
>>
>> Similarly, when the qdisc is queuing up a packet to
>> dev_hard_start_xmit() you can, for example, preemptively do a "stop"
>> on the queue if you find bql is overlimit.
>>
> Unfortunately, there is still additional complexity if we don't
> piggyback on the logic in the driver to stop the queue.  I believe
> this would require another queue state for a queue being
> stopped for bql, which looks pretty cumbersome, so wrapping this
> in a qdisc might be a better possibility.

I'll leave it up to you what approach to try next.

Even though I sort of side with you that bql is a largely separate
facility from what we usually do with qdiscs, this TX completion
event could be very useful as an input to qdisc decision making.


Thread overview: 8 messages
2011-04-26  4:38 [PATCH 0/3] net: Byte queue limit patch series Tom Herbert
2011-04-26  5:56 ` Bill Fink
2011-04-26  6:17   ` Eric Dumazet
2011-04-26  6:14 ` Eric Dumazet
2011-04-26 16:57 ` Rick Jones
2011-04-29 18:54 ` David Miller
2011-05-02  2:41   ` Tom Herbert
2011-05-02  3:49     ` David Miller
