* [PATCH v4 0/10] bql: Byte Queue Limits
@ 2011-11-29 2:32 Tom Herbert
2011-11-29 4:23 ` Dave Taht
2011-11-29 16:46 ` Eric Dumazet
0 siblings, 2 replies; 26+ messages in thread
From: Tom Herbert @ 2011-11-29 2:32 UTC (permalink / raw)
To: davem, netdev
Changes from last version:
- Fixed obj leak in netdev_queue_add_kobject (suggested by shemminger)
- Change dql to use unsigned int (32 bit) values (suggested by eric)
- Added adj_limit field to dql structure. This is computed as
limit + num_completed. In dql_avail this is used to determine
availability with one less arithmetic op.
- Use UINT_MAX for limit constants.
- Change netdev_sent_queue to not have a number of packets argument,
one packet is assumed. (suggested by shemminger)
- Added more detail about locking requirements for dql
- Moved netdev->state field to the written-fields part of the netdev structure
- Fixed function prototypes in dql.h.
----
This patch series implements byte queue limits (bql) for NIC TX queues.
Byte queue limits are a mechanism to limit the size of the transmit
hardware queue on a NIC by number of bytes. The goal of these byte
limits is to reduce latency (HOL blocking) caused by excessive queuing
in hardware (aka buffer bloat) without sacrificing throughput.
Hardware queuing limits are typically specified in terms of a number of
hardware descriptors, each of which has a variable size. The variability
of the size of individual queued items can have a very wide range. For
instance with the e1000 NIC the size could range from 64 bytes to 4K
(with TSO enabled). This variability makes it next to impossible to
choose a single queue limit that prevents starvation and provides the lowest
possible latency.
The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions to
the hardware. The latency between two transmissions can be variable in a
system. It is dependent on interrupt frequency, NAPI polling latencies,
scheduling of the queuing discipline, lock contention, etc. Therefore we
propose that byte queue limits should be dynamic and change in
accordance with the networking stack latencies a system encounters. BQL
should not need to take the underlying link speed as input; it should
automatically adjust to whatever the speed is (even if that in itself is
dynamic).
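To make the intent concrete, here is a simplified sketch of such a
dynamically adjusted limit (illustrative only; the structure and function
names below are invented, and this is not the exact dql algorithm from the
patches): grow the limit when the hardware queue was starved while data was
still pending, decay it when completions show plenty of headroom.

/* Simplified sketch of a dynamic byte limit -- illustrative only,
 * not the actual dql code from this series.
 */
struct byte_limit_sketch {
        unsigned int limit;        /* current byte limit */
        unsigned int queued;       /* bytes handed to hardware so far */
        unsigned int completed;    /* bytes reported completed so far */
};

static void sketch_tx_completed(struct byte_limit_sketch *bl,
                                unsigned int bytes, int queue_was_starved)
{
        bl->completed += bytes;

        if (queue_was_starved) {
                /* Hardware ran dry while the stack still had packets:
                 * the limit was too small, so grow it.
                 */
                bl->limit += bytes;
        } else if (bl->queued - bl->completed < bl->limit / 2) {
                /* Completions keep the queue well under the limit:
                 * decay the limit slowly to keep the standing queue small.
                 */
                bl->limit -= bl->limit / 16;
        }
}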
Patches to implement this:
- Dynamic queue limits (dql) library. This provides the general
queuing algorithm.
- netdev changes that use dql to support byte queue limits.
- Support in drivers for byte queue limits.
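For reference, a minimal sketch of how a single-queue driver hooks this up,
assuming the netdev_sent_queue()/netdev_completed_queue() helpers added by
this series (the foo_* functions and the completion bookkeeping are made up
for illustration):

#include <linux/netdevice.h>

/* Hypothetical single-queue driver showing where the BQL hooks sit. */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        /* ... post skb to the hardware TX ring ... */

        /* Account bytes handed to hardware; this may stop the stack
         * queue when the dynamic byte limit is reached.
         */
        netdev_sent_queue(dev, skb->len);

        return NETDEV_TX_OK;
}

static void foo_tx_clean(struct net_device *dev)
{
        unsigned int pkts = 0, bytes = 0;

        /* ... reclaim completed descriptors, summing pkts and bytes ... */

        /* Report completed work once per completion pass; this is where
         * the byte limit is recomputed and the queue may be restarted.
         */
        netdev_completed_queue(dev, pkts, bytes);
}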
The effects of BQL are demonstrated in the benchmark results below.
--- High priority versus low priority traffic:
In this test 100 netperf TCP_STREAMs were started to saturate the link.
A single instance of a netperf TCP_RR was run with high priority set.
The queuing discipline is pfifo_fast, the NIC is e1000 with the TX ring
size set to 1024. The tps for the high priority RR is listed.
No BQL, tso on: 3000-3200K bytes in queue, 36 tps
BQL, tso on: 156-194K bytes in queue, 535 tps
No BQL, tso off: 453-454K bytes in queue, 234 tps
BQL, tso off: 66K bytes in queue, 914 tps
--- Various RR sizes
These tests were done running 200 streams of netperf RR tests. The
results demonstrate the reduction in queuing and also illustrate
the overhead due to BQL (at small RR sizes).
140000 rr size
BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu
14000 rr size
BQL: 25-55K bytes in queue, 8500 tps
No BQL: 1500-1622K bytes in queue, 8523 tps, 4.53% cpu
1400 rr size
BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu
140 rr size
BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu
1 rr size
BQL: 0-3K in queue, 338811 tps, 41.41% cpu
No BQL: 0-3K in queue, 339947 tps, 42.36% cpu
So the amount of queuing in the NIC can be reduced up to 90% or more.
Accordingly, the latency for high priority packets in the presence
of low priority bulk throughput traffic can be reduced by 90% or more.
Since BQL accounting is in the transmit path for every packet, and the
function to recompute the byte limit is run once per transmit
completion -- there will be some overhead in using BQL. So far, I've seen
the overhead to be in the range of 1-3% for CPU utilization and maximum
pps.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 2:32 [PATCH v4 0/10] bql: Byte Queue Limits Tom Herbert
@ 2011-11-29 4:23 ` Dave Taht
2011-11-29 7:02 ` Eric Dumazet
2011-11-29 16:46 ` Eric Dumazet
1 sibling, 1 reply; 26+ messages in thread
From: Dave Taht @ 2011-11-29 4:23 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
> In this test 100 netperf TCP_STREAMs were started to saturate the link.
> A single instance of a netperf TCP_RR was run with high priority set.
> Queuing discipline in pfifo_fast, NIC is e1000 with TX ring size set to
> 1024. tps for the high priority RR is listed.
>
> No BQL, tso on: 3000-3200K bytes in queue: 36 tps
> BQL, tso on: 156-194K bytes in queue, 535 tps
> No BQL, tso off: 453-454K bytes int queue, 234 tps
> BQL, tso off: 66K bytes in queue, 914 tps
Jeeze. Under what circumstances is tso a win? I've always
had great trouble with it, as some e1000 cards do it rather badly.
I assume these are while running at GigE speeds?
What of 100Mbit? 10GigE? (I will duplicate your tests
at 100Mbit, but as for 10gigE...)
I would suggest TCP_MAERTS as well to saturate the
link in the other direction.
And then both TCP_STREAM and
TCP_MAERTS at the same time while doing RR.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 4:23 ` Dave Taht
@ 2011-11-29 7:02 ` Eric Dumazet
2011-11-29 7:07 ` Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 7:02 UTC (permalink / raw)
To: Dave Taht; +Cc: Tom Herbert, davem, netdev
Le mardi 29 novembre 2011 à 05:23 +0100, Dave Taht a écrit :
> > In this test 100 netperf TCP_STREAMs were started to saturate the link.
> > A single instance of a netperf TCP_RR was run with high priority set.
> > Queuing discipline in pfifo_fast, NIC is e1000 with TX ring size set to
> > 1024. tps for the high priority RR is listed.
> >
> > No BQL, tso on: 3000-3200K bytes in queue: 36 tps
> > BQL, tso on: 156-194K bytes in queue, 535 tps
>
> > No BQL, tso off: 453-454K bytes int queue, 234 tps
> > BQL, tso off: 66K bytes in queue, 914 tps
>
>
> Jeeze. Under what circumstances is tso a win? I've always
> had great trouble with it, as some e1000 cards do it rather badly.
>
> I assume these are while running at GigE speeds?
>
> What of 100Mbit? 10GigE? (I will duplicate your tests
> at 100Mbit, but as for 10gigE...)
>
TSO on means a low priority 65Kbytes packet can be in the TX ring right
before the high priority packet. If you can't afford the delay, you lose.
There is no mystery here.
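For a rough sense of the scale of that delay, the wire time of one TSO
burst follows directly from the link speed (a back-of-the-envelope sketch;
the link rates below are just assumed examples):

/* Rough wire time of a 64KB TSO super-packet: ~524 us at 1 Gb/s,
 * ~5.2 ms at 100 Mb/s, ~52 us at 10 Gb/s. Illustrative only.
 */
#include <stdio.h>

int main(void)
{
        const double burst_bytes = 64 * 1024;   /* one TSO super-packet */
        const double link_bps = 1e9;            /* assumed 1 Gb/s link */

        printf("%.0f us head-of-line blocking\n",
               burst_bytes * 8.0 / link_bps * 1e6);
        return 0;
}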
If you want low latencies:
- TSO must be disabled so that packets are at most one ethernet frame.
- You adjust the BQL limit to a small value.
- You can even lower the MTU to get even better latencies.
If you want good throughput from your [10]GigE and low cpu cost, TSO
should be enabled.
If you want to be smart, you could have a dynamic behavior:
Leave TSO on as long as no high priority, low latency producer is running
(if low latency packets are locally generated).
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:02 ` Eric Dumazet
@ 2011-11-29 7:07 ` Eric Dumazet
2011-11-29 7:23 ` John Fastabend
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 7:07 UTC (permalink / raw)
To: Dave Taht; +Cc: Tom Herbert, davem, netdev
Le mardi 29 novembre 2011 à 08:02 +0100, Eric Dumazet a écrit :
> TSO on means a low priority 65Kbytes packet can be in TX ring right
> before the high priority packet. If you cant afford the delay, you lose.
>
By the way, I hope TSO is off for wifi adapters.
At least here it's off...
# ethtool -k wlan0
Offload parameters for wlan0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: off
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off
0c:00.0 Network controller: Broadcom Corporation BCM4311 802.11a/b/g (rev 01)
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:02 ` Eric Dumazet
2011-11-29 7:07 ` Eric Dumazet
@ 2011-11-29 7:23 ` John Fastabend
2011-11-29 7:45 ` Eric Dumazet
2011-11-29 8:37 ` Dave Taht
2011-11-29 14:24 ` Ben Hutchings
2011-11-29 17:28 ` Rick Jones
3 siblings, 2 replies; 26+ messages in thread
From: John Fastabend @ 2011-11-29 7:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dave Taht, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
On 11/28/2011 11:02 PM, Eric Dumazet wrote:
> Le mardi 29 novembre 2011 à 05:23 +0100, Dave Taht a écrit :
>>> In this test 100 netperf TCP_STREAMs were started to saturate the link.
>>> A single instance of a netperf TCP_RR was run with high priority set.
>>> Queuing discipline in pfifo_fast, NIC is e1000 with TX ring size set to
>>> 1024. tps for the high priority RR is listed.
>>>
>>> No BQL, tso on: 3000-3200K bytes in queue: 36 tps
>>> BQL, tso on: 156-194K bytes in queue, 535 tps
>>
>>> No BQL, tso off: 453-454K bytes int queue, 234 tps
>>> BQL, tso off: 66K bytes in queue, 914 tps
>>
>>
>> Jeeze. Under what circumstances is tso a win? I've always
>> had great trouble with it, as some e1000 cards do it rather badly.
>>
>> I assume these are while running at GigE speeds?
>>
>> What of 100Mbit? 10GigE? (I will duplicate your tests
>> at 100Mbit, but as for 10gigE...)
>>
>
> TSO on means a low priority 65Kbytes packet can be in TX ring right
> before the high priority packet. If you cant afford the delay, you lose.
>
> There is no mystery here.
>
> If you want low latencies :
> - TSO must be disabled so that packets are at most one ethernet frame.
> - You adjust BQL limit to small value
> - You even can lower MTU to get even more better latencies.
>
> If you want good throughput from your [10]GigE and low cpu cost, TSO
> should be enabled.
>
> If you want to be smart, you could have a dynamic behavior :
>
> Let TSO on as long as no high priority low latency producer is running
> (if low latency packets are locally generated)
>
>
I wonder if we should consider enabling TSO/GSO per queue or per traffic
class on devices that support this. At least in devices that support
multiple traffic classes it seems to be a common usage case to put bulk
storage traffic (iSCSI) on a traffic class and low latency traffic on a
separate traffic class, VoIP for example.
John.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:23 ` John Fastabend
@ 2011-11-29 7:45 ` Eric Dumazet
2011-11-29 8:03 ` John Fastabend
2011-11-29 8:37 ` Dave Taht
1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 7:45 UTC (permalink / raw)
To: John Fastabend
Cc: Dave Taht, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
Le lundi 28 novembre 2011 à 23:23 -0800, John Fastabend a écrit :
> I wonder if we should consider enabling TSO/GSO per queue or per traffic
> class on devices that support this. At least in devices that support
> multiple traffic classes it seems to be a common usage case to put bulk
> storage traffic (iSCSI) on a traffic class and low latency traffic on a
> separate traffic class, VoIP for example.
>
It all depends on how the device itself is doing its mux from queues to
the ethernet wire. If queue 0 starts transmitting one 64KB 'super
packet', will queue 1 be able to insert a little frame between the
frames of queue 0?
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:45 ` Eric Dumazet
@ 2011-11-29 8:03 ` John Fastabend
0 siblings, 0 replies; 26+ messages in thread
From: John Fastabend @ 2011-11-29 8:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dave Taht, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
On 11/28/2011 11:45 PM, Eric Dumazet wrote:
> Le lundi 28 novembre 2011 à 23:23 -0800, John Fastabend a écrit :
>
>> I wonder if we should consider enabling TSO/GSO per queue or per traffic
>> class on devices that support this. At least in devices that support
>> multiple traffic classes it seems to be a common usage case to put bulk
>> storage traffic (iSCSI) on a traffic class and low latency traffic on a
>> separate traffic class, VoIP for example.
>>
>
> It all depends on how device itself is doing its mux from queues to
> ethernet wire. If queue 0 starts transmit of one 64KB 'super packet',
> will queue 1 be able to insert a litle frame between the frames of queue
> 0 ?
>
Yes, this works at least on the ixgbe-supported 82599 device as you
would hope. 'Super packets' from queues can and will be interleaved,
perhaps with standard sized packets, depending on the currently
configured arbitration scheme. So with multiple traffic classes we
can make a link-strict 'low latency' class that TXes frames as soon as
they are available.
Also I would expect this to work correctly on any of the so-called
CNA devices, the bnx2x devices for example. I'll probably see what
can be done after finishing up some other things first.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:23 ` John Fastabend
2011-11-29 7:45 ` Eric Dumazet
@ 2011-11-29 8:37 ` Dave Taht
2011-11-29 8:43 ` Eric Dumazet
1 sibling, 1 reply; 26+ messages in thread
From: Dave Taht @ 2011-11-29 8:37 UTC (permalink / raw)
To: John Fastabend
Cc: Eric Dumazet, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
On Tue, Nov 29, 2011 at 8:23 AM, John Fastabend
<john.r.fastabend@intel.com> wrote:
>
> I wonder if we should consider enabling TSO/GSO per queue or per traffic
> class on devices that support this. At least in devices that support
> multiple traffic classes it seems to be a common usage case to put bulk
> storage traffic (iSCSI) on a traffic class and low latency traffic on a
> separate traffic class, VoIP for example.
VOIP is a drop in the bucket.
Turning TSO off on TCP exiting the datacenter (or, more specifically,
on TCP destined anywhere there is a potential tx/rx bandwidth disparity)
would be goooooood.
>
> John.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 8:37 ` Dave Taht
@ 2011-11-29 8:43 ` Eric Dumazet
2011-11-29 8:51 ` Dave Taht
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 8:43 UTC (permalink / raw)
To: Dave Taht
Cc: John Fastabend, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
Le mardi 29 novembre 2011 à 09:37 +0100, Dave Taht a écrit :
> On Tue, Nov 29, 2011 at 8:23 AM, John Fastabend
> <john.r.fastabend@intel.com> wrote:
> >
> > I wonder if we should consider enabling TSO/GSO per queue or per traffic
> > class on devices that support this. At least in devices that support
> > multiple traffic classes it seems to be a common usage case to put bulk
> > storage traffic (iSCSI) on a traffic class and low latency traffic on a
> > separate traffic class, VoIP for example.
>
>
> VOIP is a drop in the bucket.
>
> Turning TSO off on TCP exiting the datacenter (or more specifically),
> destined anywhere there is potential tx/rx bandwidth disparity
> would be goooooood.
>
If your cpu is fast enough (and they are most of the time), this makes
no difference at all.
Instead of consuming 3% of cpu with TSO, you'll consume 10% or 15%, and
see no difference on the wire.
Really, if you want to avoid bursts, TSO has little to do with them.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 8:43 ` Eric Dumazet
@ 2011-11-29 8:51 ` Dave Taht
2011-11-29 14:57 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: Dave Taht @ 2011-11-29 8:51 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Fastabend, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
On Tue, Nov 29, 2011 at 9:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 29 novembre 2011 à 09:37 +0100, Dave Taht a écrit :
>> On Tue, Nov 29, 2011 at 8:23 AM, John Fastabend
>> <john.r.fastabend@intel.com> wrote:
>> >
>> > I wonder if we should consider enabling TSO/GSO per queue or per traffic
>> > class on devices that support this. At least in devices that support
>> > multiple traffic classes it seems to be a common usage case to put bulk
>> > storage traffic (iSCSI) on a traffic class and low latency traffic on a
>> > separate traffic class, VoIP for example.
>>
>>
>> VOIP is a drop in the bucket.
>>
>> Turning TSO off on TCP exiting the datacenter (or more specifically),
>> destined anywhere there is potential tx/rx bandwidth disparity
>> would be goooooood.
>>
>
> If your cpu is fast enough (and they are most of the time), this makes
> no difference at all.
>
> Instead of consuming 3% of cpu with TSO, you'll consume 10% or 15% and
> no difference seen on the wire.
Perhaps I don't understand the gross effects of TSO very well, but if you have
100 streams coming from a server, destined to X different destinations,
and you FQ to each on a per packet basis, you end up impacting the downstream
receive buffers throughout much less than if you send each stream as a burst.
>
> Really, if you want to avoid bursts, TSO has litle to do with them.
If I'm misunderstanding the downstream effects of TSO, I stand corrected.
>
>
>
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:02 ` Eric Dumazet
2011-11-29 7:07 ` Eric Dumazet
2011-11-29 7:23 ` John Fastabend
@ 2011-11-29 14:24 ` Ben Hutchings
2011-11-29 14:29 ` Eric Dumazet
2011-11-29 17:28 ` Rick Jones
3 siblings, 1 reply; 26+ messages in thread
From: Ben Hutchings @ 2011-11-29 14:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dave Taht, Tom Herbert, davem, netdev
On Tue, 2011-11-29 at 08:02 +0100, Eric Dumazet wrote:
> Le mardi 29 novembre 2011 à 05:23 +0100, Dave Taht a écrit :
> > > In this test 100 netperf TCP_STREAMs were started to saturate the link.
> > > A single instance of a netperf TCP_RR was run with high priority set.
> > > Queuing discipline in pfifo_fast, NIC is e1000 with TX ring size set to
> > > 1024. tps for the high priority RR is listed.
> > >
> > > No BQL, tso on: 3000-3200K bytes in queue: 36 tps
> > > BQL, tso on: 156-194K bytes in queue, 535 tps
> >
> > > No BQL, tso off: 453-454K bytes int queue, 234 tps
> > > BQL, tso off: 66K bytes in queue, 914 tps
> >
> >
> > Jeeze. Under what circumstances is tso a win? I've always
> > had great trouble with it, as some e1000 cards do it rather badly.
> >
> > I assume these are while running at GigE speeds?
> >
> > What of 100Mbit? 10GigE? (I will duplicate your tests
> > at 100Mbit, but as for 10gigE...)
> >
>
> TSO on means a low priority 65Kbytes packet can be in TX ring right
> before the high priority packet. If you cant afford the delay, you lose.
>
> There is no mystery here.
>
> If you want low latencies :
> - TSO must be disabled so that packets are at most one ethernet frame.
[...]
Not if you separate hardware queues by priority (and your high priority
packets are non-TCP or PuSHed).
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 14:24 ` Ben Hutchings
@ 2011-11-29 14:29 ` Eric Dumazet
2011-11-29 16:06 ` Dave Taht
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 14:29 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Dave Taht, Tom Herbert, davem, netdev
Le mardi 29 novembre 2011 à 14:24 +0000, Ben Hutchings a écrit :
> Not if you separate hardware queues by priority (and your high priority
> packets are non-TCP or PuSHed).
I mostly have tg3, bnx2 cards, mono queues...
I presume Dave, working on small Wifi/ADSL routers, has the same kind of
hardware.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 8:51 ` Dave Taht
@ 2011-11-29 14:57 ` Eric Dumazet
2011-11-29 16:24 ` Dave Taht
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 14:57 UTC (permalink / raw)
To: Dave Taht
Cc: John Fastabend, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
Le mardi 29 novembre 2011 à 09:51 +0100, Dave Taht a écrit :
> Perhaps I don't understand the gross effects of TSO very well, but if you have
> 100 streams coming from a server, destined to X different destinations,
> and you FQ to each on a per packet basis, you end up impacting the downstream
> receive buffers throughout much less than if you send each stream as a burst.
TSO makes packets larger, to lower cpu use in different layers (netfilter, qdisc, ...).
Imagine you could have MSS=65000 on your ethernet wire.
If you need to send a high prio packet while a prior big one is
in-flight on a dumb device (a single TX FIFO), there is nothing you
can do but wait for the last bit of the big packet to hit the wire.
Even with one flow you lose. A hundred flows don't matter
(as long as you have proper classification in the Qdisc layer, of course).
Most setups don't care.
The ones that do care dedicate a link for exclusive use, making sure it
won't cross loaded trunks (heartbeats in clusters, for example).
Even disabling TSO won't be enough for them if a single tcp flow can compete with them.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 14:29 ` Eric Dumazet
@ 2011-11-29 16:06 ` Dave Taht
2011-11-29 16:41 ` Ben Hutchings
0 siblings, 1 reply; 26+ messages in thread
From: Dave Taht @ 2011-11-29 16:06 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Ben Hutchings, Tom Herbert, davem, netdev
On Tue, Nov 29, 2011 at 3:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 29 novembre 2011 à 14:24 +0000, Ben Hutchings a écrit :
>
>> Not if you separate hardware queues by priority (and your high priority
>> packets are non-TCP or PuSHed).
>
> I mostly have tg3 , bnx2 cards, mono queues...
>
> I presume Dave, working on small Wifi/ADSL routers have same kind of
> hardware.
Nothing but mono queues here on wired - 4 queues on wireless, however.
My focus is on trying to make sure the 10Gig guys don't swamp the
128Kbit to 100Mbit guys; everything in between that bandwidth range is
what I care about, mostly against GigE servers...
(I'm still waiting on some 10Gig hw donations to arrive.)
However the hardware array is much larger than you presume.
We have a variety of hardware, ranging from 7 cerowrt routers located
in the bloatlab #1 at ISC.org, where there are also a couple of
x86_64-based multicore servers, and a variety of related (mostly
wireless) hardware, such as a bunch of OLPCs.
Bloatlab #1 is in California, and connected to the internet via 10GigE
and on a dedicated gigE connection all its own.
http://www.bufferbloat.net/projects/cerowrt/wiki/BloatLab_1
With overly reduced TX rings to combat bufferbloat, the best the
routers in the lab can do is about 290Mbit. They have excellent TCP_RR
stats, though. With larger rings, they do 540Mbit+. It's my hope with
BQL on the router to get closer to the larger figure.
One of the x86 machines in the lab does TSO and it's ugly...
I'm now based in Paris specifically to be testing FQ and AQM solutions
over the 170 ms LFN between here and there, and have been working on
QFQ + RED (while awaiting 'RED light'), both at 100Mbit line rates and
at software-simulated rates below that, common to actual end user
connectivity to the internet.
http://www.bufferbloat.net/issues/312
I have 3 additional routers and several e1000e machines here in Paris,
and I'm checking into the interactions of all this against everything
else, against a variety of models. ISC has made the bloatlab
available to all, I note; if anyone wants to run a test there, let me
know...
>
>
>
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 14:57 ` Eric Dumazet
@ 2011-11-29 16:24 ` Dave Taht
2011-11-29 17:06 ` David Laight
0 siblings, 1 reply; 26+ messages in thread
From: Dave Taht @ 2011-11-29 16:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: John Fastabend, Tom Herbert, davem@davemloft.net,
netdev@vger.kernel.org
On Tue, Nov 29, 2011 at 3:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 29 novembre 2011 à 09:51 +0100, Dave Taht a écrit :
>
>> Perhaps I don't understand the gross effects of TSO very well, but if you have
>> 100 streams coming from a server, destined to X different destinations,
>> and you FQ to each on a per packet basis, you end up impacting the downstream
>> receive buffers throughout much less than if you send each stream as a burst.
>
> TSO makes packets larger, to lower cpu use in different layers (netfilter, qdisc, ...).
>
> Imagine you could have MSS=65000 on your ethernet wire.
>
> If you need to send a high prio packet while a prior big one is
> in-flight on a dumb device (a single TX FIFO), there is nothing you
> can do but wait last bit of big packet hit the wire.
>
> Even with one flow you lose. Hundred flows dont matter
> (as long as you have proper classification in Qdisc layer, of course)
People keep talking about 'prioritization' as if it can apply.
It doesn't. Prioritization and classification are nearly hopeless
exercises when you have high rate streams. It worked at low rates for
some traffic, but...
The focus for fixing bufferbloat is "better queueing"... and what that
translates out to is some form of fair queuing - at the moment I'm
enthralled with QFQ, btw - coupled with some form of active queue
management that works. (RED used to work but was rather flawed - it's
still better than the alternative of drop tail.)
It doesn't necessarily translate out to more unmanaged dumb queues; it
may translate out to more *managed* queues.
I wouldn't mind TSO AT ALL if the hardware did some of the above
underneath it. I've heard some rumblings that that might happen. We
spent all that engineering time making TCP go fast and minimized the
hardware impact of that - why not spend a little more time - in the
next generation of hw/sw - making TCP work *better* on the network?
Cisco did that in the 90s, what's so hard about trying now, in
software and/or hardware?
Now look, this thread has got way off the original topic, which was on
BQL, and BQL mostly rocks. The MIAD (as opposed to AIMD) controller in
it bothers me, but that can be looked at harder later.
And I want to go back to making it work on top of my testbed, so I can
finally see models matching reality and vice versa, and keep working on
producing demonstrable results that can help fix some problems in
software now for end users, gateways, routers, servers, and data
centers... reducing latencies from trips around the moon to around your
living room...
...and get slammed into hardware someday, if there is ever market
demand for things like interactive video, gaming, or voice
applications that just work.
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 16:06 ` Dave Taht
@ 2011-11-29 16:41 ` Ben Hutchings
0 siblings, 0 replies; 26+ messages in thread
From: Ben Hutchings @ 2011-11-29 16:41 UTC (permalink / raw)
To: Dave Taht; +Cc: Eric Dumazet, Tom Herbert, davem, netdev
On Tue, 2011-11-29 at 17:06 +0100, Dave Taht wrote:
> On Tue, Nov 29, 2011 at 3:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mardi 29 novembre 2011 à 14:24 +0000, Ben Hutchings a écrit :
> >
> >> Not if you separate hardware queues by priority (and your high priority
> >> packets are non-TCP or PuSHed).
> >
> > I mostly have tg3 , bnx2 cards, mono queues...
> >
> > I presume Dave, working on small Wifi/ADSL routers have same kind of
> > hardware.
>
> Nothing but mono queues here on wired - 4 queues on wireless, however.
>
> and a focus on trying to make sure the
> 10Gig guys don't swamp the 128Kbit to 100Mbit guys, and everything in
> between that bandwidth range is what I care about, mostly against GigE
> servers...
I'm not objecting to that, just the assertion that TSO can be a problem
even on 10G hardware. In fact it makes a big improvement to CPU
efficiency (even if you do it in the driver, it can be better than GSO)
and almost all 10G hardware has multiple queues which can be used to
avoid the latency penalty.
> ( I'm still waiting on some 10Gig hw donations to arrive)
[...]
If you have a proposal to do interesting things with 10G hardware and
drivers then I can forward it for consideration here.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 2:32 [PATCH v4 0/10] bql: Byte Queue Limits Tom Herbert
2011-11-29 4:23 ` Dave Taht
@ 2011-11-29 16:46 ` Eric Dumazet
2011-11-29 17:47 ` David Miller
1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-11-29 16:46 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
Le lundi 28 novembre 2011 à 18:32 -0800, Tom Herbert a écrit :
[...]
I did successful tests with tg3 (I'll provide the patch for bnx2 shortly).
Some details probably can be polished, but I believe your v4 is ready
for inclusion.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Thanks!
* RE: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 16:24 ` Dave Taht
@ 2011-11-29 17:06 ` David Laight
0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2011-11-29 17:06 UTC (permalink / raw)
To: Dave Taht, Eric Dumazet; +Cc: John Fastabend, Tom Herbert, davem, netdev
...
> We spent all that engineering time making TCP go fast and minimized the
> hardware impact of that - why not spend a little more time - in the
> next generation of hw/sw - making TCP work *better* on the network?
One problem I've seen is that a lot of the 'make TCP go fast'
changes have been focused on bulk transfer over long(ish) latency
links - typical for ftp and http downloads.
Interactive (command+response) works moderately, but async
data requests suffer badly.
Typically these connections will have Nagle disabled (because
you can't stand the repeated timeouts), and may be between
very local systems so the RTT is effectively zero and packet
loss unexpected.
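For reference, disabling Nagle on such a connection is typically just a
socket option; a generic sketch (not tied to any particular application
discussed here):

/* Disable Nagle so small request/response messages go out immediately. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int disable_nagle(int sock_fd)
{
        int one = 1;

        return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                          &one, sizeof(one));
}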
Under these conditions the 'slow start' and 'delayed acks'
conspire against you.
What is more, if you have a high request rate there is little
that can be done to merge tx packets, even if the sender is
willing to let some data be queued until (say) the next 1ms
clock tick.
I have seen 30000 packets/sec on a single tcp connection!
(The sender doesn't know there is another message to send.)
The sender was a dual; running 'while :; do :; done'
reduced the packet count considerably!
David
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 7:02 ` Eric Dumazet
` (2 preceding siblings ...)
2011-11-29 14:24 ` Ben Hutchings
@ 2011-11-29 17:28 ` Rick Jones
3 siblings, 0 replies; 26+ messages in thread
From: Rick Jones @ 2011-11-29 17:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Dave Taht, Tom Herbert, davem, netdev
On 11/28/2011 11:02 PM, Eric Dumazet wrote:
> Le mardi 29 novembre 2011 à 05:23 +0100, Dave Taht a écrit :
>>> In this test 100 netperf TCP_STREAMs were started to saturate the link.
>>> A single instance of a netperf TCP_RR was run with high priority set.
>>> Queuing discipline in pfifo_fast, NIC is e1000 with TX ring size set to
>>> 1024. tps for the high priority RR is listed.
>>>
>>> No BQL, tso on: 3000-3200K bytes in queue: 36 tps
>>> BQL, tso on: 156-194K bytes in queue, 535 tps
>>
>>> No BQL, tso off: 453-454K bytes int queue, 234 tps
>>> BQL, tso off: 66K bytes in queue, 914 tps
>>
>>
>> Jeeze. Under what circumstances is tso a win? I've always
>> had great trouble with it, as some e1000 cards do it rather badly.
It is a win when one is sending bulk(ish) data and wishes to avoid the
trips up and down the protocol stack, to save CPU cycles.
TSO is sometimes called "poor man's Jumbo Frames" as it seeks to
achieve the same goal - fewer trips down the protocol stack per KB of
data transferred.
>> I assume these are while running at GigE speeds?
>>
>> What of 100Mbit? 10GigE? (I will duplicate your tests
>> at 100Mbit, but as for 10gigE...)
>>
>
> TSO on means a low priority 65Kbytes packet can be in TX ring right
> before the high priority packet. If you cant afford the delay, you lose.
>
> There is no mystery here.
>
> If you want low latencies :
> - TSO must be disabled so that packets are at most one ethernet frame.
> - You adjust BQL limit to small value
> - You even can lower MTU to get even more better latencies.
>
> If you want good throughput from your [10]GigE and low cpu cost, TSO
> should be enabled.
Outbound throughput. If you want good inbound throughput you want GRO/LRO.
> If you want to be smart, you could have a dynamic behavior :
>
> Let TSO on as long as no high priority low latency producer is running
> (if low latency packets are locally generated)
I'd probably leave that to the administrator rather than try to clutter
things with additional logic.
*If* I were to add additional logic, I might have an interface
communicate its "maximum TSO size" up the stack in a manner not too
dissimilar from MTU. That way one can control just how much time a
TSO'd segment would consume.
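A hypothetical sketch of what such a knob might look like (every name
below is invented for illustration; nothing like this exists in the series
being discussed):

/* Hypothetical per-device cap on TSO burst size, analogous to MTU. */
struct tso_cap_example {
        unsigned int tso_max_bytes;     /* admin-settable upper bound */
};

static unsigned int clamp_tso_bytes(const struct tso_cap_example *cap,
                                    unsigned int requested_bytes)
{
        /* The stack would build TSO super-packets no larger than the
         * device's advertised cap, bounding the wire time any single
         * burst can consume on that link.
         */
        if (requested_bytes > cap->tso_max_bytes)
                return cap->tso_max_bytes;
        return requested_bytes;
}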
rick jones
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 16:46 ` Eric Dumazet
@ 2011-11-29 17:47 ` David Miller
2011-11-29 18:31 ` Tom Herbert
0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2011-11-29 17:47 UTC (permalink / raw)
To: eric.dumazet; +Cc: therbert, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 29 Nov 2011 17:46:24 +0100
> I did sucessful tests with tg3 (I'll provide the patch for bnx2 shortly)
>
> Some details probably can be polished, but I believe your v4 is ready
> for inclusion.
>
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Agreed, all applied to net-next, thanks!
Tom, please keep an eye out for regression or suggestion reports.
Thanks again.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 17:47 ` David Miller
@ 2011-11-29 18:31 ` Tom Herbert
2011-12-01 16:50 ` Kirill Smelkov
0 siblings, 1 reply; 26+ messages in thread
From: Tom Herbert @ 2011-11-29 18:31 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Tue, Nov 29, 2011 at 9:47 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 29 Nov 2011 17:46:24 +0100
>
>> I did sucessful tests with tg3 (I'll provide the patch for bnx2 shortly)
>>
>> Some details probably can be polished, but I believe your v4 is ready
>> for inclusion.
>>
>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Agreed, all applied to net-next, thanks!
>
> Tom, please keep an eye out for regression or suggestion reports.
>
Will do. I am well aware of how invasive this is in the data path ;-)
I'll add a doc describing BQL also.
By the way, the way to disable BQL at runtime is 'echo max >
/sys/class/net/eth<n>/queues/tx-<m>/byte_queue_limits/limit_min'.
> Thanks again.
>
Thanks Dave, Eric, Dave Taht, and everyone for reviewing this.
Tom
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-11-29 18:31 ` Tom Herbert
@ 2011-12-01 16:50 ` Kirill Smelkov
2011-12-01 18:00 ` David Miller
0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2011-12-01 16:50 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, eric.dumazet, netdev
On Tue, Nov 29, 2011 at 10:31:03AM -0800, Tom Herbert wrote:
> On Tue, Nov 29, 2011 at 9:47 AM, David Miller <davem@davemloft.net> wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Tue, 29 Nov 2011 17:46:24 +0100
> >
> >> I did sucessful tests with tg3 (I'll provide the patch for bnx2 shortly)
> >>
> >> Some details probably can be polished, but I believe your v4 is ready
> >> for inclusion.
> >>
> >> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Agreed, all applied to net-next, thanks!
> >
> > Tom, please keep an eye out for regression or suggestion reports.
> >
>
> Will do. I am well aware of how invasive this is in the data path ;-)
> I'll add a doc describing BQL also.
>
> By the way, the way to disable BQL at runtime is the 'echo max >
> /sys/class/net/eth<n>/queues/tx-<m>/byte_queue_limits/limit_min
One "regression" is it is now not possible to disable BQL at compile time,
because CONFIG_BQL can't be set to "n" via usual ways.
Description and patch below. Thanks.
---- 8< ----
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Thu, 1 Dec 2011 20:36:06 +0400
Subject: [PATCH] net: Allow users to set CONFIG_BQL=n
Commit 114cf580 (bql: Byte queue limits) added a new config option without
a description, which means neither `make oldconfig` nor `make menuconfig`
asks for it -- the option is simply set to the default y automatically.
Make the option actually configurable by adding a stub description.
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
net/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/Kconfig b/net/Kconfig
index 2d99873..c120631 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -240,7 +240,7 @@ config NETPRIO_CGROUP
a per-interface basis
config BQL
- boolean
+ boolean "Byte Queue Limits"
depends on SYSFS
select DQL
default y
--
1.7.8.rc4.327.g599a2
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-12-01 16:50 ` Kirill Smelkov
@ 2011-12-01 18:00 ` David Miller
2011-12-02 11:22 ` Kirill Smelkov
0 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2011-12-01 18:00 UTC (permalink / raw)
To: kirr; +Cc: therbert, eric.dumazet, netdev
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Thu, 1 Dec 2011 20:50:18 +0400
> One "regression" is it is now not possible to disable BQL at compile time,
> because CONFIG_BQL can't be set to "n" via usual ways.
>
> Description and patch below. Thanks.
It's intentional, and your patch will not be applied.
The Kconfig entry and option is merely for expressing internal dependencies,
not for providing a way to disable the code.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-12-01 18:00 ` David Miller
@ 2011-12-02 11:22 ` Kirill Smelkov
2011-12-02 11:57 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2011-12-02 11:22 UTC (permalink / raw)
To: David Miller; +Cc: therbert, eric.dumazet, netdev
On Thu, Dec 01, 2011 at 01:00:45PM -0500, David Miller wrote:
> From: Kirill Smelkov <kirr@mns.spb.ru>
> Date: Thu, 1 Dec 2011 20:50:18 +0400
>
> > One "regression" is it is now not possible to disable BQL at compile time,
> > because CONFIG_BQL can't be set to "n" via usual ways.
> >
> > Description and patch below. Thanks.
>
> It's intentional, and your patch will not be applied.
>
> The Kconfig entry and option is merely for expressing internal dependencies,
> not for providing a way to disable the code.
I'm maybe wrong somewhere - sorry then, but why is there e.g.
static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
unsigned int bytes)
{
+#ifdef CONFIG_BQL
+ dql_queued(&dev_queue->dql, bytes);
+ if (unlikely(dql_avail(&dev_queue->dql) < 0)) {
+ set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
+ if (unlikely(dql_avail(&dev_queue->dql) >= 0))
+ clear_bit(__QUEUE_STATE_STACK_XOFF,
+ &dev_queue->state);
+ }
+#endif
}
and that netdev_tx_sent_queue() is called on every xmit.
I wanted to save cycles on my small/slow hardware and compile BQL out in
case I know the system is not going to use it. I'm a netdev newcomer, so
sorry if I ask naive questions, but what's wrong with it?
Thanks,
Kirill
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-12-02 11:22 ` Kirill Smelkov
@ 2011-12-02 11:57 ` Eric Dumazet
2011-12-02 12:26 ` Kirill Smelkov
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2011-12-02 11:57 UTC (permalink / raw)
To: Kirill Smelkov; +Cc: David Miller, therbert, netdev
Le vendredi 02 décembre 2011 à 15:22 +0400, Kirill Smelkov a écrit :
> On Thu, Dec 01, 2011 at 01:00:45PM -0500, David Miller wrote:
> > From: Kirill Smelkov <kirr@mns.spb.ru>
> > Date: Thu, 1 Dec 2011 20:50:18 +0400
> >
> > > One "regression" is it is now not possible to disable BQL at compile time,
> > > because CONFIG_BQL can't be set to "n" via usual ways.
> > >
> > > Description and patch below. Thanks.
> >
> > It's intentional, and your patch will not be applied.
> >
> > The Kconfig entry and option is merely for expressing internal dependencies,
> > not for providing a way to disable the code.
>
> I'm maybe wrong somewhere - sorry then, but why there is e.g.
>
> static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
> unsigned int bytes)
> {
> +#ifdef CONFIG_BQL
> + dql_queued(&dev_queue->dql, bytes);
> + if (unlikely(dql_avail(&dev_queue->dql) < 0)) {
> + set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
> + if (unlikely(dql_avail(&dev_queue->dql) >= 0))
> + clear_bit(__QUEUE_STATE_STACK_XOFF,
> + &dev_queue->state);
> + }
> +#endif
> }
>
>
> and that netdev_tx_sent_queue() is called on every xmit.
>
> I wanted to save cycles on my small/slow hardware and compile BQL out in
> case I know the system is not going to use it. I'm netdev newcomer, so
> sorry if I ask naive questions, but what's wrong it?
>
It's something like 4 or 5 instructions (granted your NIC is BQL
enabled). Really nothing compared to the thousands of instructions per packet
spent in the stack and driver.
And BQL will benefit you especially on your small/slow hardware, even if you
don't believe so or know it yet.
BQL is probably the major new netdev functionality of the year.
Of course, you are free to patch your own kernel if you desperately need
to save a few cycles per packet.
* Re: [PATCH v4 0/10] bql: Byte Queue Limits
2011-12-02 11:57 ` Eric Dumazet
@ 2011-12-02 12:26 ` Kirill Smelkov
0 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2011-12-02 12:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, therbert, netdev
On Fri, Dec 02, 2011 at 12:57:51PM +0100, Eric Dumazet wrote:
> Le vendredi 02 décembre 2011 à 15:22 +0400, Kirill Smelkov a écrit :
> > On Thu, Dec 01, 2011 at 01:00:45PM -0500, David Miller wrote:
> > > From: Kirill Smelkov <kirr@mns.spb.ru>
> > > Date: Thu, 1 Dec 2011 20:50:18 +0400
> > >
> > > > One "regression" is it is now not possible to disable BQL at compile time,
> > > > because CONFIG_BQL can't be set to "n" via usual ways.
> > > >
> > > > Description and patch below. Thanks.
> > >
> > > It's intentional, and your patch will not be applied.
> > >
> > > The Kconfig entry and option is merely for expressing internal dependencies,
> > > not for providing a way to disable the code.
> >
> > I'm maybe wrong somewhere - sorry then, but why there is e.g.
> >
> > static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
> > unsigned int bytes)
> > {
> > +#ifdef CONFIG_BQL
> > + dql_queued(&dev_queue->dql, bytes);
> > + if (unlikely(dql_avail(&dev_queue->dql) < 0)) {
> > + set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
> > + if (unlikely(dql_avail(&dev_queue->dql) >= 0))
> > + clear_bit(__QUEUE_STATE_STACK_XOFF,
> > + &dev_queue->state);
> > + }
> > +#endif
> > }
> >
> >
> > and that netdev_tx_sent_queue() is called on every xmit.
> >
> > I wanted to save cycles on my small/slow hardware and compile BQL out in
> > case I know the system is not going to use it. I'm netdev newcomer, so
> > sorry if I ask naive questions, but what's wrong it?
> >
>
> Its something like 4 or 5 instructions (granted your NIC is BQL
> enabled). Really nothing compared to thousand of instructions per packet
> spent in the stack and driver.
>
> And BQL will benefit especially on your small/slow hardware, even if you
> dont believe so or know yet.
>
> BQL is probably the major new netdev functionality of the year.
>
> Of course, you are free to patch your own kernel if you desperatly need
> to save a few cycles per packet.
Thanks for describing this, Eric. I'll try BQL and see how it goes, as you
suggest. Please note that the confusion here could have been avoided if there
were a tiny bit of documentation.
Thanks again,
Kirill