* [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
@ 2014-09-30 8:53 Jesper Dangaard Brouer
2014-09-30 11:07 ` Jamal Hadi Salim
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-30 8:53 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann
Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Based on DaveM's recent API work on dev_hard_start_xmit(), which allows
sending/processing an entire skb list.
This patch implements qdisc bulk dequeue by allowing multiple packets
to be dequeued in dequeue_skb().
The optimization principle for this is twofold: (1) amortize locking
cost and (2) avoid the expensive tailptr update for notifying HW.
(1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packets. The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
(2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay the HW tailptr update, which also reduces the cost per
packet.
One restriction of the new API is that every SKB must belong to the
same TXQ. This patch takes the easy way out by restricting bulk
dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
the qdisc has only a single TXQ attached.
Some detail about the flow: dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()). In case the driver stops midway through the list, the
remaining skb list is returned by dev_hard_start_xmit(). In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
To avoid overshooting the HW limits, which results in requeuing, the
patch limits the number of bytes dequeued, based on the driver's BQL
limits. In effect, bulking will only happen for BQL-enabled drivers.
Besides the byte limit from BQL, bulking is also limited to a maximum
of 7 packets, to avoid any issues with available descriptors in HW and
the HoL issues measured at 100Mbit/s.
For now, as a conservative approach, don't bulk dequeue GSO and
segmented GSO packets, because testing showed requeuing occurring
with segmented GSO packets.
Joint work with Hannes, Daniel and Florian.
Testing:
--------
Demonstrating the performance improvement of qdisc dequeue bulking is
tricky, because the effect only "kicks in" once the qdisc system has a
backlog. Thus, for a backlog to form, we need either 1) to exceed the
wirespeed of the link or 2) to exceed the capability of the device driver.
For practical use-cases, the measurable effect of this will be a
reduction in CPU usage.
01-TCP_STREAM:
--------------
Testing the effect for TCP involves disabling TSO and GSO, because TCP
already benefits from bulking via TSO, and especially via GSO segmented
packets. This patch views TSO/GSO as a separate kind of bulking, and
avoids further bulking of these packet types.
The measured perf diff benefit (at 10Gbit/s) for TCP_STREAM was 4.66%
less CPU used on calls to _raw_spin_lock() (mostly from sch_direct_xmit).
Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
* Disabled bulking: 8.30% soft 88.75% idle
* Enabled bulking: 7.80% soft 89.36% idle
02-UDP_STREAM
-------------
The measured perf diff benefit for UDP_STREAM was 5.26% less CPU used
on calls to _raw_spin_lock(). The test used UDP_STREAM with packet
size -m 1472 (to avoid sending UDP/IP fragments).
03-trafgen driver test
----------------------
The performance of the 10Gbit/s ixgbe driver is limited due to
updating the HW ring-queue tail-pointer on every packet, as
previously demonstrated with pktgen.
Using trafgen to send RAW frames from userspace (via AF_PACKET),
forcing them through the qdisc path (with options --qdisc-path and -t0),
and sending with 12 CPUs, I can demonstrate this driver-layer limitation:
* 12.8 Mpps with no qdisc bulking
* 14.8 Mpps with qdisc bulking (full 10G-wirespeed)
04-netperf-wrapper
------------------
I've designed some tests for the tool "netperf-wrapper" that try to
measure the Head-of-Line (HoL) blocking effect.
The bulking "budget" limit of 7 packets was chosen based on testing
with netperf-wrapper. With the igb driver at 100Mbit/s and a budget of
max 7 packets, I cannot measure any added HoL latency. Increasing this
value to 8 packets, I observed a fluctuating added delay of 0.24ms,
corresponding to 3000 bytes at 100Mbit/s.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
V5:
- Changed "budget" limit to 7 packets (from 8)
- Choosing to use skb->len, because BQL uses that
- Added testing results to patch desc
V4:
- Patch rewritten in the Red Hat Neuchatel office; joint work with
Hannes, Daniel and Florian.
- Conservative approach of only enabling it on BQL enabled drivers
- No user tunable parameter, but limit bulking to 8 packets.
- For now, avoid bulking GSO packets.
V3:
- Correct use of BQL
- Some minor adjustments based on feedback.
- Default setting only bulk dequeue 1 extra packet (2 packets).
V2:
- Restructured functions, split out functionality
- Use BQL bytelimit to avoid overshooting driver limits
include/net/sch_generic.h | 16 +++++++++++++++
net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e65b8e0..0654dfa 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -6,6 +6,7 @@
#include <linux/rcupdate.h>
#include <linux/pkt_sched.h>
#include <linux/pkt_cls.h>
+#include <linux/dynamic_queue_limits.h>
#include <net/gen_stats.h>
#include <net/rtnetlink.h>
@@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
qdisc->__state &= ~__QDISC___STATE_RUNNING;
}
+static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
+{
+ return qdisc->flags & TCQ_F_ONETXQUEUE;
+}
+
+static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
+{
+#ifdef CONFIG_BQL
+ /* Non-BQL migrated drivers will return 0, too. */
+ return dql_avail(&txq->dql);
+#else
+ return 0;
+#endif
+}
+
static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
{
return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 11b28f6..0fa8f0e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
return 0;
}
+static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
+ struct sk_buff *head_skb,
+ int bytelimit)
+{
+ struct sk_buff *skb, *tail_skb = head_skb;
+ int budget = 7; /* For now, conservatively chosen limit */
+
+ while (bytelimit > 0 && --budget > 0) {
+ /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
+ if (tail_skb->next || skb_is_gso(tail_skb))
+ break;
+
+ skb = q->dequeue(q);
+ if (!skb)
+ break;
+
+ bytelimit -= skb->len; /* covers GSO len */
+ skb = validate_xmit_skb(skb, qdisc_dev(q));
+ if (!skb)
+ break;
+
+ /* "skb" can be a skb list after validate call above
+ * (GSO segmented), but it is okay to append it to
+ * current tail_skb->next, because next round will exit
+ * in-case "tail_skb->next" is a skb list.
+ */
+ tail_skb->next = skb;
+ tail_skb = skb;
+ }
+
+ return head_skb;
+}
+
+/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
+ * A requeued skb (via q->gso_skb) can also be a SKB list.
+ */
static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
struct sk_buff *skb = q->gso_skb;
@@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
} else
skb = NULL;
} else {
- if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
+ if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+ !netif_xmit_frozen_or_stopped(txq)) {
+ int bytelimit = qdisc_avail_bulklimit(txq);
+
skb = q->dequeue(q);
- if (skb)
+ if (skb) {
+ bytelimit -= skb->len;
skb = validate_xmit_skb(skb, qdisc_dev(q));
+ }
+ if (skb && qdisc_may_bulk(q))
+ skb = try_bulk_dequeue_skb(q, skb, bytelimit);
}
}
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 8:53 [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
@ 2014-09-30 11:07 ` Jamal Hadi Salim
2014-09-30 18:20 ` David Miller
2014-09-30 11:48 ` Eric Dumazet
2014-09-30 12:34 ` Eric Dumazet
2 siblings, 1 reply; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-09-30 11:07 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann
Cc: Alexander Duyck, John Fastabend, Dave Taht, toke
On 09/30/14 04:53, Jesper Dangaard Brouer wrote:
[..]
> To avoid overshooting the HW limits, which results in requeuing, the
> patch limits the number of bytes dequeued, based on the driver's BQL
> limits. In effect, bulking will only happen for BQL-enabled drivers.
> Besides the byte limit from BQL, bulking is also limited to a maximum
> of 7 packets, to avoid any issues with available descriptors in HW and
> the HoL issues measured at 100Mbit/s.
>
[..]
>
> The measured perf diff benefit (at 10Gbit/s) for TCP_STREAM was 4.66%
> less CPU used on calls to _raw_spin_lock() (mostly from sch_direct_xmit).
>
> Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
> * Disabled bulking: 8.30% soft 88.75% idle
> * Enabled bulking: 7.80% soft 89.36% idle
>
I know you have put a lot of hard work here and i hate to do
this to you, but:
The base test case is surely to *not allow* the bulk code to be
executed at all, i.e. when you say "disabled bulking" it
should mean not calling qdisc_may_bulk() or qdisc_avail_bulklimit(),
because that code was not there initially. My gut feeling is you
will find that your numbers for "Disabled bulking" are a lot lower
than you show.
This is because no congestion likely happens with 24x TCP_STREAM
running at 10G on a modern-day cpu, and therefore you will end up
consuming more cpu.
Note, there are benefits as you have shown - but i would not
consider those to be standard use cases (actually one which would
have shown a clear win is the VM thing Rusty was after). For this reason
my view is that i should be able to disable bulking via ifdef (yes, I
know DaveM hates ifdefs ;->).
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 11:07 ` Jamal Hadi Salim
@ 2014-09-30 18:20 ` David Miller
2014-10-01 13:17 ` Jamal Hadi Salim
0 siblings, 1 reply; 38+ messages in thread
From: David Miller @ 2014-09-30 18:20 UTC (permalink / raw)
To: jhs
Cc: brouer, netdev, therbert, eric.dumazet, hannes, fw, dborkman,
alexander.duyck, john.r.fastabend, dave.taht, toke
From: Jamal Hadi Salim <jhs@mojatatu.com>
Date: Tue, 30 Sep 2014 07:07:37 -0400
> Note, there are benefits as you have shown - but i would not
> consider those to be standard use cases (actually one which would
> have shown a clear win is the VM thing Rusty was after).
I completely disagree, you will see at least decreased cpu utilization
for a very common case, bulk single stream transfers.
> For this reason my view is that i should be able to disable via
> ifdef bulking (yes, I know DaveM hates ifdefs ;->).
Vehemently _disagree_. :)
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 18:20 ` David Miller
@ 2014-10-01 13:17 ` Jamal Hadi Salim
2014-10-01 14:55 ` Tom Herbert
0 siblings, 1 reply; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-01 13:17 UTC (permalink / raw)
To: David Miller
Cc: brouer, netdev, therbert, eric.dumazet, hannes, fw, dborkman,
alexander.duyck, john.r.fastabend, dave.taht, toke
On 09/30/14 14:20, David Miller wrote:
> From: Jamal Hadi Salim <jhs@mojatatu.com>
> Date: Tue, 30 Sep 2014 07:07:37 -0400
>
>> Note, there are benefits as you have shown - but i would not
>> consider those to be standard use cases (actually one which would
>> have shown a clear win is the VM thing Rusty was after).
>
> I completely disagree, you will see at least decreased cpu utilization
> for a very common case, bulk single stream transfers.
>
So let's say the common use case is:
= modern day cpu (pick some random cpu)
= 1-10 Gbps ethernet (not 100mbps)
= 1-24 tcp or udp bulk (you said one, Jesper had 24 which sounds better)
Run with test cases:
a) unchanged (no bulking code added at all)
vs
b) bulking code added and used
vs
c) bulking code added and *not* used
Jesper's results are comparing #b and #c.
And if #b + #c are slightly worse or equal then we have a win;->
Again, I do believe things like traffic generators or the VM io
or something like tuntap that crosses user space will have a clear
benefit (but are those common use cases?).
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 13:17 ` Jamal Hadi Salim
@ 2014-10-01 14:55 ` Tom Herbert
2014-10-01 15:34 ` Jamal Hadi Salim
0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2014-10-01 14:55 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David Miller, Jesper Dangaard Brouer, Linux Netdev List,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Wed, Oct 1, 2014 at 6:17 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 09/30/14 14:20, David Miller wrote:
>>
>> From: Jamal Hadi Salim <jhs@mojatatu.com>
>> Date: Tue, 30 Sep 2014 07:07:37 -0400
>>
>>> Note, there are benefits as you have shown - but i would not
>>> consider those to be standard use cases (actually one which would
>>> have shown a clear win is the VM thing Rusty was after).
>>
>>
>> I completely disagree, you will see at least decreased cpu utilization
>> for a very common case, bulk single stream transfers.
>>
>
>
> So let's say the common use case is:
> = modern day cpu (pick some random cpu)
> = 1-10 Gbps ethernet (not 100mbps)
> = 1-24 tcp or udp bulk (you said one, Jesper had 24 which sounds better)
>
> Run with test cases:
> a) unchanged (no bulking code added at all)
> vs
> b) bulking code added and used
> vs
> c) bulking code added and *not* used
>
> Jesper's results are comparing #b and #c.
>
> And if #b + #c are slightly worse or equal then we have a win;->
>
> Again, I do believe things like traffic generators or the VM io
> or something like tuntap that crosses user space will have a clear
> benefit (but are those common use cases?).
>
You're making this much more complicated than it actually is. The
algorithm is simple -- the queue wakes up, finds out exactly how many
bytes to dequeue, and performs the dequeue of enough packets under one
lock. There should be a benefit when transmitting at a high rate, as we
know that reducing locking is generally a win. It needs to be tested,
but this should be independent of whether the application is a VM, in
userspace or kernel, packetgen, etc.
Tom
> cheers,
> jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 14:55 ` Tom Herbert
@ 2014-10-01 15:34 ` Jamal Hadi Salim
2014-10-01 17:28 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-01 15:34 UTC (permalink / raw)
To: Tom Herbert
Cc: David Miller, Jesper Dangaard Brouer, Linux Netdev List,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On 10/01/14 10:55, Tom Herbert wrote:
> On Wed, Oct 1, 2014 at 6:17 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On 09/30/14 14:20, David Miller wrote:
>>>
>>> From: Jamal Hadi Salim <jhs@mojatatu.com>
>>> Date: Tue, 30 Sep 2014 07:07:37 -0400
>>>
>>>> Note, there are benefits as you have shown - but i would not
>>>> consider those to be standard use cases (actually one which would
>>>> have shown a clear win is the VM thing Rusty was after).
>>>
>>>
>>> I completely disagree, you will see at least decreased cpu utilization
>>> for a very common case, bulk single stream transfers.
>>>
>>
>>
>> So let's say the common use case is:
>> = modern day cpu (pick some random cpu)
>> = 1-10 Gbps ethernet (not 100mbps)
>> = 1-24 tcp or udp bulk (you said one, Jesper had 24 which sounds better)
>>
>> Run with test cases:
>> a) unchanged (no bulking code added at all)
>> vs
>> b) bulking code added and used
>> vs
>> c) bulking code added and *not* used
>>
>> Jesper's results are comparing #b and #c.
>>
>> And if #b + #c are slightly worse or equal then we have a win;->
>>
BTW: meant to say if #b and #c are slightly worse than #a then we have
a win.
>> Again, I do believe things like traffic generators or the VM io
>> or something like tuntap that crosses user space will have a clear
>> benefit (but are those common use cases?).
>>
> You're making this much more complicated than it actually is. The
> algorithm is simple -- the queue wakes up, finds out exactly how many
> bytes to dequeue, and performs the dequeue of enough packets under one lock.
It is not about bql.
The issue is: if i am going to attempt to do a bulk transfer
every single time (with new code), and for the common use case the
result is "no need to do bulking", then you have just added extra code
that is unnecessary for that common case.
Even a single extra if statement at a high packet rate is still
costly and would be easy to observe.
> There should be a benefit when transmitting at a high rate, as we
> know that reducing locking is generally a win.
You mean amortizing the cost of the lock, not removing a lock?
Yes, of course. That is, if the added code ends up being hit
meaningfully. Jesper said (and it was my experience as well)
that it was _hard_ to achieve bulking in such a case.
The fear here is that in the common case (if we say the
bulk transfer is a common case), if in fact that code ends up running
per-packet as opposed to over a burst of packets, then there is
no win.
The tests should clarify, no?
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 15:34 ` Jamal Hadi Salim
@ 2014-10-01 17:28 ` Jesper Dangaard Brouer
2014-10-01 18:55 ` Jamal Hadi Salim
2014-10-01 19:47 ` David Miller
0 siblings, 2 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-01 17:28 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen, brouer
On Wed, 01 Oct 2014 11:34:55 -0400 Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> The issue is: if i am going to attempt to do a bulk transfer
> every single time (with new code), and for the common use case the
> result is "no need to do bulking", then you have just added extra code
> that is unnecessary for that common case.
> Even a single extra if statement at a high packet rate is still
> costly and would be easy to observe.
[...]
>
> That is, if the added code ends up being hit
> meaningfully. Jesper said (and it was my experience as well)
> that it was _hard_ to achieve bulking in such a case.
Let me close this one...
The bulk dequeue code *is* being hit meaningfully. It is _only_ going
to be activated when (q->qlen > 0), see __dev_xmit_skb() in dev.c.
When there is no queue, pfifo_fast is allowed to directly call
sch_direct_xmit() without an enqueue/dequeue cycle. Thus, this code
would not be hit for every packet.
Thus, the code is activated only when q->qlen is >= 1. And I have already
shown that we see a win with just bulking 2 packets:
"Subject: qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue"
http://thread.gmane.org/gmane.linux.network/331152/focus=331154
I actually believe that I have already measured and shown that going
through the bulk dequeue is a win; this should show direct_xmit vs.
bulking:
"Subject: qdisc/trafgen: Measuring effect of qdisc bulk dequeue, with trafgen"
http://thread.gmane.org/gmane.linux.network/331152
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 17:28 ` Jesper Dangaard Brouer
@ 2014-10-01 18:55 ` Jamal Hadi Salim
2014-10-01 19:47 ` Jesper Dangaard Brouer
2014-10-01 19:47 ` David Miller
1 sibling, 1 reply; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-01 18:55 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On 10/01/14 13:28, Jesper Dangaard Brouer wrote:
> Thus, the code is activated only when q->qlen is >= 1. And I have already
> shown that we see a win with just bulking 2 packets:
If you can get 2 packets, indeed you win. If you can on average get >1
over a long period, you still win.
You have clearly demonstrated you can do that with traffic
generators (udp or in-kernel pktgen). I was more worried about the
common use case scenario (handwaved as 1-24 TCP streams).
The key here is: *if you never hit bulking* then the cost is
_per packet_ for the sch_direct_xmit bypass.
The question is: what is that cost for the common case as defined above?
Can you hit a bulk level >1 on 1-24 TCP streams?
I would be happy if your answer is *yes*. If your answer is no (since
it is hard to achieve) - then how far off is it from before your
patches (since you have now added at minimum a branch check)?
I think it is fair for you to quantify that, no?
Feature is still useful for the other cases.
Note:
This is what i referred to as the "no animals were hurt during the
making of these patches" statement. I am sorry again for raining on
the parade.
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 18:55 ` Jamal Hadi Salim
@ 2014-10-01 19:47 ` Jesper Dangaard Brouer
2014-10-01 20:05 ` Jamal Hadi Salim
0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-01 19:47 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen, brouer
On Wed, 01 Oct 2014 14:55:09 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 10/01/14 13:28, Jesper Dangaard Brouer wrote:
>
> > Thus, code is activated only when q->qlen is >= 1. And I have already
> > shown that we see a win with just bulking 2 packets:
>
> If you can get 2 packets, indeed you win. If you can on average get >1
> over a long period, you still win.
> You have clearly demonstrated you can do that with traffic
> generators (udp or in kernel pktgen). I was more worried about the
> common use case scenario (handwaved as 1-24 TCP streams).
>
> The key here is: *if you never hit bulking* then the cost is
> _per packet_ for sch_direct_xmit bypass.
> Question is what is that cost for the common case as defined above?
> Can you hit a bulk level >1 on 1-24 TCP streams?
> I would be happy if your answer is *yes*.
Answer is yes. It is very easy with a simple netperf TCP_STREAM to cause
queueing of >1 packet in the qdisc layer. If tuned (according to my blog,
unloading netfilter etc.), then a single netperf TCP_STREAM will max out
10Gbit/s and cause a standing queue.
I'm monitoring the backlog of qdiscs, and I always see >1 backlog; I never
saw a standing queue of 1 packet in my testing. Either the backlog
is high, 100-200 packets, or it is 0. (With fake pktgen/trafgen-style
tests, it's possible to cause a 1000-packet backlog.)
> If your answer is no (since
> it is hard to achieve) - then how far off is it from before your
> patches (since now you have added at minimal a branch check).
> I think it is fair for you to quantify that, no?
> Feature is still useful for the other cases.
>
> Note:
> This is what i referred to as the "no animals were hurt during the
> making of these patches" statement. I am sorry again for raining on
> the parade.
I'm starting to think this patch is the most carefully tested patch on
netdev for a very very long time... I'm getting tired of this testing
going on for so long.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 19:47 ` Jesper Dangaard Brouer
@ 2014-10-01 20:05 ` Jamal Hadi Salim
2014-10-01 20:32 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-01 20:05 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On 10/01/14 15:47, Jesper Dangaard Brouer wrote:
>
> Answer is yes. It is very easy with simple netperf TCP_STREAM to cause
> queueing >1 packet in the qdisc layer.
If that is the case, I withdraw any doubts i had.
Can you please specify this in your commit logs for patch 0?
> If tuned (according to my blog,
> unloading netfilter etc.) then a single netperf TCP_STREAM will max out
> 10Gbit/s and cause a standing queue.
>
You should describe such tuning in the patch log (it is hard to read
blogs for more than 30 seconds; write a paper if you want to provide
more details).
> I'm monitoring the backlog of qdiscs, and I always see >1 backlog; I never
> saw a standing queue of 1 packet in my testing. Either the backlog
> is high, 100-200 packets, or it is 0. (With fake pktgen/trafgen-style
> tests, it's possible to cause a 1000-packet backlog.)
It would be nice to actually collect such stats. Monitoring the backlog
via dumping qdisc stats is a good start - but actually keeping traces
of average bulk size is more useful.
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 20:05 ` Jamal Hadi Salim
@ 2014-10-01 20:32 ` Jesper Dangaard Brouer
2014-10-02 5:18 ` Dave Taht
2014-10-02 12:53 ` Jamal Hadi Salim
0 siblings, 2 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-01 20:32 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen, brouer
On Wed, 01 Oct 2014 16:05:31 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 10/01/14 15:47, Jesper Dangaard Brouer wrote:
>
> >
> > Answer is yes. It is very easy with simple netperf TCP_STREAM to cause
> > queueing >1 packet in the qdisc layer.
>
> If that is the case, I withdraw any doubts i had.
> Can you please specify this in your commit logs for patch 0?
I'll try to make it more explicit.
Will resubmit the patchset shortly...
Notice it is not difficult to cause a queue to form, but it is tricky (not
difficult) to correctly test this patchset. Perhaps you misread my
earlier statement as "it was difficult to test and cause a queue to form"?
> > If tuned (according to my blog,
> > unloading netfilter etc.) then a single netperf TCP_STREAM will max out
> > 10Gbit/s and cause a standing queue.
> >
>
> You should describe such tuning in the patch log (hard to read
> blogs for more than 30 seconds; write a paper if you want to provide
> more details).
I think you could read this blog in 30 sec:
http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
My cover letter and testing section... will take you longer than 30
sec; it has grown quite large (and Eric will not even read it :-P ;-))
Believe it or not, I've actually restricted and reduced the testing
section. If you want the whole verbose version of my testing for the
upcoming V6 patch, look at this:
http://people.netfilter.org/hawk/qdisc/measure12_internal_V6_patch/
http://people.netfilter.org/hawk/qdisc/measure13_V6_patch_NObulk/
And use netperf-wrapper to dive into the data.
A quick setup guide:
http://netoptimizer.blogspot.dk/2014/09/mini-tutorial-for-netperf-wrapper-setup.html
> > I'm monitoring the backlog of qdiscs, and I always see >1 backlog; I never
> > saw a standing queue of 1 packet in my testing. Either the backlog
> > is high, 100-200 packets, or it is 0. (With fake pktgen/trafgen-style
> > tests, it's possible to cause a 1000-packet backlog.)
>
> It would be nice to actually collect such stats. Monitoring the backlog
> via dumping qdisc stats is a good start - but actually keeping traces
> of average bulk size is more useful.
I usually also monitor the BQL limits during these tests.
grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}
To Toke:
Perhaps we could convince Toke to add a netperf-wrapper recorder for
the BQL inflight and limit? (It would be really cool to plot them together)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 20:32 ` Jesper Dangaard Brouer
@ 2014-10-02 5:18 ` Dave Taht
2014-10-02 7:44 ` Jesper Dangaard Brouer
2014-10-02 13:02 ` Jamal Hadi Salim
2014-10-02 12:53 ` Jamal Hadi Salim
1 sibling, 2 replies; 38+ messages in thread
From: Dave Taht @ 2014-10-02 5:18 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Jamal Hadi Salim, Tom Herbert, David Miller, Linux Netdev List,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann, Alexander Duyck, John Fastabend,
Toke Høiland-Jørgensen
>> > I'm monitoring the backlog of qdiscs, and I always see >1 backlog; I never
>> > saw a standing queue of 1 packet in my testing. Either the backlog
>> > is high, 100-200 packets, or it is 0. (With fake pktgen/trafgen-style
>> > tests, it's possible to cause a 1000-packet backlog.)
>>
>> It would be nice to actually collect such stats. Monitoring the backlog
>> via dumping qdisc stats is a good start - but actually keeping traces
>> of average bulk size is more useful.
I am not huge on averages. Network theorists tend to think about things in
terms of fluid models. Van Jacobson's analogy of a water fountain's
operation is very profound...
While it is nearly impossible for a conventional von Neumann time-sliced
CPU + network to actually act that way, things like BQL and dedicated
pipelining systems like those in DPDK are getting closer to that ideal.
An example of where averages let you down is in the classic 5-minute
data reduction that things like mrtg do, where you might see ~60% of the
bandwidth (capacity/5 minutes) in use, yet still see drops because
over shorter intervals (capacity/10ms) you have bursts arriving.
Far better to sample queue occupancy at the highest rate you can
manage without heisenbugging the results and look at the detailed
curves.
> I usually also monitor the BQL limits during these tests.
>
> grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}
>
> To Toke:
> Perhaps we could convince Toke, to add a netperf-wrapper recorder for
> the BQL inflight and limit? (It would be really cool to plot together)
I just added a command line mode and support for timestamped limit and
inflight output to bqlmon
Get it here:
https://github.com/dtaht/bqlmon
That should be sufficient for netperf-wrapper to poll it efficiently.
As to how to graph it,
I suppose finding the max(limit) across the entire sample set and
scaling inflight appropriately would be good.
What test(s) would be best to combine it with? rrul? tcp_square_wave?
--
Dave Täht
https://www.bufferbloat.net/projects/make-wifi-fast
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-02 5:18 ` Dave Taht
@ 2014-10-02 7:44 ` Jesper Dangaard Brouer
2014-10-02 10:08 ` Toke Høiland-Jørgensen
2014-10-02 13:02 ` Jamal Hadi Salim
1 sibling, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-02 7:44 UTC (permalink / raw)
To: Dave Taht
Cc: Jamal Hadi Salim, Tom Herbert, David Miller, Linux Netdev List,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann, Alexander Duyck, John Fastabend,
Toke Høiland-Jørgensen, brouer, Florian Fainelli
On Wed, 1 Oct 2014 22:18:05 -0700 Dave Taht <dave.taht@gmail.com> wrote:
> > I usually also monitor the BQL limits during these tests.
> >
> > grep -H . /sys/class/net/eth4/queues/tx-*/byte_queue_limits/{inflight,limit}
> >
> > To Toke:
> > Perhaps we could convince Toke, to add a netperf-wrapper recorder for
> > the BQL inflight and limit? (It would be really cool to plot together)
>
> I just added a command line mode and support for timestamped limit and
> inflight output to bqlmon
>
> Get it here:
>
> https://github.com/dtaht/bqlmon
>
> That should be sufficient for netperf-wrapper to poll it efficiently.
https://github.com/dtaht/bqlmon/commit/ace44b55e2e521af88ba9927bf30fe615024073d#diff-aacbd6867cca49695bfe76ad2bcd45d1R407
Now that we are in control of the output format, I would output
something that is easier to parse for netperf-wrapper, e.g. some
JSON-compliant string (it is future extendable and easy to parse for Toke).
Besides the output format, I like your patch to bqlmon.
Perhaps Toke has an opinion on the output formatting from bqlmon?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-02 7:44 ` Jesper Dangaard Brouer
@ 2014-10-02 10:08 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 38+ messages in thread
From: Toke Høiland-Jørgensen @ 2014-10-02 10:08 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Dave Taht, Jamal Hadi Salim, Tom Herbert, David Miller,
Linux Netdev List, Eric Dumazet, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Alexander Duyck,
John Fastabend, Florian Fainelli
Jesper Dangaard Brouer <brouer@redhat.com> writes:
> Perhaps Toke has an opinion on the output formatting from bqlmon?
Anything that outputs one record per line and is parsable by regular
expression is fine with me. I don't currently have a runner that
captures json output, so doing that would actually be more work for me
to support presently :)
-Toke
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-02 5:18 ` Dave Taht
2014-10-02 7:44 ` Jesper Dangaard Brouer
@ 2014-10-02 13:02 ` Jamal Hadi Salim
1 sibling, 0 replies; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-02 13:02 UTC (permalink / raw)
To: Dave Taht, Jesper Dangaard Brouer
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Toke Høiland-Jørgensen
On 10/02/14 01:18, Dave Taht wrote:
> I am not huge on averages. Network theorists tend to think about things in
> terms of fluid models. Van Jacobson's analogy of a water fountain's
> operation is very profound...
>
> While it is nearly impossible for a conventional Van Neuman time sliced
> CPU + network to actually act that way, things like BQL and dedicated
> pipelining systems like those in DPDK are getting closer to that ideal.
>
> An example of where averages let you down is on the classic 5 minute
> data reduction things like mrtg do, where you might see `60% of the
> bandwidth (capacity/5 minutes) in use, yet still see drops because
> over shorter intervals (capacity/10ms) you have bursts arriving.
>
I think in this case, averages make sense because we are measuring
resource utilization - CPU (ab)use. Assuming it costs more when
we don't bulk and we make up for it when we bulk, then over the
measurement period we can find out if it is an overall win.
Same thing with bandwidth utilization - instantaneous bursts don't tell
you the overall utilization (so a double leaky bucket gives you
a better picture).
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 20:32 ` Jesper Dangaard Brouer
2014-10-02 5:18 ` Dave Taht
@ 2014-10-02 12:53 ` Jamal Hadi Salim
1 sibling, 0 replies; 38+ messages in thread
From: Jamal Hadi Salim @ 2014-10-02 12:53 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Tom Herbert, David Miller, Linux Netdev List, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On 10/01/14 16:32, Jesper Dangaard Brouer wrote:
> On Wed, 01 Oct 2014 16:05:31 -0400
> I'll try to make it more explicit.
> Will resubmit patchset shortly...
>
Thanks for providing the clarity.
> Notice it is not difficult to cause a queue to form, but it is tricky (not
> difficult) to correctly test this patchset. Perhaps you misread my
> statement earlier as "it was difficult to test and cause a queue to form"?
>
The conflict may be about what "difficult" or "common" means.
I know from experience that it is difficult under *normal*
circumstances to create the overload. For example, you had to turn
off GSO/TSO to see it for 10G ;-> iow, you had to go out of your way to
turn off a useful feature. So you are no longer dealing with "common".
Which is fine by me - I just wanted you to specify that that was the
case. I also wanted you to run the testcase which said "this is
how things were before my patch" for 10G (with GSO/TSO). But I
don't think you are in the mood for that, and if your patch goes
in, then the reaction to any regression wouldn't take long.
If something breaks, people would start whining in short order.
So I am ok with it.
> I think you could read this blog in 30 sec:
> http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
>
Ok, yes that message was clear in 30 seconds or less ;-> Immediate
gratification.
It is what I would do (haven't tried tweaking C-states) - but off the
bat, I would do what you did when I want to run serious networking.
> My cover letter and testing section... will take you longer than 30
> sec, it has grown quite large (and Eric will not even read it :-P ;-))
>
But it is referenced forever - so I can go back and read it. A URL
may return a 404 in 5 years.
> Believe it or not, I've actually restricted and reduced the testing
> section.
I agree there is an upper bound to how much testing you can do.
Looking forward to seeing a paper with all the nice details.
cheers,
jamal
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-10-01 17:28 ` Jesper Dangaard Brouer
2014-10-01 18:55 ` Jamal Hadi Salim
@ 2014-10-01 19:47 ` David Miller
1 sibling, 0 replies; 38+ messages in thread
From: David Miller @ 2014-10-01 19:47 UTC (permalink / raw)
To: brouer
Cc: jhs, therbert, netdev, eric.dumazet, hannes, fw, dborkman,
alexander.duyck, john.r.fastabend, dave.taht, toke
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 1 Oct 2014 19:28:40 +0200
> Thus, code is activated only when q->qlen is >= 1. And I have already
> shown that we see a win with just bulking 2 packets:
> "Subject: qdisc/UDP_STREAM: measuring effect of qdisc bulk dequeue"
> http://thread.gmane.org/gmane.linux.network/331152/focus=331154
>
> I actually believe that I have already measured and shown that going
> through the bulk dequeue is a win; this should show direct_xmit vs.
> bulking:
> "Subject: qdisc/trafgen: Measuring effect of qdisc bulk dequeue, with trafgen"
> http://thread.gmane.org/gmane.linux.network/331152
+1 Amen...
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 8:53 [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-30 11:07 ` Jamal Hadi Salim
@ 2014-09-30 11:48 ` Eric Dumazet
2014-09-30 12:25 ` Florian Westphal
2014-09-30 22:07 ` Jesper Dangaard Brouer
2014-09-30 12:34 ` Eric Dumazet
2 siblings, 2 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 11:48 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 10:53 +0200, Jesper Dangaard Brouer wrote:
> Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> sending/processing an entire skb list.
>
> This patch implements qdisc bulk dequeue, by allowing multiple packets
> to be dequeued in dequeue_skb().
>
> The optimization principle for this is two fold, (1) to amortize
> locking cost and (2) avoid expensive tailptr update for notifying HW.
> (1) Several packets are dequeued while holding the qdisc root_lock,
> amortizing locking cost over several packets. The dequeued SKB list is
> processed under the TXQ lock in dev_hard_start_xmit(), thus also
> amortizing the cost of the TXQ lock.
> (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
> API to delay HW tailptr update, which also reduces the cost per
> packet.
>
> One restriction of the new API is that every SKB must belong to the
> same TXQ. This patch takes the easy way out, by restricting bulk
> dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
> the qdisc only has a single TXQ attached.
>
> Some detail about the flow; dev_hard_start_xmit() will process the skb
> list, and transmit packets individually towards the driver (see
> xmit_one()). In case the driver stops midway in the list, the
> remaining skb list is returned by dev_hard_start_xmit(). In
> sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
>
> To avoid overshooting the HW limits, which results in requeuing, the
> patch limits the amount of bytes dequeued, based on the drivers BQL
> limits. In-effect bulking will only happen for BQL enabled drivers.
> Besides the bytelimit from BQL, also limit bulking to maximum 7
> packets to avoid any issues with available descriptor in HW and
> any HoL issues measured at 100Mbit/s.
>
> For now, as a conservative approach, don't bulk dequeue GSO and
> segmented GSO packets, because testing showed requeuing occurring
> with segmented GSO packets.
>
> Joint work with Hannes, Daniel and Florian.
>
> Testing:
> --------
> Demonstrating the performance improvement of qdisc dequeue bulking is
> tricky, because the effect only "kicks in" once the qdisc system has a
> backlog. Thus, for a backlog to form, we need either 1) to exceed wirespeed
> of the link or 2) to exceed the capability of the device driver.
>
> For practical use-cases, the measurable effect of this will be a
> reduction in CPU usage.
>
> 01-TCP_STREAM:
> --------------
> Testing the effect for TCP involves disabling TSO and GSO, because TCP
> already benefits from bulking, via TSO and especially for GSO segmented
> packets. This patch views TSO/GSO as a separate kind of bulking, and
> avoids further bulking of these packet types.
>
> The measured perf diff benefit (at 10Gbit/s) for TCP_STREAM was 4.66%
> less CPU used on calls to _raw_spin_lock() (mostly from sch_direct_xmit).
>
> Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
> * Disabled bulking: 8.30% soft 88.75% idle
> * Enabled bulking: 7.80% soft 89.36% idle
>
> 02-UDP_STREAM
> -------------
> The measured perf diff benefit for UDP_STREAM was 5.26% less CPU used
> on calls to _raw_spin_lock(). UDP_STREAM with packet size -m 1472 (to
> avoid sending UDP/IP fragments).
>
> 03-trafgen driver test
> ----------------------
> The performance of the 10Gbit/s ixgbe driver is limited due to
> updating the HW ring-queue tail-pointer on every packet. As
> previously demonstrated with pktgen.
>
> Using trafgen to send RAW frames from userspace (via AF_PACKET), and
> forcing it through qdisc path (with option --qdisc-path and -t0),
> sending with 12 CPUs.
>
> I can demonstrate this driver layer limitation:
> * 12.8 Mpps with no qdisc bulking
> * 14.8 Mpps with qdisc bulking (full 10G-wirespeed)
>
> 04-netperf-wrapper
> ------------------
> I've designed some tests for the tool "netperf-wrapper" that tries to
> measure the Head-of-Line (HoL) blocking effect.
>
> The bulking "budget" limit of 7 packets was chosen based on testing
> with netperf-wrapper. With the igb driver at 100Mbit/s and a budget of
> max 7 packets I cannot measure any added HoL latency. Increasing this
> value to 8 packets, I observed a fluctuating added 0.24ms delay,
> corresponding to 3000 bytes at 100Mbit/s.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
>
> ---
> V5:
> - Changed "budget" limit to 7 packets (from 8)
> - Choosing to use skb->len, because BQL uses that
> - Added testing results to patch desc
>
> V4:
> - Patch rewritten in the Red Hat Neuchatel office, joint work with
> Hannes, Daniel and Florian.
> - Conservative approach of only using on BQL enabled drivers
> - No user tunable parameter, but limit bulking to 8 packets.
> - For now, avoid bulking GSO packets.
>
> V3:
> - Correct use of BQL
> - Some minor adjustments based on feedback.
> - Default setting only bulk dequeue 1 extra packet (2 packets).
>
> V2:
> - Restruct functions, split out functionality
> - Use BQL bytelimit to avoid overshooting driver limits
>
>
> include/net/sch_generic.h | 16 +++++++++++++++
> net/sched/sch_generic.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 61 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index e65b8e0..0654dfa 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -6,6 +6,7 @@
> #include <linux/rcupdate.h>
> #include <linux/pkt_sched.h>
> #include <linux/pkt_cls.h>
> +#include <linux/dynamic_queue_limits.h>
> #include <net/gen_stats.h>
> #include <net/rtnetlink.h>
>
> @@ -111,6 +112,21 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
> qdisc->__state &= ~__QDISC___STATE_RUNNING;
> }
>
> +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> +{
> + return qdisc->flags & TCQ_F_ONETXQUEUE;
> +}
> +
> +static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
> +{
> +#ifdef CONFIG_BQL
> + /* Non-BQL migrated drivers will return 0, too. */
> + return dql_avail(&txq->dql);
> +#else
> + return 0;
> +#endif
> +}
> +
> static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
> {
> return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state) ? true : false;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 11b28f6..0fa8f0e 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -56,6 +56,42 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> return 0;
> }
>
> +static struct sk_buff *try_bulk_dequeue_skb(struct Qdisc *q,
> + struct sk_buff *head_skb,
> + int bytelimit)
> +{
> + struct sk_buff *skb, *tail_skb = head_skb;
> + int budget = 7; /* For now, conservatively chosen limit */
> +
> + while (bytelimit > 0 && --budget > 0) {
> + /* For now, don't bulk dequeue GSO (or GSO segmented) pkts */
> + if (tail_skb->next || skb_is_gso(tail_skb))
> + break;
> +
> + skb = q->dequeue(q);
> + if (!skb)
> + break;
> +
> + bytelimit -= skb->len; /* covers GSO len */
> + skb = validate_xmit_skb(skb, qdisc_dev(q));
> + if (!skb)
> + break;
> +
> + /* "skb" can be a skb list after validate call above
> + * (GSO segmented), but it is okay to append it to
> + * current tail_skb->next, because next round will exit
> + * in case "tail_skb->next" is a skb list.
> + */
> + tail_skb->next = skb;
> + tail_skb = skb;
> + }
> +
> + return head_skb;
> +}
> +
> +/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
> + * A requeued skb (via q->gso_skb) can also be a SKB list.
> + */
> static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> {
> struct sk_buff *skb = q->gso_skb;
> @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> } else
> skb = NULL;
> } else {
> - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> + if (!(q->flags & TCQ_F_ONETXQUEUE) ||
> + !netif_xmit_frozen_or_stopped(txq)) {
> + int bytelimit = qdisc_avail_bulklimit(txq);
> +
> skb = q->dequeue(q);
> - if (skb)
> + if (skb) {
> + bytelimit -= skb->len;
> skb = validate_xmit_skb(skb, qdisc_dev(q));
> + }
> + if (skb && qdisc_may_bulk(q))
> + skb = try_bulk_dequeue_skb(q, skb, bytelimit);
> }
> }
Okay...
In my opinion, you guys missed some opportunities:
1) Ideally the algo should try to call validate_xmit_skb() outside of
the locks (qdisc lock or device lock), because eventually doing checksum
computation or full segmentation should allow other CPUs to do enqueues.
2) TSO support. Have you ideas how to perform this?
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 11:48 ` Eric Dumazet
@ 2014-09-30 12:25 ` Florian Westphal
2014-09-30 12:38 ` Eric Dumazet
2014-09-30 22:07 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 38+ messages in thread
From: Florian Westphal @ 2014-09-30 12:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > struct sk_buff *skb = q->gso_skb;
> > @@ -70,10 +106,17 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> > } else
> > skb = NULL;
> > } else {
> > - if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> > + if (!(q->flags & TCQ_F_ONETXQUEUE) ||
> > + !netif_xmit_frozen_or_stopped(txq)) {
> > + int bytelimit = qdisc_avail_bulklimit(txq);
> > +
> > skb = q->dequeue(q);
> > - if (skb)
> > + if (skb) {
> > + bytelimit -= skb->len;
> > skb = validate_xmit_skb(skb, qdisc_dev(q));
> > + }
> > + if (skb && qdisc_may_bulk(q))
> > + skb = try_bulk_dequeue_skb(q, skb, bytelimit);
> > }
> > }
>
> Okay...
>
> In my opinion, you guys missed some opportunities:
>
> 1) Ideally the algo should try to call validate_xmit_skb() outside of
> the locks (qdisc lock or device lock), because eventually doing checksum
> computation or full segmentation should allow other CPUs to do enqueues.
You're right, but this is not related to this patch specifically.
But sure, we can look into moving validate_xmit_skb calls outside of the
qdisc root locked sections.
> 2) TSO support. Have you ideas how to perform this?
Yes, the idea was to just remove the skb_is_gso() test and limit the
dql_avail() result to e.g. 128k to avoid an '8-fat-gso-skbs-in-a-row' scenario.
Did you have any other ideas in mind?
Thanks,
Florian
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 12:25 ` Florian Westphal
@ 2014-09-30 12:38 ` Eric Dumazet
2014-09-30 12:41 ` Florian Westphal
0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 12:38 UTC (permalink / raw)
To: Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 14:25 +0200, Florian Westphal wrote:
> You're right, but this is not related to this patch specifically.
> But sure, we can look into moving validate_xmit_skb calls outside of the
> qdisc root locked sections.
Yes please.
>
> > 2) TSO support. Have you ideas how to perform this?
>
> Yes, the idea was to just remove the skb_is_gso() test and limit
> dql_avail() result to e.g. 128k to avoid a '8-fat-gso-skbs-in-a-row' scenario.
Really, I think you are mistaken about GSO/TSO being fat skbs. This is not
a qdisc concern, but a packet producer concern. Think of TSO autosizing,
which already takes care of this.
>
> Did you have any other ideas in mind?
I guess my question was more like: should I wait for you to send patches,
or should I do this?
We are approaching the merge window, and I would like to get this sorted out
for linux-3.18
Thanks
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 12:38 ` Eric Dumazet
@ 2014-09-30 12:41 ` Florian Westphal
0 siblings, 0 replies; 38+ messages in thread
From: Florian Westphal @ 2014-09-30 12:41 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, Jesper Dangaard Brouer, netdev, David S. Miller,
Tom Herbert, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I guess my question was more like: should I wait for you to send patches,
> or should I do this?
Sure, if you have time to work on this go ahead.
Thanks Eric!
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 11:48 ` Eric Dumazet
2014-09-30 12:25 ` Florian Westphal
@ 2014-09-30 22:07 ` Jesper Dangaard Brouer
2014-09-30 22:15 ` Eric Dumazet
1 sibling, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-30 22:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke, brouer
On Tue, 30 Sep 2014 04:48:56 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> In my opinion, you guys missed some opportunities:
This is a conservative first approach. The merge window is approaching
fast, and it is annoying that you are blocking a safe, conservative approach.
> 1) Ideally the algo should try to call validate_xmit_skb() outside of
> the locks (qdisc lock or device lock), because eventually doing checksum
> computation or full segmentation should allow other cpus doing enqueues.
Save it for later please, don't complicate things further.
> 2) TSO support. Have you ideas how to perform this ?
Yes, we already have patches floating around which implement bulking of TSO
and GSO packets. You will get that.
Please one step at a time, also for bisect-ability...
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 22:07 ` Jesper Dangaard Brouer
@ 2014-09-30 22:15 ` Eric Dumazet
0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 22:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Wed, 2014-10-01 at 00:07 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 30 Sep 2014 04:48:56 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > In my opinion, you guys missed some opportunities:
>
> This is a conservative first approach. The merge window is approaching
> fast, and it is annoying that you are blocking a safe, conservative approach.
I am not blocking at all.
I am only disappointed V5 is no different from V4.
I will shut up then since it seems I am blocking things, or so you
believe.
* Re: [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-30 8:53 [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-30 11:07 ` Jamal Hadi Salim
2014-09-30 11:48 ` Eric Dumazet
@ 2014-09-30 12:34 ` Eric Dumazet
2014-09-30 12:51 ` [net-next PATCH] dql: add a burst attribute Eric Dumazet
2 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 12:34 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 10:53 +0200, Jesper Dangaard Brouer wrote:
> For now, as a conservative approach, don't bulk dequeue GSO and
> segmented GSO packets, because testing showed requeuing occuring
> with segmented GSO packets.
Note that GSO happens after qdisc dequeue.
We should not care if part of a GSO train is stopped because the TX ring is
full. If we stop a GSO train, then we only lower GRO aggregation for the
receiver.
Normally, if BQL is properly working, we should practically never hit the TX
ring buffer limit, unless you send very small packets just to stress the
thing (synthetic benchmark, not a real workload).
We should not stop a GSO train because of BQL: remember that BQL is there to
avoid head-of-line blocking, but there is no way another high prio
packet can be sent on the wire before one of the packets resulting from GSO
segmentation.
So saying that requeueing is happening with GSO is kind of irrelevant.
* [net-next PATCH] dql: add a burst attribute
2014-09-30 12:34 ` Eric Dumazet
@ 2014-09-30 12:51 ` Eric Dumazet
2014-09-30 13:46 ` Florian Westphal
2014-09-30 21:29 ` Jesper Dangaard Brouer
0 siblings, 2 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 12:51 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
From: Eric Dumazet <edumazet@google.com>
Add a new dql attribute, to control how many packets we are allowed to
burst from qdisc to device.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Jesper, please consider using this instead of the 7 or 8 values you hard-coded.
include/linux/dynamic_queue_limits.h | 1
lib/dynamic_queue_limits.c | 1
net/core/net-sysfs.c | 31 +++++++++++++++++++++++++
3 files changed, 33 insertions(+)
diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index a4be70398ce1..8509053eb330 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -42,6 +42,7 @@ struct dql {
unsigned int num_queued; /* Total ever queued */
unsigned int adj_limit; /* limit + num_completed */
unsigned int last_obj_cnt; /* Count at last queuing */
+ unsigned int burst;
/* Fields accessed only by completion path (dql_completed) */
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index 0777c5a45fa0..acc0d6105191 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -132,6 +132,7 @@ int dql_init(struct dql *dql, unsigned hold_time)
dql->max_limit = DQL_MAX_LIMIT;
dql->min_limit = 0;
dql->slack_hold_time = hold_time;
+ dql->burst = 7;
dql_reset(dql);
return 0;
}
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9dd06699b09c..80cf6f166706 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -978,6 +978,36 @@ static struct netdev_queue_attribute bql_hold_time_attribute =
__ATTR(hold_time, S_IRUGO | S_IWUSR, bql_show_hold_time,
bql_set_hold_time);
+static ssize_t bql_show_burst(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attr,
+ char *buf)
+{
+ const struct dql *dql = &queue->dql;
+
+ return sprintf(buf, "%u\n", dql->burst);
+}
+
+static ssize_t bql_set_burst(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attribute,
+ const char *buf, size_t len)
+{
+ struct dql *dql = &queue->dql;
+ unsigned int value;
+ int err;
+
+ err = kstrtouint(buf, 10, &value);
+ if (err < 0)
+ return err;
+
+ dql->burst = value;
+
+ return len;
+}
+
+static struct netdev_queue_attribute bql_burst_attribute =
+ __ATTR(burst, S_IRUGO | S_IWUSR, bql_show_burst,
+ bql_set_burst);
+
static ssize_t bql_show_inflight(struct netdev_queue *queue,
struct netdev_queue_attribute *attr,
char *buf)
@@ -1019,6 +1049,7 @@ static struct attribute *dql_attrs[] = {
&bql_limit_min_attribute.attr,
&bql_hold_time_attribute.attr,
&bql_inflight_attribute.attr,
+ &bql_burst_attribute.attr,
NULL
};
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 12:51 ` [net-next PATCH] dql: add a burst attribute Eric Dumazet
@ 2014-09-30 13:46 ` Florian Westphal
2014-09-30 14:17 ` Eric Dumazet
2014-09-30 21:29 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 38+ messages in thread
From: Florian Westphal @ 2014-09-30 13:46 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Add a new dql attribute, to control how many packets we are allowed to
> burst from qdisc to device.
I understand the motivation for this, but I find it a bit out-of-place
to have a 'packet' type counter in bql context?
Would it perhaps make more sense to restrict bulk dequeues by an upper
(possibly changeable from userspace) byte counter limit?
Thanks,
Florian
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 13:46 ` Florian Westphal
@ 2014-09-30 14:17 ` Eric Dumazet
2014-09-30 14:26 ` Tom Herbert
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 14:17 UTC (permalink / raw)
To: Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 15:46 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Add a new dql attribute, to control how many packets we are allowed to
> > burst from qdisc to device.
>
> I understand the motivation for this, but I find it a bit out-of-place
> to have a 'packet' type counter in bql context?
>
> Would it perhaps make more sense to restrict bulk dequeues by an upper
> (possibly changeable from userspace) byte counter limit?
The byte count is already provided: it's the BQL limit.
We already have ways to tune it (limit_min & limit_max).
We do not think we need anything else here.
That was the whole point of reusing BQL in the first place: it's already
powerful and implemented.
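As a rough user-space sketch of the scheme being argued for here (a bulk
dequeue bounded only by a BQL-style byte budget, with no packet-count
knob; all names below are illustrative, not the kernel's):

```c
#include <stddef.h>

/* Toy model of a qdisc with a BQL-style byte budget.  A bulk
 * dequeue keeps pulling packets while the budget is positive, so
 * it may overshoot by at most one packet, like BQL itself. */
struct pkt {
    size_t len;
    struct pkt *next;
};

struct fake_queue {
    struct pkt *head;
    long avail;               /* bytes left in the byte budget */
};

/* Returns the head of the dequeued skb-like list. */
static struct pkt *bulk_dequeue(struct fake_queue *q)
{
    struct pkt *list = NULL, **tail = &list;

    while (q->head && q->avail > 0) {
        struct pkt *p = q->head;

        q->head = p->next;
        q->avail -= (long)p->len;
        p->next = NULL;
        *tail = p;
        tail = &p->next;
    }
    return list;
}
```

With a 1000-byte budget and 600-byte packets this dequeues two packets
(the second one overshoots the budget), mirroring how BQL treats its
own limit.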
This new attribute is a packet burst.
When I suggested a 'pburst' attribute for pktgen to Alexei, he sent a
patch with "burst". Nobody complained.
Now I am reusing what was apparently OK for pktgen, and you seem to object.
If you are not comfortable with "burst", rename it to whatever you
think is best.
But please, do not hard-code a magic 7 in your code.
Thanks
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:17 ` Eric Dumazet
@ 2014-09-30 14:26 ` Tom Herbert
2014-09-30 14:49 ` Eric Dumazet
2014-09-30 14:31 ` Florian Westphal
2014-09-30 15:26 ` Florian Westphal
2 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2014-09-30 14:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, Jesper Dangaard Brouer, Linux Netdev List,
David S. Miller, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Tue, Sep 30, 2014 at 7:17 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2014-09-30 at 15:46 +0200, Florian Westphal wrote:
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > Add a new dql attribute, to control how many packets we are allowed to
>> > burst from qdisc to device.
>>
>> I understand the motivation for this, but I find it a bit out-of-place
>> to have a 'packet' type counter in bql context?
>>
>> Would it perhaps make more sense to restrict bulk dequeues by an upper
>> (possibly changeable from userspace) byte counter limit?
>
> The byte count is already provided: it's the BQL limit.
> We already have ways to tune it (limit_min & limit_max).
> We do not think we need anything else here.
> That was the whole point of reusing BQL in the first place: it's already
> powerful and implemented.
>
> This new attribute is a packet burst.
>
Yes, this is the wrong place to put it. DQL knows nothing about
packets or technically even bytes; it's a generic library that deals
with "objects" on a queue. I don't see how burst relates to this.
I would infer from the patch that what you want is to add an attribute
to the TX queue. Why not do this directly?
>
> When I suggested a 'pburst' attribute for pktgen to Alexei, he sent a
> patch with "burst". Nobody complained.
>
> Now I am reusing what was apparently OK for pktgen, and you seem to object.
>
> If you are not comfortable with "burst", rename it to whatever you
> think is best.
>
> But please, do not hard-code a magic 7 in your code.
>
> Thanks
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:26 ` Tom Herbert
@ 2014-09-30 14:49 ` Eric Dumazet
2014-09-30 15:05 ` Tom Herbert
0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 14:49 UTC (permalink / raw)
To: Tom Herbert
Cc: Florian Westphal, Jesper Dangaard Brouer, Linux Netdev List,
David S. Miller, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Tue, 2014-09-30 at 07:26 -0700, Tom Herbert wrote:
> Yes, this is the wrong place to put it. DQL knows nothing about
> packets or technically even bytes; it's a generic library that deals
> with "objects" on a queue. I don't see how burst relates to this.
I wonder then why you put:
netdev_tx_completed_queue(struct netdev_queue *dev_queue,
unsigned int pkts, unsigned int bytes)
Really, I don't care _where_ the attribute is, as long as we don't access
yet another cache line in the fast path.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:49 ` Eric Dumazet
@ 2014-09-30 15:05 ` Tom Herbert
0 siblings, 0 replies; 38+ messages in thread
From: Tom Herbert @ 2014-09-30 15:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, Jesper Dangaard Brouer, Linux Netdev List,
David S. Miller, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Tue, Sep 30, 2014 at 7:49 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2014-09-30 at 07:26 -0700, Tom Herbert wrote:
>
>
>> Yes, this is the wrong place to put it. DQL knows nothing about
>> packets or technically even bytes; it's a generic library that deals
>> with "objects" on a queue. I don't see how burst relates to this.
>
> I wonder then why you put:
>
> netdev_tx_completed_queue(struct netdev_queue *dev_queue,
> unsigned int pkts, unsigned int bytes)
>
In case we ever wanted to count by packets instead of or in addition to bytes.
> Really, I don't care _where_ the attribute is, as long as we don't access
> yet another cache line in the fast path.
Just put it after/before dql in the netdev_queue structure.
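The placement Tom suggests amounts to keeping the new field on the same
cache line as the dql member. A rough sketch with stand-in struct
definitions (the real kernel `struct dql` and `struct netdev_queue` have
many more fields and alignment annotations; `pkt_burst` is a
hypothetical name):

```c
#include <stddef.h>

/* Stand-in definitions only.  The point is purely the adjacency:
 * the hypothetical burst knob sits right next to dql, so the TX
 * fast path that already touches dql pays for no extra cache line. */
struct fake_dql {
    unsigned int limit;        /* current byte limit */
    unsigned int num_queued;   /* bytes queued since last reset */
    unsigned int adj_limit;    /* limit + num_completed */
};

struct fake_netdev_queue {
    struct fake_dql dql;
    unsigned long pkt_burst;   /* hypothetical packet-burst knob */
};
```

On a machine with 64-byte cache lines, `pkt_burst` here lands on the
same line as the start of `dql`.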
>
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:17 ` Eric Dumazet
2014-09-30 14:26 ` Tom Herbert
@ 2014-09-30 14:31 ` Florian Westphal
2014-09-30 14:55 ` Eric Dumazet
2014-09-30 15:26 ` Florian Westphal
2 siblings, 1 reply; 38+ messages in thread
From: Florian Westphal @ 2014-09-30 14:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, Jesper Dangaard Brouer, netdev, David S. Miller,
Tom Herbert, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> If you are not comfortable with "burst", rename it to whatever you
> think is best.
> But please, do not hard-code a magic 7 in your code.
I had hoped that this 'magic' value could be removed
completely, only using bql data for bulking decisions.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:31 ` Florian Westphal
@ 2014-09-30 14:55 ` Eric Dumazet
2014-09-30 21:46 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 14:55 UTC (permalink / raw)
To: Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 16:31 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > If you are not comfortable with "burst", rename it to whatever you
> > think is best.
> > But please, do not hard-code a magic 7 in your code.
>
> I had hoped that this 'magic' value could be removed
> completely, only using bql data for bulking decisions.
But that is apparently not the case, since you guys decided to have it set
to 8, then to 7 later, based on experiments.
This value is probably dependent on the NIC hardware, while BQL
adj_limit is more likely tied to external factors, like latencies to
react to TX IRQ and perform TX completion.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:55 ` Eric Dumazet
@ 2014-09-30 21:46 ` Jesper Dangaard Brouer
2014-10-01 4:44 ` Tom Herbert
0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-30 21:46 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke, brouer
On Tue, 30 Sep 2014 07:55:36 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2014-09-30 at 16:31 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > If you are not comfortable with "burst", rename it to whatever you
> > > think is best.
> > > But please, do not hard-code a magic 7 in your code.
> >
> > I had hoped that this 'magic' value could be removed
> > completely, only using bql data for bulking decisions.
>
> But that is apparently not the case, since you guys decided to have it set
> to 8, then to 7 later, based on experiments.
The "magic" limit is only a conservative safeguard. We do plan to
remove it completely. That is also the reason for not exporting it: we
want a free hand to remove it later.
I guess I'll remove it completely now, so we can move on.
I would like an on/off switch, exported to userspace, for disabling
this bulking (then we don't need this conservative magic number). How
would that best be done?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 21:46 ` Jesper Dangaard Brouer
@ 2014-10-01 4:44 ` Tom Herbert
0 siblings, 0 replies; 38+ messages in thread
From: Tom Herbert @ 2014-10-01 4:44 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, Florian Westphal, Linux Netdev List,
David S. Miller, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Tue, Sep 30, 2014 at 2:46 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Tue, 30 Sep 2014 07:55:36 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Tue, 2014-09-30 at 16:31 +0200, Florian Westphal wrote:
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > If you are not comfortable with "burst", rename it to whatever you
> > > > think is best.
> > > > But please, do not hard-code a magic 7 in your code.
> > >
> > > I had hoped that this 'magic' value could be removed
> > > completely, only using bql data for bulking decisions.
> >
> > But that is apparently not the case, since you guys decided to have it set
> > to 8, then to 7 later, based on experiments.
>
> The "magic" limit is only a conservative safeguard. We do plan to
> remove it completely. That is also the reason for not exporting it: we
> want a free hand to remove it later.
>
> I guess I'll remove it completely now, so we can move on.
>
>
> I would like an on/off switch, exported to userspace, for disabling
> this bulking (then we don't need this conservative magic number). How
> would that best be done?
I'd suggest keeping the packet limit as an unsigned. The default value is
infinity (-1UL), 0 turns it off, and other values are available if
someone really wants to tune this (for example, if BQL is disabled, the
BQL limit is infinity). Export this as a sysfs variable for the TX queue.
Tom
>
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Sr. Network Kernel Developer at Red Hat
> Author of http://www.iptv-analyzer.org
> LinkedIn: http://www.linkedin.com/in/brouer
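The semantics Tom proposes (default unlimited, 0 disables bulking) could
look roughly like this; `pkt_limit` and `bulk_budget` are invented names
for illustration, not an existing kernel API:

```c
/* Hypothetical per-TX-queue packet limit as described above:
 * ~0UL ("infinity") is the default and means the byte budget (BQL)
 * alone decides; 0 turns bulking off, i.e. one packet per dequeue;
 * anything else is an explicit cap. */
static unsigned long bulk_budget(unsigned long pkt_limit)
{
    if (pkt_limit == 0)
        return 1;        /* bulking disabled: single-packet dequeue */
    return pkt_limit;    /* explicit cap, or ~0UL for "no cap" */
}
```

The dequeue loop would then simply stop once this many packets have been
pulled, leaving the byte-based BQL check unchanged.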
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 14:17 ` Eric Dumazet
2014-09-30 14:26 ` Tom Herbert
2014-09-30 14:31 ` Florian Westphal
@ 2014-09-30 15:26 ` Florian Westphal
2014-09-30 15:39 ` Dave Taht
2014-09-30 15:41 ` Eric Dumazet
2 siblings, 2 replies; 38+ messages in thread
From: Florian Westphal @ 2014-09-30 15:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, Jesper Dangaard Brouer, netdev, David S. Miller,
Tom Herbert, Hannes Frederic Sowa, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2014-09-30 at 15:46 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Add a new dql attribute, to control how many packets we are allowed to
> > > burst from qdisc to device.
> >
> > I understand the motivation for this, but I find it a bit out-of-place
> > to have a 'packet' type counter in bql context?
> >
> > Would it perhaps make more sense to restrict bulk dequeues by an upper
> > (possibly changeable from userspace) byte counter limit?
>
> The byte count is already provided: it's the BQL limit.
> We already have ways to tune it (limit_min & limit_max).
> We do not think we need anything else here.
So you're saying that a bulk dequeue should just grab as many skbs
as possible until no more are available or dql_avail is exhausted?
The "magic" value was just to be conservative and not induce any
head-of-line (HOL) blocking, which is also why Jesper reduced it again
in the latest submission.
Then, we might later be able to remove the TSO restriction and switch
to a pure byte-based limit.
(I don't think having a packet-based count makes sense once tso/gso
skbs can be bulk dequeued).
Sorry for the confusion.
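For contrast with the byte-only scheme Eric favors, the conservative
variant Florian describes bounds the bulk both by bytes and by a small
packet cap (the "magic 7"). A simplified user-space sketch, with names
invented for illustration:

```c
#include <stddef.h>

struct pkt {
    size_t len;
    struct pkt *next;
};

/* Dequeue until the queue empties, the byte budget runs out, or the
 * conservative packet cap is reached; returns how many packets were
 * taken and advances *head past them. */
static unsigned int bulk_dequeue_capped(struct pkt **head,
                                        long bytelimit,
                                        unsigned int pkt_cap)
{
    unsigned int n = 0;

    while (*head && bytelimit > 0 && n < pkt_cap) {
        bytelimit -= (long)(*head)->len;
        *head = (*head)->next;
        n++;
    }
    return n;
}
```

Setting `pkt_cap` to ~0U degenerates into a pure byte-based limit, which
is what dropping the magic number amounts to.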
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 15:26 ` Florian Westphal
@ 2014-09-30 15:39 ` Dave Taht
2014-09-30 15:41 ` Eric Dumazet
1 sibling, 0 replies; 38+ messages in thread
From: Dave Taht @ 2014-09-30 15:39 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev@vger.kernel.org,
David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Daniel Borkmann, Jamal Hadi Salim, Alexander Duyck,
John Fastabend, Toke Høiland-Jørgensen
On Tue, Sep 30, 2014 at 8:26 AM, Florian Westphal <fw@strlen.de> wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2014-09-30 at 15:46 +0200, Florian Westphal wrote:
>> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > > From: Eric Dumazet <edumazet@google.com>
>> > >
>> > > Add a new dql attribute, to control how many packets we are allowed to
>> > > burst from qdisc to device.
>> >
>> > I understand the motivation for this, but I find it a bit out of place
>> > to have a 'packet' type counter in bql context?
>> >
>> > Would it perhaps make more sense to restrict bulk dequeues by an upper
>> > (possibly changeable from userspace) byte counter limit?
>>
>> The byte count is already provided: it's the BQL limit.
>> We already have ways to tune it (limit_min & limit_max).
>> We do not think we need anything else here.
>
> So you're saying that a bulk dequeue should just grab as many skbs
> as possible until no more are available or dql_avail is exhausted?
>
> The "magic" value was just to be conservative and not induce any
> head-of-line (HOL) blocking, which is also why Jesper reduced it again
> in the latest submission.
>
> Then, we might later be able to remove the TSO restriction and switch
> to a pure byte-based limit.
>
> (I don't think having a packet-based count makes sense once tso/gso
> skbs can be bulk dequeued).
>
> Sorry for the confusion.
I still have some hope that we can one day fix wifi packet aggregation,
which has a limit of 42 packets or 64k bytes per aggregate (best case),
with something BQL-like.
As for the tx ring size problem vs GSO, there is at least one driver
I can think of with a very limited tx ring that tears apart GSO
packets in the driver... but I'm not sure if it's mainlined.
I would not mind at all if TSO/GSO were banned on devices running
slower than 100Mbit.
Perhaps exporting the tx-ring size would be saner than a burst
parameter?
--
Dave Täht
https://www.bufferbloat.net/projects/make-wifi-fast
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 15:26 ` Florian Westphal
2014-09-30 15:39 ` Dave Taht
@ 2014-09-30 15:41 ` Eric Dumazet
1 sibling, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-09-30 15:41 UTC (permalink / raw)
To: Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Tue, 2014-09-30 at 17:26 +0200, Florian Westphal wrote:
> So you're saying that a bulk dequeue should just grab as many skbs
> as possible until no more are available or dql_avail is exhausted?
>
This was certainly the intent.
> The "magic" value was just to be conservative and not induce any
> head-of-line (HOL) blocking, which is also why Jesper reduced it again
> in the latest submission.
Think about what's happening here.
BQL controls the optimal queue _in_ the NIC TX ring, given latencies and
throughput goals.
Once packets are queued in the NIC TX ring, there is no way a high-prio
packet can be injected ahead of them, unless the NIC already has
separate TX rings for different priorities.
There is no need to redo this in another layer. BQL logic was hard to
set up and tune; I have no idea why you want to reinvent it.
>
> Then, we might later be able to remove the TSO restriction and switch
> to a pure byte-based limit.
Why later? If it is not needed now, it should not be added 'to be safe'.
>
> (I don't think having a packet-based count makes sense once tso/gso
> skbs can be bulk dequeued).
So do not add this limit; that way we can spot bugs ahead of time.
If some NIC wants a 'doorbell at least once every 8 packets', then the
NIC driver should be fixed.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [net-next PATCH] dql: add a burst attribute
2014-09-30 12:51 ` [net-next PATCH] dql: add a burst attribute Eric Dumazet
2014-09-30 13:46 ` Florian Westphal
@ 2014-09-30 21:29 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-30 21:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke, brouer
On Tue, 30 Sep 2014 05:51:25 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Add a new dql attribute, to control how many packets we are allowed to
> burst from qdisc to device.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> Jesper, please consider using this instead of 7 or 8 values you hard coded.
The hard-coded budget was only a safeguard. I didn't want to export
it (as we could never remove it then); I plan to remove it or replace it
with something else.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2014-10-02 13:02 UTC | newest]
Thread overview: 38+ messages
2014-09-30 8:53 [net-next PATCH V5] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-30 11:07 ` Jamal Hadi Salim
2014-09-30 18:20 ` David Miller
2014-10-01 13:17 ` Jamal Hadi Salim
2014-10-01 14:55 ` Tom Herbert
2014-10-01 15:34 ` Jamal Hadi Salim
2014-10-01 17:28 ` Jesper Dangaard Brouer
2014-10-01 18:55 ` Jamal Hadi Salim
2014-10-01 19:47 ` Jesper Dangaard Brouer
2014-10-01 20:05 ` Jamal Hadi Salim
2014-10-01 20:32 ` Jesper Dangaard Brouer
2014-10-02 5:18 ` Dave Taht
2014-10-02 7:44 ` Jesper Dangaard Brouer
2014-10-02 10:08 ` Toke Høiland-Jørgensen
2014-10-02 13:02 ` Jamal Hadi Salim
2014-10-02 12:53 ` Jamal Hadi Salim
2014-10-01 19:47 ` David Miller
2014-09-30 11:48 ` Eric Dumazet
2014-09-30 12:25 ` Florian Westphal
2014-09-30 12:38 ` Eric Dumazet
2014-09-30 12:41 ` Florian Westphal
2014-09-30 22:07 ` Jesper Dangaard Brouer
2014-09-30 22:15 ` Eric Dumazet
2014-09-30 12:34 ` Eric Dumazet
2014-09-30 12:51 ` [net-next PATCH] dql: add a burst attribute Eric Dumazet
2014-09-30 13:46 ` Florian Westphal
2014-09-30 14:17 ` Eric Dumazet
2014-09-30 14:26 ` Tom Herbert
2014-09-30 14:49 ` Eric Dumazet
2014-09-30 15:05 ` Tom Herbert
2014-09-30 14:31 ` Florian Westphal
2014-09-30 14:55 ` Eric Dumazet
2014-09-30 21:46 ` Jesper Dangaard Brouer
2014-10-01 4:44 ` Tom Herbert
2014-09-30 15:26 ` Florian Westphal
2014-09-30 15:39 ` Dave Taht
2014-09-30 15:41 ` Eric Dumazet
2014-09-30 21:29 ` Jesper Dangaard Brouer