* [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
@ 2014-09-19 20:49 Jesper Dangaard Brouer
2014-09-19 20:49 ` [RFC net-next PATCH V3 1/2] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-19 20:49 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann
Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
from the qdisc layer, to implement dequeue bulking.
RFC V3: Keeping the ball rolling.
This patchset should now use BQL correctly. I've done lots of testing
for Head-of-Line blocking issues that can occur due to requeue of a
SKB bulk list. I've not been able to provoke any HoL blocking
situation, simply because BQL is doing such a good job, thus I'm
unable to "overshoot" HW/BQL limits with more than a single packet.
This patch chooses a very conservative approach, as by default only
allowing dequeue of one extra packet, besides the normal dequeue.
Open questions:
- How do we expose tuning to userspace?
Patch adds /proc/sys/net/core/qdisc_bulk_dequeue_limit but I don't like it...
Per device tunable?
- Can/should we limit dequeue bulking to devices supporting BQL?
Based on top of net-next:
commit cb93471acc (tcp: do not fake tcp headers in tcp_send_rcvq())
---
Jesper Dangaard Brouer (1):
qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Tom Herbert (1):
net: Functions to report space available in device TX queues
include/linux/netdevice.h | 28 ++++++++++++++++--
include/net/sch_generic.h | 2 +
net/core/sysctl_net_core.c | 9 ++++++
net/sched/sch_generic.c | 70 ++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 104 insertions(+), 5 deletions(-)
--
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC net-next PATCH V3 1/2] net: Functions to report space available in device TX queues
2014-09-19 20:49 [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
@ 2014-09-19 20:49 ` Jesper Dangaard Brouer
2014-09-19 20:49 ` [RFC net-next PATCH V3 2/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-19 20:59 ` [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Eric Dumazet
2 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-19 20:49 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann
Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
From: Tom Herbert <therbert@google.com>
This patch adds netdev_tx_avail_queue and netdev_avail_queue, which
report the number of bytes available in transmit queues per BQL. The
functions call dql_avail, which returns the BQL limit minus the number
of in-flight bytes. These functions can be called without the tx_lock,
for instance to ascertain how much data should be dequeued from a qdisc
in a batch. When called without the tx_lock, the result is technically
a hint; by the time the tx_lock is taken for a transmit, the
availability may have changed (for example, a transmit completion may
have freed up more space in the queue or changed the limit).
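For illustration, the quantity dql_avail() reports can be modeled in
userspace as "current limit minus bytes still in flight". The struct and
function names below are simplified stand-ins mirroring the kernel's
struct dql, not the actual implementation:

```c
#include <assert.h>

/* Userspace model of what dql_avail() computes: the remaining byte
 * budget is the current limit minus the bytes handed to the device but
 * not yet reported completed. Illustrative sketch only. */
struct dql_model {
	unsigned int num_queued;    /* bytes handed to the device */
	unsigned int num_completed; /* bytes the device reported done */
	unsigned int limit;         /* current BQL byte limit */
};

static int dql_model_avail(const struct dql_model *dql)
{
	/* Can go negative when the queue has overshot its limit */
	return (int)dql->limit - (int)(dql->num_queued - dql->num_completed);
}
```

Note the signed return value: a caller such as the qdisc layer can treat
a negative result as "no budget, stop dequeuing".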
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
- Fixed spelling in comments
include/linux/netdevice.h | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 28d4378..b12b919 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2551,6 +2551,30 @@ static inline void netdev_completed_queue(struct net_device *dev,
netdev_tx_completed_queue(netdev_get_tx_queue(dev, 0), pkts, bytes);
}
+static inline int netdev_tx_avail_queue(struct netdev_queue *dev_queue)
+{
+#ifdef CONFIG_BQL
+ return dql_avail(&dev_queue->dql);
+#else
+ return DQL_MAX_LIMIT;
+#endif
+}
+
+/**
+ * netdev_avail_queue - report how much space is available for xmit
+ * @dev: network device
+ *
+ * Report the amount of space available in the TX queue in terms of
+ * number of bytes. This returns the number of bytes available per
+ * DQL. This function may be called without taking the txlock on
+ * the device, however in that case the result should be taken as
+ * a (strong) hint.
+ */
+static inline int netdev_avail_queue(struct net_device *dev)
+{
+ return netdev_tx_avail_queue(netdev_get_tx_queue(dev, 0));
+}
+
static inline void netdev_tx_reset_queue(struct netdev_queue *q)
{
#ifdef CONFIG_BQL
@@ -2566,9 +2590,9 @@ static inline void netdev_tx_reset_queue(struct netdev_queue *q)
* Reset the bytes and packet count of a network device and clear the
* software flow control OFF bit for this network device
*/
-static inline void netdev_reset_queue(struct net_device *dev_queue)
+static inline void netdev_reset_queue(struct net_device *dev)
{
- netdev_tx_reset_queue(netdev_get_tx_queue(dev_queue, 0));
+ netdev_tx_reset_queue(netdev_get_tx_queue(dev, 0));
}
/**
* [RFC net-next PATCH V3 2/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-19 20:49 [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-19 20:49 ` [RFC net-next PATCH V3 1/2] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
@ 2014-09-19 20:49 ` Jesper Dangaard Brouer
2014-09-20 0:22 ` Tom Herbert
2014-09-19 20:59 ` [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Eric Dumazet
2 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-19 20:49 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann
Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.
This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().
The optimization principle for this is twofold: (1) amortize locking
cost and (2) avoid the expensive tailptr update for notifying HW.
(1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packets. The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
(2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay the HW tailptr update, which also reduces the cost per
packet.
One restriction of the new API is that every SKB must belong to the
same TXQ. This patch takes the easy way out, by restricting bulk
dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
the qdisc has only a single TXQ attached.
Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()). In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit(). In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
To avoid overshooting the HW limits, which results in requeuing, the
patch limits the number of bytes dequeued, based on the driver's BQL
limit.
It also treats GSO and segmented GSO packets as a separate kind of SKB
bulking, and thus stops dequeuing when seeing a GSO packet (both real
GSO and segmented GSO skb lists).
Choosing a very conservative initial default value of 1 extra packet
dequeue, if bulking is allowed.
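For illustration, the dequeue policy described above (a BQL byte budget
plus a small extra-packet budget) can be modeled as a standalone sketch;
bulk_dequeue_count and its parameters are hypothetical stand-ins for the
kernel loop, not the patch code itself:

```c
#include <assert.h>

/* Model of the byte/packet-limited bulk dequeue: the first dequeue is
 * the normal one; extra packets are bulked only while both the byte
 * budget and the extra-packet budget (qdisc_bulk_dequeue_limit in the
 * patch) allow it. Assumes extra_limit > 0, as the caller checks. */
static int bulk_dequeue_count(const int *pkt_len, int npkts,
			      int bytelimit, int extra_limit)
{
	int i = 0;

	if (npkts == 0)
		return 0;
	bytelimit -= pkt_len[i++];	/* the normal, non-bulk dequeue */
	if (bytelimit < 0)
		return i;		/* no byte budget left: no bulking */
	while (i < npkts) {
		bytelimit -= pkt_len[i++];	/* an extra, bulked dequeue */
		if (--extra_limit == 0 || bytelimit < 0)
			break;
	}
	return i;
}
```

As in the patch, a packet may be dequeued even when it pushes the byte
budget negative; the budget check only stops further bulking.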
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
V2:
- Restructure functions, split out functionality
- Use BQL bytelimit to avoid overshooting driver limits
V3:
- Correct use of BQL
- Some minor adjustments based on feedback.
- Default setting only bulk dequeue 1 extra packet (2 packets).
include/net/sch_generic.h | 2 +
net/core/sysctl_net_core.c | 9 ++++++
net/sched/sch_generic.c | 70 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 78 insertions(+), 3 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 1e89b9a..da9324f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -14,6 +14,8 @@ struct qdisc_walker;
struct tcf_walker;
struct module;
+extern int qdisc_bulk_dequeue_limit;
+
struct qdisc_rate_table {
struct tc_ratespec rate;
u32 data[256];
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13..5505841 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -361,6 +361,15 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "qdisc_bulk_dequeue_limit",
+ .data = &qdisc_bulk_dequeue_limit,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .extra1 = &zero,
+ .extra2 = &ushort_max,
+ .proc_handler = proc_dointvec_minmax
+ },
{ }
};
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 346ef85..eee2280 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -34,6 +34,9 @@
const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
EXPORT_SYMBOL(default_qdisc_ops);
+// FIXME: Still undecided where to put this parameter...
+int qdisc_bulk_dequeue_limit __read_mostly = 1;
+
/* Main transmission queue. */
/* Modifications to data participating in scheduling must be protected with
@@ -56,6 +59,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
return 0;
}
+static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
+{
+ return qdisc->flags & TCQ_F_ONETXQUEUE;
+}
+
+static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
+{
+ struct sk_buff *skb = qdisc->dequeue(qdisc);
+
+ if (skb != NULL)
+ skb = validate_xmit_skb(skb, qdisc_dev(qdisc));
+
+ return skb;
+}
+
+static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
+ struct sk_buff *head)
+{
+ struct sk_buff *new, *skb = head;
+ struct netdev_queue *txq = q->dev_queue;
+ int bytelimit = netdev_tx_avail_queue(txq);
+ int limit = qdisc_bulk_dequeue_limit;
+
+ bytelimit -= skb->len; /* incorrect len if skb->next, but exits below */
+
+ // if (bytelimit < psched_mtu(qdisc_dev(q)) //proposed by fw
+ if (bytelimit < 0)
+ return head;
+
+ do {
+ if (skb->next || skb_is_gso(skb)) {
+ /* Stop processing if the skb is already a skb
+ * list (e.g a segmented GSO packet, from
+ * below or func caller) or a real GSO packet
+ */
+ break;
+ }
+ new = q->dequeue(q);
+ if (new) {
+ bytelimit -= new->len; /* covers GSO len */
+ new = validate_xmit_skb(new, qdisc_dev(q));
+ if (!new)
+ break;
+ /* "new" can be a skb list after validate call
+ * above (GSO segmented), but it is okay to
+ * append it to current skb->next, because
+ * next round will exit in-case "new" were a
+ * skb list.
+ */
+ skb->next = new;
+ skb = new;
+ }
+ } while (new && --limit && (bytelimit >= 0));
+ skb = head;
+
+ return skb;
+}
+
+/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
+ * A requeued skb (via q->gso_skb) can also be a SKB list.
+ */
static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
struct sk_buff *skb = q->gso_skb;
@@ -71,9 +135,9 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
skb = NULL;
} else {
if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
- skb = q->dequeue(q);
- if (skb)
- skb = validate_xmit_skb(skb, qdisc_dev(q));
+ skb = qdisc_dequeue_validate(q);
+ if (skb && qdisc_may_bulk(q) && qdisc_bulk_dequeue_limit)
+ skb = qdisc_bulk_dequeue_skb(q, skb);
}
}
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 20:49 [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-19 20:49 ` [RFC net-next PATCH V3 1/2] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
2014-09-19 20:49 ` [RFC net-next PATCH V3 2/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
@ 2014-09-19 20:59 ` Eric Dumazet
2014-09-19 21:28 ` Dave Taht
` (2 more replies)
2 siblings, 3 replies; 11+ messages in thread
From: Eric Dumazet @ 2014-09-19 20:59 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Fri, 2014-09-19 at 22:49 +0200, Jesper Dangaard Brouer wrote:
> This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
> from the qdisc layer, to implement dequeue bulking.
>
> RFC V3: Keeping the ball rolling.
>
> This patchset should now use BQL correctly. I've done lots of testing
> for Head-of-Line blocking issues that can occur due to requeue of a
> SKB bulk list. I've not been able to provoke any HoL blocking
> situation, simply because BQL is doing such a good job, thus I'm
> unable to "overshoot" HW/BQL limits with more than a single packet.
>
> This patch chooses a very conservative approach, as by default only
> allowing dequeue of one extra packet, besides the normal dequeue.
>
> Open questions:
>
> - How do we expose tuning to userspace?
> Patch adds /proc/sys/net/core/qdisc_bulk_dequeue_limit but I don't like it...
> Per device tunable?
>
bql is using /sys, of course ;)
# grep . /sys/class/net/eth1/queues/tx-0/byte_queue_limits/*
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/hold_time:1000
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:0
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:113314
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_max:1879048192
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_min:0
Maybe you could simply reuse byte_queue_limits/limit, I am not sure we
need a specific tunable.
> - Can/should we limit dequeue bulking to devices supporting BQL?
>
Yes please. This will be an incentive to get BQL on drivers.
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 20:59 ` [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Eric Dumazet
@ 2014-09-19 21:28 ` Dave Taht
2014-09-19 21:31 ` Jesper Dangaard Brouer
2014-09-20 0:31 ` Tom Herbert
2 siblings, 0 replies; 11+ messages in thread
From: Dave Taht @ 2014-09-19 21:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, netdev@vger.kernel.org, David S. Miller,
Tom Herbert, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann, Jamal Hadi Salim, Alexander Duyck,
John Fastabend, Toke Høiland-Jørgensen
On Fri, Sep 19, 2014 at 11:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-19 at 22:49 +0200, Jesper Dangaard Brouer wrote:
>> This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
>> from the qdisc layer, to implement dequeue bulking.
>>
>> RFC V3: Keeping the ball rolling.
>>
>> This patchset should now use BQL correctly. I've done lots of testing
>> for Head-of-Line blocking issues that can occur due to requeue of a
>> SKB bulk list. I've not been able to provoke any HoL blocking
>> situation, simply because BQL is doing such a good job, thus I'm
>> unable to "overshoot" HW/BQL limits with more than a single packet.
>>
>> This patch chooses a very conservative approach, as by default only
>> allowing dequeue of one extra packet, besides the normal dequeue.
>>
>> Open questions:
>>
>> - How do we expose tuning to userspace?
>> Patch adds /proc/sys/net/core/qdisc_bulk_dequeue_limit but I don't like it...
>> Per device tunable?
>>
>
> bql is using /sys, of course ;)
>
> # grep . /sys/class/net/eth1/queues/tx-0/byte_queue_limits/*
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/hold_time:1000
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:113314
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_max:1879048192
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_min:0
>
> Maybe you could simply reuse byte_queue_limits/limit, I am not sure we
> need a specific tunable.
>
>
>> - Can/should we limit dequeue bulking to devices supporting BQL?
>>
>
> Yes please. This will be an incentive to get BQL on drivers.
While I am +100 on getting more BQL'd drivers out there...
https://www.bufferbloat.net/projects/bloat/wiki/BQL_enabled_drivers
... It was my hope that some of this bulk dequeue work would apply to
wifi packet aggregation one day....
>
>
>
--
Dave Täht
https://www.bufferbloat.net/projects/make-wifi-fast
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 20:59 ` [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Eric Dumazet
2014-09-19 21:28 ` Dave Taht
@ 2014-09-19 21:31 ` Jesper Dangaard Brouer
2014-09-19 21:45 ` Eric Dumazet
2014-09-20 0:31 ` Tom Herbert
2 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-19 21:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke, brouer
On Fri, 19 Sep 2014 13:59:20 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-19 at 22:49 +0200, Jesper Dangaard Brouer wrote:
> > This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
> > from the qdisc layer, to implement dequeue bulking.
> >
> > RFC V3: Keeping the ball rolling.
> >
> > This patchset should now use BQL correctly. I've done lots of testing
> > for Head-of-Line blocking issues that can occur due to requeue of a
> > SKB bulk list. I've not been able to provoke any HoL blocking
> > situation, simply because BQL is doing such a good job, thus I'm
> > unable to "overshoot" HW/BQL limits with more than a single packet.
> >
> > This patch chooses a very conservative approach, as by default only
> > allowing dequeue of one extra packet, besides the normal dequeue.
> >
> > Open questions:
> >
> > - How do we expose tuning to userspace?
> > Patch adds /proc/sys/net/core/qdisc_bulk_dequeue_limit but I don't like it...
> > Per device tunable?
> >
>
> bql is using /sys, of course ;)
>
> # grep . /sys/class/net/eth1/queues/tx-0/byte_queue_limits/*
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/hold_time:1000
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:113314
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_max:1879048192
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_min:0
>
> Maybe you could simply reuse byte_queue_limits/limit, I am not sure we
> need a specific tunable.
It would make sense. I've been running tests with qdisc_bulk_dequeue_limit=100,
and I'm being saved/limited by the BQL limit.
Perhaps we should still keep some upper bound on the number of packets,
e.g. 32, as this does influence the number of HW ring descriptors we
use before "flushing"/notifying HW via the tailptr write.
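The tailptr concern above can be pictured with a toy model of the
skb->xmit_more convention: the driver only rings the HW doorbell
(tailptr write) for an skb whose xmit_more flag is clear, so a bulk of N
packets costs a single doorbell. The helper below is purely
illustrative, not driver code:

```c
#include <assert.h>

/* Count doorbell (tailptr) writes for a sequence of packets, where
 * xmit_more[i] != 0 means "more packets follow, defer the doorbell". */
static int doorbells_for_bulk(const int *xmit_more, int npkts)
{
	int doorbells = 0, i;

	for (i = 0; i < npkts; i++)
		if (!xmit_more[i])	/* last skb of a bulk: notify HW */
			doorbells++;
	return doorbells;
}
```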
>
> > - Can/should we limit dequeue bulking to devices supporting BQL?
> >
>
> Yes please. This will be an incentive to get BQL on drivers.
How can I test if the dev supports BQL?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 21:31 ` Jesper Dangaard Brouer
@ 2014-09-19 21:45 ` Eric Dumazet
2014-09-19 22:39 ` Cong Wang
0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2014-09-19 21:45 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
Alexander Duyck, John Fastabend, Dave Taht, toke
On Fri, 2014-09-19 at 23:31 +0200, Jesper Dangaard Brouer wrote:
> How can I test if the dev supports BQL?
CONFIG_BQL should be enough.
If a device driver does nothing about BQL, byte_queue_limits/limit will
be zero by default.
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 21:45 ` Eric Dumazet
@ 2014-09-19 22:39 ` Cong Wang
2014-09-19 22:52 ` Eric Dumazet
0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2014-09-19 22:39 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
On Fri, Sep 19, 2014 at 2:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-19 at 23:31 +0200, Jesper Dangaard Brouer wrote:
>
>> How can I test if the dev supports BQL?
>
>
> CONFIG_BQL should be enough.
Then why commits like this one?
commit 7070ce0a6419a118842298bc967061ad6cea40db
Author: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Sat Sep 28 06:00:37 2013 +0000
i40e: Add support for Tx byte queue limits
Implement BQL (byte queue limit) support in i40e.
>
> If a device driver do nothing about BQL, byte_queue_limits/limit will be
> zero by default.
Also note that for some virtual device, BQL may make no sense at all,
at least veth doesn't even use a TX queue (I know it has tx-0 mostly for
qdisc). Perhaps all NETIF_F_LLTX ones?
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 22:39 ` Cong Wang
@ 2014-09-19 22:52 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2014-09-19 22:52 UTC (permalink / raw)
To: Cong Wang
Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
toke
On Fri, 2014-09-19 at 15:39 -0700, Cong Wang wrote:
> Then why commits like this one?
>
> commit 7070ce0a6419a118842298bc967061ad6cea40db
> Author: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Sat Sep 28 06:00:37 2013 +0000
>
> i40e: Add support for Tx byte queue limits
>
> Implement BQL (byte queue limit) support in i40e.
>
Read again my full mail, don't jump to conclusions in the middle of
it.
>
> >
> > If a device driver do nothing about BQL, byte_queue_limits/limit will be
> > zero by default.
See here? It's all explained. If a device driver does not implement BQL,
limit will be zero, and bulk dequeue will _automatically_ be disabled.
BQL has no 'cost' if a driver does not use it.
Its basically fields which will never be updated.
(cf in struct netdev_queue :
struct dql dql;
)
>
> Also note that for some virtual device, BQL may make no sense at all,
> at least veth doesn't even use a TX queue (I know it has tx-0 mostly for
> qdisc). Perhaps all NETIF_F_LLTX ones?
Let's focus on the normal cases, where virtual devices do not have any
qdisc.
We are talking here of _qdisc_ bulk dequeue, right?
* Re: [RFC net-next PATCH V3 2/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
2014-09-19 20:49 ` [RFC net-next PATCH V3 2/2] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
@ 2014-09-20 0:22 ` Tom Herbert
0 siblings, 0 replies; 11+ messages in thread
From: Tom Herbert @ 2014-09-20 0:22 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Linux Netdev List, David S. Miller, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen
On Fri, Sep 19, 2014 at 1:49 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> sending/processing an entire skb list.
>
> This patch implements qdisc bulk dequeue, by allowing multiple packets
> to be dequeued in dequeue_skb().
>
> The optimization principle for this is twofold: (1) amortize locking
> cost and (2) avoid the expensive tailptr update for notifying HW.
> (1) Several packets are dequeued while holding the qdisc root_lock,
> amortizing locking cost over several packets. The dequeued SKB list is
> processed under the TXQ lock in dev_hard_start_xmit(), thus also
> amortizing the cost of the TXQ lock.
> (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
> API to delay the HW tailptr update, which also reduces the cost per
> packet.
>
> One restriction of the new API is that every SKB must belong to the
> same TXQ. This patch takes the easy way out, by restricting bulk
> dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
> the qdisc has only a single TXQ attached.
>
> Some detail about the flow; dev_hard_start_xmit() will process the skb
> list, and transmit packets individually towards the driver (see
> xmit_one()). In case the driver stops midway in the list, the
> remaining skb list is returned by dev_hard_start_xmit(). In
> sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
>
> To avoid overshooting the HW limits, which results in requeuing, the
> patch limits the number of bytes dequeued, based on the driver's BQL
> limit.
>
> It also treats GSO and segmented GSO packets as a separate kind of SKB
> bulking, and thus stops dequeuing when seeing a GSO packet (both real
> GSO and segmented GSO skb lists).
>
> Choosing a very conservative initial default value of 1 extra packet
> dequeue, if bulking is allowed.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> ---
> V2:
> - Restructure functions, split out functionality
> - Use BQL bytelimit to avoid overshooting driver limits
>
> V3:
> - Correct use of BQL
> - Some minor adjustments based on feedback.
> - Default setting only bulk dequeue 1 extra packet (2 packets).
>
> include/net/sch_generic.h | 2 +
> net/core/sysctl_net_core.c | 9 ++++++
> net/sched/sch_generic.c | 70 ++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 78 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 1e89b9a..da9324f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -14,6 +14,8 @@ struct qdisc_walker;
> struct tcf_walker;
> struct module;
>
> +extern int qdisc_bulk_dequeue_limit;
> +
> struct qdisc_rate_table {
> struct tc_ratespec rate;
> u32 data[256];
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cf9cd13..5505841 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -361,6 +361,15 @@ static struct ctl_table net_core_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "qdisc_bulk_dequeue_limit",
> + .data = &qdisc_bulk_dequeue_limit,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .extra1 = &zero,
> + .extra2 = &ushort_max,
> + .proc_handler = proc_dointvec_minmax
> + },
> { }
> };
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 346ef85..eee2280 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -34,6 +34,9 @@
> const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
> EXPORT_SYMBOL(default_qdisc_ops);
>
> +// FIXME: Still undecided where to put this parameter...
> +int qdisc_bulk_dequeue_limit __read_mostly = 1;
> +
> /* Main transmission queue. */
>
> /* Modifications to data participating in scheduling must be protected with
> @@ -56,6 +59,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> return 0;
> }
>
> +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> +{
> + return qdisc->flags & TCQ_F_ONETXQUEUE;
> +}
> +
> +static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
> +{
> + struct sk_buff *skb = qdisc->dequeue(qdisc);
> +
> + if (skb != NULL)
> + skb = validate_xmit_skb(skb, qdisc_dev(qdisc));
> +
> + return skb;
> +}
> +
> +static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
> + struct sk_buff *head)
> +{
> + struct sk_buff *new, *skb = head;
> + struct netdev_queue *txq = q->dev_queue;
> + int bytelimit = netdev_tx_avail_queue(txq);
> + int limit = qdisc_bulk_dequeue_limit;
> +
> + bytelimit -= skb->len; /* incorrect len if skb->next, but exits below */
> +
> + // if (bytelimit < psched_mtu(qdisc_dev(q)) //proposed by fw
> + if (bytelimit < 0)
> + return head;
> +
> + do {
> + if (skb->next || skb_is_gso(skb)) {
> + /* Stop processing if the skb is already a skb
> + * list (e.g a segmented GSO packet, from
> + * below or func caller) or a real GSO packet
> + */
> + break;
> + }
> + new = q->dequeue(q);
> + if (new) {
> + bytelimit -= new->len; /* covers GSO len */
> + new = validate_xmit_skb(new, qdisc_dev(q));
> + if (!new)
> + break;
> + /* "new" can be a skb list after validate call
> + * above (GSO segmented), but it is okay to
> + * append it to current skb->next, because
> + * next round will exit in-case "new" were a
> + * skb list.
> + */
> + skb->next = new;
> + skb = new;
> + }
> + } while (new && --limit && (bytelimit >= 0));
> + skb = head;
> +
> + return skb;
> +}
> +
> +/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
> + * A requeued skb (via q->gso_skb) can also be a SKB list.
> + */
> static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> {
> struct sk_buff *skb = q->gso_skb;
> @@ -71,9 +135,9 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
> skb = NULL;
> } else {
> if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> - skb = q->dequeue(q);
> - if (skb)
> - skb = validate_xmit_skb(skb, qdisc_dev(q));
> + skb = qdisc_dequeue_validate(q);
> + if (skb && qdisc_may_bulk(q) && qdisc_bulk_dequeue_limit)
> + skb = qdisc_bulk_dequeue_skb(q, skb);
> }
This is hard to read. I don't understand why qdisc_bulk_dequeue_skb
needs to be a separate function. Single dequeue and bulk dequeue can
happen in one common loop. Maybe something like:
struct sk_buff *ret = NULL, **next = &ret;
bytelimit = qdisc_may_bulk(q) ? netdev_tx_avail_queue(txq) : 0;
do {
skb = q->dequeue(q);
if (!skb)
break;
*next = skb;
next = &skb->next;
bytelimit -= skb->len;
limit--;
} while (bytelimit > 0 && limit > 0);
> }
>
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates
2014-09-19 20:59 ` [RFC net-next PATCH V3 0/2] qdisc bulk dequeuing and utilizing delayed tailptr updates Eric Dumazet
2014-09-19 21:28 ` Dave Taht
2014-09-19 21:31 ` Jesper Dangaard Brouer
@ 2014-09-20 0:31 ` Tom Herbert
2 siblings, 0 replies; 11+ messages in thread
From: Tom Herbert @ 2014-09-20 0:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Linux Netdev List, David S. Miller,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
Toke Høiland-Jørgensen, Tino Reichardt
On Fri, Sep 19, 2014 at 1:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-09-19 at 22:49 +0200, Jesper Dangaard Brouer wrote:
>> This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
>> from the qdisc layer, to implement dequeue bulking.
>>
>> RFC V3: Keeping the ball rolling.
>>
>> This patchset should now use BQL correctly. I've done lots of testing
>> for Head-of-Line blocking issues that can occur due to requeue of a
>> SKB bulk list. I've not been able to provoke any HoL blocking
>> situation, simply because BQL is doing such a good job, thus I'm
>> unable to "overshoot" HW/BQL limits with more than a single packet.
>>
>> This patch chooses a very conservative approach, as by default only
>> allowing dequeue of one extra packet, besides the normal dequeue.
>>
>> Open questions:
>>
>> - How do we expose tuning to userspace?
>> Patch adds /proc/sys/net/core/qdisc_bulk_dequeue_limit but I don't like it...
>> Per device tunable?
>>
>
> bql is using /sys, of course ;)
>
> # grep . /sys/class/net/eth1/queues/tx-0/byte_queue_limits/*
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/hold_time:1000
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/inflight:0
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:113314
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_max:1879048192
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit_min:0
>
> Maybe you could simply reuse byte_queue_limits/limit, I am not sure we
> need a specific tunable.
>
>
>> - Can/should we limit dequeue bulking to devices supporting BQL?
>>
>
> Yes please. This will be an incentive to get BQL on drivers.
>
I think we can refine the expectation a bit. Bulk qdisc dequeue should
work for devices that don't support BQL, but then the dequeue limit is
a fixed number of packets (we don't want to create new mechanisms to
track fragments or packets that are outstanding in the device). However,
in order to get bulk device transmit for a device, I think it is
reasonable to request that BQL is supported, since whoever is adding
that is already mucking in the driver TX path anyway.
Tom
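For illustration, the policy proposed above can be condensed into a
small predicate; may_bulk_more and its parameters are hypothetical
stand-ins for illustration, not anything in the patchset:

```c
#include <assert.h>

/* Decide whether another packet may be bulked: every device gets a
 * fixed packet budget, while BQL-capable devices additionally get the
 * byte budget reported by dql_avail(). */
static int may_bulk_more(int has_bql, int pkts_dequeued, int pkt_limit,
			 int bytes_dequeued, int bql_avail_bytes)
{
	if (pkts_dequeued >= pkt_limit)		/* fixed packet budget */
		return 0;
	if (has_bql && bytes_dequeued >= bql_avail_bytes)
		return 0;			/* BQL byte budget spent */
	return 1;
}
```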
>
>