[RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates
@ 2014-09-04 12:54 Jesper Dangaard Brouer
  2014-09-04 12:54 ` [RFC net-next PATCH V2 1/3] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-04 12:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend

Wanted people to review this work early... so this is the current
state, even added my debug patch, if people want to "see" it work.

I'm currently testing with different combination of netperf, and
different on/off adjustments of {gso,gro,tso} as this influences the
code path / skb list generation vs. real GSOs.

Open questions:

- For now set bulk limit to 6 packets (counting including the head),
  this need some user adjustable param.  I'm open to suggestions?
  Perhaps we should start with 2 packets, or disable it as default?

- We are not doing proper accounting for weight_p/quota in
  __qdisc_run().  Should we reduce change this.  Open to suggestions.

---

Jesper Dangaard Brouer (2):
      qdisc: debug statements while testing prev-patch
      qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE

Tom Herbert (1):
      net: Functions to report space available in device TX queues


 include/linux/netdevice.h |   28 ++++++++++++-
 net/sched/sch_generic.c   |   95 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 118 insertions(+), 5 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC net-next PATCH V2 1/3] net: Functions to report space available in device TX queues
  2014-09-04 12:54 [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
@ 2014-09-04 12:54 ` Jesper Dangaard Brouer
  2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-04 12:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend

From: Tom Herbert <therbert@google.com>

This patch adds netdev_tx_avail_queue and netdev_avail_queue which are
used to report number of bytes available in transmit queues per BQL. The
functions call dql_avail which returns BQL limit minus number of
inflight bytes. These functions can be called without txlock, for
instance to ascertain how much data should be dequeued from a qdisc in
a batch. When called without the tx_lock, the result is technically a
hint, subsequently when the tx_lock is done for a transmit it is
possible the availability has changed (for example a transmit
completion may have freed up more space in the queue or changed the
limit).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
 - Fixed spelling in comments

 include/linux/netdevice.h |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5be20a7..5a4384a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2567,6 +2567,30 @@ static inline void netdev_completed_queue(struct net_device *dev,
 	netdev_tx_completed_queue(netdev_get_tx_queue(dev, 0), pkts, bytes);
 }
 
+static inline int netdev_tx_avail_queue(struct netdev_queue *dev_queue)
+{
+#ifdef CONFIG_BQL
+	return dql_avail(&dev_queue->dql);
+#else
+	return DQL_MAX_LIMIT;
+#endif
+}
+
+/**
+ *	netdev_avail_queue - report how much space is available for xmit
+ *	@dev: network device
+ *
+ *	Report the amount of space available in the TX queue in terms of
+ *	number of bytes. This returns the number of bytes available per
+ *	DQL. This function may be called without taking the txlock on
+ *	the device, however in that case the result should be taken as
+ *	a (strong) hint.
+ */
+static inline int netdev_avail_queue(struct net_device *dev)
+{
+	return netdev_tx_avail_queue(netdev_get_tx_queue(dev, 0));
+}
+
 static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 {
 #ifdef CONFIG_BQL
@@ -2582,9 +2606,9 @@ static inline void netdev_tx_reset_queue(struct netdev_queue *q)
  * 	Reset the bytes and packet count of a network device and clear the
  * 	software flow control OFF bit for this network device
  */
-static inline void netdev_reset_queue(struct net_device *dev_queue)
+static inline void netdev_reset_queue(struct net_device *dev)
 {
-	netdev_tx_reset_queue(netdev_get_tx_queue(dev_queue, 0));
+	netdev_tx_reset_queue(netdev_get_tx_queue(dev, 0));
 }
 
 /**

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 12:54 [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
  2014-09-04 12:54 ` [RFC net-next PATCH V2 1/3] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
@ 2014-09-04 12:55 ` Jesper Dangaard Brouer
  2014-09-04 13:17   ` Florian Westphal
                     ` (2 more replies)
  2014-09-04 12:56 ` [RFC net-next PATCH V2 3/3] qdisc: debug statements while testing prev-patch Jesper Dangaard Brouer
  2014-09-04 13:44 ` [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jamal Hadi Salim
  3 siblings, 3 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-04 12:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend

Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
 (1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet.  The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
 (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ.  This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()).  In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit().  In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

The patch also tries to limit the amount of bytes dequeued, based on
the drivers BQL limits.  It also tries to avoid and stop dequeuing
when seeing a GSO packet (both real GSO and segmented GSO skb lists).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
V2:
 - Restruct functions, split out functionality
 - Use BQL bytelimit to avoid overshooting driver limits, causing
   too large skb lists to be sitting on the requeue gso_skb.

 net/sched/sch_generic.c |   67 +++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 19696eb..a0c8070 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	return 0;
 }
 
+static inline bool qdisc_may_bulk(const struct Qdisc *qdisc,
+				  const struct sk_buff *skb)
+{
+	return (qdisc->flags & TCQ_F_ONETXQUEUE);
+}
+
+static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
+{
+	struct sk_buff *skb = qdisc->dequeue(qdisc);
+
+	if (skb != NULL)
+		skb = validate_xmit_skb(skb, qdisc_dev(qdisc));
+
+	return skb;
+}
+
+static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
+						     struct sk_buff *head)
+{
+	struct sk_buff *new, *skb = head;
+	struct netdev_queue *txq = q->dev_queue;
+	int bytelimit = netdev_tx_avail_queue(txq);
+	int limit = 5;
+
+	if (bytelimit <= 0)
+		return head;
+
+	do {
+		if (skb->next || skb_is_gso(skb)) {
+			/* Stop processing if the skb is already a skb
+			 * list (e.g a segmented GSO packet) or a real
+			 * GSO packet */
+			break;
+		}
+		new = qdisc_dequeue_validate(q);
+		if (new) {
+			skb->next = new;
+			skb = new;
+			bytelimit -= skb->len;
+			cnt++;
+			/* One problem here is it is difficult to
+			 * requeue the "new" dequeued skb, e.g. in
+			 * case of GSO, thus a "normal" packet can
+			 * have a GSO packet on its ->next ptr.
+			 *
+			 * Requeue is difficult because if requeuing
+			 * on q->gso_skb, then a second requeue can
+			 * happen from sch_direct_xmit e.g. if driver
+			 * returns NETDEV_TX_BUSY, which would
+			 * overwrite this requeue.
+			 */
+		}
+	} while (new && --limit && (bytelimit > 0));
+	skb = head;
+
+	return skb;
+}
+
+/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
+ * A requeued skb (via q->gso_skb) can also be a SKB list.
+ */
 static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 {
 	struct sk_buff *skb = q->gso_skb;
@@ -71,9 +132,9 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
 			skb = NULL;
 	} else {
 		if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
-			skb = q->dequeue(q);
-			if (skb)
-				skb = validate_xmit_skb(skb, qdisc_dev(q));
+			skb = qdisc_dequeue_validate(q);
+			if (skb && qdisc_may_bulk(q, skb))
+				skb = qdisc_bulk_dequeue_skb(q, skb);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
@ 2014-09-04 13:17   ` Florian Westphal
  2014-09-04 17:09     ` Cong Wang
  2014-09-04 13:29   ` Eric Dumazet
  2014-09-04 16:59   ` Tom Herbert
  2 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2014-09-04 13:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David S. Miller, Tom Herbert, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Jamal Hadi Salim, Alexander Duyck, John Fastabend

Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> Some detail about the flow; dev_hard_start_xmit() will process the skb
> list, and transmit packets individually towards the driver (see
> xmit_one()).  In case the driver stops midway in the list, the
> remaining skb list is returned by dev_hard_start_xmit().  In
> sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
> 
> The patch also tries to limit the amount of bytes dequeued, based on
> the drivers BQL limits.  It also tries to avoid and stop dequeuing
> when seeing a GSO packet (both real GSO and segmented GSO skb lists).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> +static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
> +						     struct sk_buff *head)
> +{
> +	struct sk_buff *new, *skb = head;
> +	struct netdev_queue *txq = q->dev_queue;
> +	int bytelimit = netdev_tx_avail_queue(txq);
> +	int limit = 5;
> +
> +	if (bytelimit <= 0)
> +		return head;

bytelimit < psched_mtu(qdisc_dev(q)) ?
It would avoid overshooting bql on next dequeue operation.

> +			 * Requeue is difficult because if requeuing
> +			 * on q->gso_skb, then a second requeue can
> +			 * happen from sch_direct_xmit e.g. if driver
> +			 * returns NETDEV_TX_BUSY, which would
> +			 * overwrite this requeue.
> +			 */

Perhaps we should just free the entire list in such a case;
it would also permit extending this to qdisc with
multiple tx queues later (since you could requeue skb
if the txq mapping changes)?

It would also allow use of qdisc_peek helpers to only dequeue
skb when it is suitable for bulking.

Unrelated to your patch: q->gso_skb should be renamed.
It has nothing to do with gso anymore...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 13:17   ` Florian Westphal
@ 2014-09-04 17:09     ` Cong Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2014-09-04 17:09 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Daniel Borkmann,
	Jamal Hadi Salim, Alexander Duyck, John Fastabend

On Thu, Sep 4, 2014 at 6:17 AM, Florian Westphal <fw@strlen.de> wrote:
>
> Unrelated to your patch: q->gso_skb should be renamed.
> It has nothing to do with gso anymore...
>

I have a pending patch which renames it to ->dequeued_skb.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
  2014-09-04 13:17   ` Florian Westphal
@ 2014-09-04 13:29   ` Eric Dumazet
  2014-09-05  8:28     ` Jesper Dangaard Brouer
  2014-09-04 16:59   ` Tom Herbert
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2014-09-04 13:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
	Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
	Alexander Duyck, John Fastabend

On Thu, 2014-09-04 at 14:55 +0200, Jesper Dangaard Brouer wrote:
> Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> sending/processing an entire skb list.
> 
> This patch implements qdisc bulk dequeue, by allowing multiple packets
> to be dequeued in dequeue_skb().
> 
> The optimization principle for this is two fold, (1) to amortize
> locking cost and (2) avoid expensive tailptr update for notifying HW.
>  (1) Several packets are dequeued while holding the qdisc root_lock,
> amortizing locking cost over several packet.  The dequeued SKB list is
> processed under the TXQ lock in dev_hard_start_xmit(), thus also
> amortizing the cost of the TXQ lock.
>  (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
> API to delay HW tailptr update, which also reduces the cost per
> packet.
> 
> One restriction of the new API is that every SKB must belong to the
> same TXQ.  This patch takes the easy way out, by restricting bulk
> dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
> qdisc only have attached a single TXQ.
> 
> Some detail about the flow; dev_hard_start_xmit() will process the skb
> list, and transmit packets individually towards the driver (see
> xmit_one()).  In case the driver stops midway in the list, the
> remaining skb list is returned by dev_hard_start_xmit().  In
> sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
> 
> The patch also tries to limit the amount of bytes dequeued, based on
> the drivers BQL limits.  It also tries to avoid and stop dequeuing
> when seeing a GSO packet (both real GSO and segmented GSO skb lists).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> ---
> V2:
>  - Restruct functions, split out functionality
>  - Use BQL bytelimit to avoid overshooting driver limits, causing
>    too large skb lists to be sitting on the requeue gso_skb.
> 
>  net/sched/sch_generic.c |   67 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 64 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 19696eb..a0c8070 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -56,6 +56,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  	return 0;
>  }
>  
> +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc,
> +				  const struct sk_buff *skb)

Why skb is passed here ?

> +{
> +	return (qdisc->flags & TCQ_F_ONETXQUEUE);
	
	return qdisc->flags & TCQ_F_ONETXQUEUE;

> +}
> +
> +static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
> +{
> +	struct sk_buff *skb = qdisc->dequeue(qdisc);
> +
> +	if (skb != NULL)
> +		skb = validate_xmit_skb(skb, qdisc_dev(qdisc));

> +
> +	return skb;
> +}
> +
> +static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
> +						     struct sk_buff *head)
> +{
> +	struct sk_buff *new, *skb = head;
> +	struct netdev_queue *txq = q->dev_queue;
> +	int bytelimit = netdev_tx_avail_queue(txq);
> +	int limit = 5;
> +
> +	if (bytelimit <= 0)
> +		return head;
> +
> +	do {
> +		if (skb->next || skb_is_gso(skb)) {

A GSO packet should not 'stop the processing' if the device is TSO
capable : It should look as a normal packet.

> +			/* Stop processing if the skb is already a skb
> +			 * list (e.g a segmented GSO packet) or a real
> +			 * GSO packet */
> +			break;
> +		}
> +		new = qdisc_dequeue_validate(q);

Thats broken.

After validate_xmit_skb()  @new might be the head of a list. (GSO split)



> +		if (new) {
> +			skb->next = new;


> +			skb = new;

and new is only the first element on the list.

> +			bytelimit -= skb->len;

skb->len is the length of first segment.

> +			cnt++;
> +			/* One problem here is it is difficult to
> +			 * requeue the "new" dequeued skb, e.g. in
> +			 * case of GSO, thus a "normal" packet can
> +			 * have a GSO packet on its ->next ptr.
> +			 *
> +			 * Requeue is difficult because if requeuing
> +			 * on q->gso_skb, then a second requeue can
> +			 * happen from sch_direct_xmit e.g. if driver
> +			 * returns NETDEV_TX_BUSY, which would
> +			 * overwrite this requeue.
> +			 */

It should not overwrite, but insert back. Thats what need to be done.

> +		}
> +	} while (new && --limit && (bytelimit > 0));
> +	skb = head;
> +
> +	return skb;
> +}
> +


Idea is really the following :

Once qdisc dequeue bulk is done, and skb validated, we have a list of
skb to send to the device.

We iterate the list and try to send the individual skb.

As soon as device returns NETDEV_TX_BUSY (or even better, when the queue
is stopped), we requeue the remaining list.

BQL never tried to find the exact split point :

If available budget is 25000 bytes, and next TSO packet is 45000 bytes,
we send it.

So the bql validation should be done before the eventual
validate_xmit_skb() : This way you dont care of GSO or TSO.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 13:29   ` Eric Dumazet
@ 2014-09-05  8:28     ` Jesper Dangaard Brouer
  2014-09-06 12:55       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-05  8:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
	Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
	Alexander Duyck, John Fastabend, brouer

On Thu, 04 Sep 2014 06:29:43 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2014-09-04 at 14:55 +0200, Jesper Dangaard Brouer wrote:
> > Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> > sending/processing an entire skb list.
> > 
> > This patch implements qdisc bulk dequeue, by allowing multiple packets
> > to be dequeued in dequeue_skb().
> > 
> > The optimization principle for this is two fold, (1) to amortize
> > locking cost and (2) avoid expensive tailptr update for notifying HW.
> >  (1) Several packets are dequeued while holding the qdisc root_lock,
> > amortizing locking cost over several packet.  The dequeued SKB list is
> > processed under the TXQ lock in dev_hard_start_xmit(), thus also
> > amortizing the cost of the TXQ lock.
> >  (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
> > API to delay HW tailptr update, which also reduces the cost per
> > packet.
> > 
> > One restriction of the new API is that every SKB must belong to the
> > same TXQ.  This patch takes the easy way out, by restricting bulk
> > dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
> > qdisc only have attached a single TXQ.
> > 
> > Some detail about the flow; dev_hard_start_xmit() will process the skb
> > list, and transmit packets individually towards the driver (see
> > xmit_one()).  In case the driver stops midway in the list, the
> > remaining skb list is returned by dev_hard_start_xmit().  In
> > sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
> > 
> > The patch also tries to limit the amount of bytes dequeued, based on
> > the drivers BQL limits.  It also tries to avoid and stop dequeuing
> > when seeing a GSO packet (both real GSO and segmented GSO skb lists).
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > 
> > ---
> > V2:
> >  - Restruct functions, split out functionality
> >  - Use BQL bytelimit to avoid overshooting driver limits, causing
> >    too large skb lists to be sitting on the requeue gso_skb.
> > 
> >  net/sched/sch_generic.c |   67 +++++++++++++++++++++++++++++++++++++++++++++--
> >  1 files changed, 64 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 19696eb..a0c8070 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -56,6 +56,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >  	return 0;
> >  }
> >  
> > +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc,
> > +				  const struct sk_buff *skb)
> 
> Why skb is passed here ?

Just a left over, when the func did more checking.

> > +{
> > +	return (qdisc->flags & TCQ_F_ONETXQUEUE);
> 	
> 	return qdisc->flags & TCQ_F_ONETXQUEUE;

Sure, nitpick ;-)


> > +}
> > +
> > +static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
> > +{
> > +	struct sk_buff *skb = qdisc->dequeue(qdisc);
> > +
> > +	if (skb != NULL)
> > +		skb = validate_xmit_skb(skb, qdisc_dev(qdisc));
> 
> > +
> > +	return skb;
> > +}
> > +
> > +static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
> > +						     struct sk_buff *head)
> > +{
> > +	struct sk_buff *new, *skb = head;
> > +	struct netdev_queue *txq = q->dev_queue;
> > +	int bytelimit = netdev_tx_avail_queue(txq);
> > +	int limit = 5;
> > +
> > +	if (bytelimit <= 0)
> > +		return head;
> > +
> > +	do {
> > +		if (skb->next || skb_is_gso(skb)) {
> 
> A GSO packet should not 'stop the processing' if the device is TSO
> capable : It should look as a normal packet.

So, I cannot see if it is a TSO packet via the skb_is_gso(skb) check?
Will skb->len be the length of the full TSO packet size?

This check (skb->next) also guards against new=qdisc_dequeue_validate()
returned a list (GSO segmenting), in last "round" (now here skb=new),
as I then can assign the next "new" packet directly to skb->next=new
(without checking if new is a skb list).


> > +			/* Stop processing if the skb is already a skb
> > +			 * list (e.g a segmented GSO packet) or a real
> > +			 * GSO packet */
> > +			break;
> > +		}
> > +		new = qdisc_dequeue_validate(q);
> 
> Thats broken.
> 
> After validate_xmit_skb()  @new might be the head of a list. (GSO split)

As described above, I think, I'm handling this case, as I don't allow
further processing if I detect that this was an skb list.
 
> 
> 
> > +		if (new) {
> > +			skb->next = new;
> 
> 
> > +			skb = new;
> 
> and new is only the first element on the list.
> 
> > +			bytelimit -= skb->len;
> 
> skb->len is the length of first segment.

Damn, yes, then I will have to walk the "new" skb in-case it is a list,
I was hoping to avoid that.


> > +			cnt++;
(ups, debug left over)

> > +			/* One problem here is it is difficult to
> > +			 * requeue the "new" dequeued skb, e.g. in
> > +			 * case of GSO, thus a "normal" packet can
> > +			 * have a GSO packet on its ->next ptr.
> > +			 *
> > +			 * Requeue is difficult because if requeuing
> > +			 * on q->gso_skb, then a second requeue can
> > +			 * happen from sch_direct_xmit e.g. if driver
> > +			 * returns NETDEV_TX_BUSY, which would
> > +			 * overwrite this requeue.
> > +			 */
> 
> It should not overwrite, but insert back. Thats what need to be done.

Okay, guess that should/could be handled in dev_requeue_skb().

In this case with (TCQ_F_ONETXQUEUE) the end result of not requeuing
here is almost the same.  In case dev_hard_start_xmit() didn't xmit the
entire list, it will be placed back on the requeue (q->gso_skb) anyway.
Thus we might be better off just sending this to the device, instead of
requeuing explicitly here.


> > +		}
> > +	} while (new && --limit && (bytelimit > 0));
> > +	skb = head;
> > +
> > +	return skb;
> > +}
> > +
> 
> 
> Idea is really the following :
> 
> Once qdisc dequeue bulk is done, and skb validated, we have a list of
> skb to send to the device.
> 
> We iterate the list and try to send the individual skb.
> 
> As soon as device returns NETDEV_TX_BUSY (or even better, when the queue
> is stopped), we requeue the remaining list.

Yes, that is also my understanding.

The "dangerous" part is the size (in bytes) of the requeued remaining
list.  In the-meantime while the driver/queue have been stopped, a high
priority packet might have been queued in the qdisc.  When we start to
dequeue again, then the requeued list is transmitted *before* dequeuing
the high prio packet sitting in the real qdisc queue. (That is what you
call head-of-line blocking right)


> BQL never tried to find the exact split point :
> 
> If available budget is 25000 bytes, and next TSO packet is 45000 bytes,
> we send it.

Okay, so it is okay to overshoot, which I also think this patch does.


> So the bql validation should be done before the eventual
> validate_xmit_skb() : This way you dont care of GSO or TSO.

If do a skb=q->dequeue(q), and the packet is a GSO packet, then
skb->len should be valid/cover-hole-packet right? (I could use that,
and avoid walking the gso list).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-05  8:28     ` Jesper Dangaard Brouer
@ 2014-09-06 12:55       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2014-09-06 12:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, David S. Miller, Tom Herbert, Hannes Frederic Sowa,
	Florian Westphal, Daniel Borkmann, Jamal Hadi Salim,
	Alexander Duyck, John Fastabend

On Fri, 2014-09-05 at 10:28 +0200, Jesper Dangaard Brouer wrote:

> 
> So, I cannot see if it is a TSO packet via the skb_is_gso(skb) check?
> Will skb->len be the length of the full TSO packet size?

There is no difference between a TSO and GSO packet. They are what we
call a GSO packet.

The difference lies on the device ability to perform TSO.

If device is not able to send a TSO packet, packet is segmented by the
core networking stack.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
  2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
  2014-09-04 13:17   ` Florian Westphal
  2014-09-04 13:29   ` Eric Dumazet
@ 2014-09-04 16:59   ` Tom Herbert
  2 siblings, 0 replies; 11+ messages in thread
From: Tom Herbert @ 2014-09-04 16:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Netdev List, David S. Miller, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Jamal Hadi Salim, Alexander Duyck, John Fastabend

On Thu, Sep 4, 2014 at 5:55 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
> sending/processing an entire skb list.
>
> This patch implements qdisc bulk dequeue, by allowing multiple packets
> to be dequeued in dequeue_skb().
>
> The optimization principle for this is two fold, (1) to amortize
> locking cost and (2) avoid expensive tailptr update for notifying HW.
>  (1) Several packets are dequeued while holding the qdisc root_lock,
> amortizing locking cost over several packet.  The dequeued SKB list is
> processed under the TXQ lock in dev_hard_start_xmit(), thus also
> amortizing the cost of the TXQ lock.
>  (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
> API to delay HW tailptr update, which also reduces the cost per
> packet.
>
> One restriction of the new API is that every SKB must belong to the
> same TXQ.  This patch takes the easy way out, by restricting bulk
> dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
> qdisc only have attached a single TXQ.
>
> Some detail about the flow; dev_hard_start_xmit() will process the skb
> list, and transmit packets individually towards the driver (see
> xmit_one()).  In case the driver stops midway in the list, the
> remaining skb list is returned by dev_hard_start_xmit().  In
> sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
>
> The patch also tries to limit the amount of bytes dequeued, based on
> the drivers BQL limits.  It also tries to avoid and stop dequeuing
> when seeing a GSO packet (both real GSO and segmented GSO skb lists).
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> ---
> V2:
>  - Restruct functions, split out functionality
>  - Use BQL bytelimit to avoid overshooting driver limits, causing
>    too large skb lists to be sitting on the requeue gso_skb.
>
>  net/sched/sch_generic.c |   67 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 files changed, 64 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 19696eb..a0c8070 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -56,6 +56,67 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>         return 0;
>  }
>
> +static inline bool qdisc_may_bulk(const struct Qdisc *qdisc,
> +                                 const struct sk_buff *skb)
> +{
> +       return (qdisc->flags & TCQ_F_ONETXQUEUE);
> +}
> +
> +static inline struct sk_buff *qdisc_dequeue_validate(struct Qdisc *qdisc)
> +{
> +       struct sk_buff *skb = qdisc->dequeue(qdisc);
> +
> +       if (skb != NULL)
> +               skb = validate_xmit_skb(skb, qdisc_dev(qdisc));
> +
> +       return skb;
> +}
> +
> +static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
> +                                                    struct sk_buff *head)
> +{
> +       struct sk_buff *new, *skb = head;
> +       struct netdev_queue *txq = q->dev_queue;
> +       int bytelimit = netdev_tx_avail_queue(txq);
> +       int limit = 5;
> +
> +       if (bytelimit <= 0)
> +               return head;
> +
This should be "if (bytelimit < 0)" since BQL doesn't stop a queue
until available space goes negative. Also, I don't think this check
might not be needed anyway. I believe this function would only be
called when the queue is not stopped, so we can, and probably should,
send at least one packet to avoid the possibility of locking transmit.

> +       do {
> +               if (skb->next || skb_is_gso(skb)) {
> +                       /* Stop processing if the skb is already a skb
> +                        * list (e.g a segmented GSO packet) or a real
> +                        * GSO packet */
> +                       break;
> +               }
> +               new = qdisc_dequeue_validate(q);
> +               if (new) {
> +                       skb->next = new;
> +                       skb = new;
> +                       bytelimit -= skb->len;
> +                       cnt++;
> +                       /* One problem here is it is difficult to
> +                        * requeue the "new" dequeued skb, e.g. in
> +                        * case of GSO, thus a "normal" packet can
> +                        * have a GSO packet on its ->next ptr.
> +                        *
> +                        * Requeue is difficult because if requeuing
> +                        * on q->gso_skb, then a second requeue can
> +                        * happen from sch_direct_xmit e.g. if driver
> +                        * returns NETDEV_TX_BUSY, which would
> +                        * overwrite this requeue.
> +                        */
Maybe we could just add another pointer e.g. hold_skb and use that to
requeue normal packets and continue to use gso_skb for GSO requeue? In
dequeue_skb, check gso_skb first, then hold_skb, and the
qdisc->dequeue.

> +               }
> +       } while (new && --limit && (bytelimit > 0));

Should be "bytelimit >= 0".

> +       skb = head;
> +
> +       return skb;
> +}
> +
> +/* Note that dequeue_skb can possibly return a SKB list (via skb->next).
> + * A requeued skb (via q->gso_skb) can also be a SKB list.
> + */
>  static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
>  {
>         struct sk_buff *skb = q->gso_skb;
> @@ -71,9 +132,9 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
>                         skb = NULL;
>         } else {
>                 if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq)) {
> -                       skb = q->dequeue(q);
> -                       if (skb)
> -                               skb = validate_xmit_skb(skb, qdisc_dev(q));
> +                       skb = qdisc_dequeue_validate(q);
> +                       if (skb && qdisc_may_bulk(q, skb))
> +                               skb = qdisc_bulk_dequeue_skb(q, skb);
>                 }
>         }
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC net-next PATCH V2 3/3] qdisc: debug statements while testing prev-patch
  2014-09-04 12:54 [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
  2014-09-04 12:54 ` [RFC net-next PATCH V2 1/3] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
  2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
@ 2014-09-04 12:56 ` Jesper Dangaard Brouer
  2014-09-04 13:44 ` [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jamal Hadi Salim
  3 siblings, 0 replies; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2014-09-04 12:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Jamal Hadi Salim, Alexander Duyck, John Fastabend

Not-signed-off
---

 net/sched/sch_generic.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a0c8070..8c8ac40 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -47,8 +47,20 @@ EXPORT_SYMBOL(default_qdisc_ops);
 
 static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
+	int bytelimit = netdev_tx_avail_queue(q->dev_queue); //DEBUG
+
 	skb_dst_force(skb);
 	q->gso_skb = skb;
+
+	if (skb->next) // DEBUG
+		net_warn_ratelimited(
+			"%s() dev:%s REQUEUEd SKB list len:%d bql:%d\n",
+			__func__, q->dev_queue->dev->name, bytelimit, skb->len);
+	else if (skb_is_gso(skb)) // DEBUG
+		net_warn_ratelimited(
+			"%s() dev:%s REQUEUEd GSO      len:%d bql:%d\n",
+			__func__, q->dev_queue->dev->name, bytelimit, skb->len);
+
 	q->qstats.requeues++;
 	q->q.qlen++;	/* it's still part of the queue */
 	__netif_schedule(q);
@@ -76,9 +88,11 @@ static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
 						     struct sk_buff *head)
 {
 	struct sk_buff *new, *skb = head;
+//??	struct netdev_queue *txq = skb_get_tx_queue(dev, skb); //which to choose?
 	struct netdev_queue *txq = q->dev_queue;
 	int bytelimit = netdev_tx_avail_queue(txq);
 	int limit = 5;
+	int cnt = 0; //DEBUG
 
 	if (bytelimit <= 0)
 		return head;
@@ -107,10 +121,24 @@ static inline struct sk_buff *qdisc_bulk_dequeue_skb(struct Qdisc *q,
 			 * returns NETDEV_TX_BUSY, which would
 			 * overwrite this requeue.
 			 */
+			if (skb->next) //DEBUG
+				net_warn_ratelimited(
+				"%s() dev:%s pkt-append SKB-list bql:%d cnd:%d\n",
+				__func__, q->dev_queue->dev->name,
+				bytelimit, cnt);
+			else if (skb_is_gso(skb))
+				net_warn_ratelimited(
+				"%s() dev:%s pkt-append real-GSO bql:%d cnd:%d\n",
+				__func__, q->dev_queue->dev->name,
+				bytelimit, cnt);
 		}
 	} while (new && --limit && (bytelimit > 0));
 	skb = head;
 
+	if (cnt > 0) //DEBUG
+		net_warn_ratelimited("%s() dev:%s BULK-active deq:%d bql:%d\n",
+				     __func__, q->dev_queue->dev->name,
+				     cnt, bytelimit);
 	return skb;
 }
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates
  2014-09-04 12:54 [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2014-09-04 12:56 ` [RFC net-next PATCH V2 3/3] qdisc: debug statements while testing prev-patch Jesper Dangaard Brouer
@ 2014-09-04 13:44 ` Jamal Hadi Salim
  3 siblings, 0 replies; 11+ messages in thread
From: Jamal Hadi Salim @ 2014-09-04 13:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller, Tom Herbert,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann
  Cc: Alexander Duyck, John Fastabend

On 09/04/14 08:54, Jesper Dangaard Brouer wrote:
> Wanted people to review this work early... so this is the current
> state, even added my debug patch, if people want to "see" it work.
>

It would be useful to collect some stats when you are
experimenting with this stuff on the average batch size the driver
sees. Maybe a histogram which shows >1, > 3, > 10 etc that a
driver was able to chew on at a time. I realize this may skew
your results a little because you have to chew cycles updating
such a histogram.

I believe that any dumb traffic generator will show lots of packets
being batched. But that is not very interesting.
It would be hard to show more than 1 in the general case for
other types of traffic - but learning what else needs to be tweaked
to achive more batch sizes would be helpful.

i.e this would help quantify whether the batching is valuable in the
general sense...

cheers,
jamal

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-09-06 12:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-04 12:54 [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jesper Dangaard Brouer
2014-09-04 12:54 ` [RFC net-next PATCH V2 1/3] net: Functions to report space available in device TX queues Jesper Dangaard Brouer
2014-09-04 12:55 ` [RFC net-next PATCH V2 2/3] qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE Jesper Dangaard Brouer
2014-09-04 13:17   ` Florian Westphal
2014-09-04 17:09     ` Cong Wang
2014-09-04 13:29   ` Eric Dumazet
2014-09-05  8:28     ` Jesper Dangaard Brouer
2014-09-06 12:55       ` Eric Dumazet
2014-09-04 16:59   ` Tom Herbert
2014-09-04 12:56 ` [RFC net-next PATCH V2 3/3] qdisc: debug statements while testing prev-patch Jesper Dangaard Brouer
2014-09-04 13:44 ` [RFC net-next PATCH V2 0/3] qdisc bulk dequeuing and utilizing delayed tailptr updates Jamal Hadi Salim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).