[RFC PATCH v2 00/17] latest qdisc patch series

* [RFC PATCH v2 00/17] latest qdisc patch series
@ 2017-05-02 15:36 John Fastabend
  2017-05-02 15:36 ` [RFC PATCH v2 01/17] net: sched: cleanup qdisc_run and __qdisc_run semantics John Fastabend
                   ` (17 more replies)
  0 siblings, 18 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

- v2 now hopefully with all patches and uncorrupted -

This is my latest series of patches to remove the locking requirement
from qdisc logic. 

I still have a couple of issues to resolve. The main problem at the
moment is that the pfifo_fast qdisc, running without contention under
mq or mqprio, is actually slower by a few 100k pps in pktgen tests. I
am trying to sort out how to get the performance back.

The main difference in these patches is the recognition that parking
packets on the gso slot and the bad txq case are really edge cases
that can be handled with locked queue operations. This simplifies the
patches and avoids racing with the netif scheduler. If you hit these
paths it means either there is a driver overrun, which should be very
rare with BQL, or a TCP session has migrated cores with outstanding
packets.

Patches 16, 17 were an attempt to resolve the performance degradation
in the uncontended case. The idea is to use the qdisc lock around the
dequeue operations so that we only need to take a single lock for the
entire bulk operation and can consume packets out of skb_array without
a spin_lock.
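
The "lock once per bulk" idea can be sketched in plain userspace C.
This is only a toy model under my own assumptions: the model_ring type
and the ring_* / dequeue_* helpers are hypothetical names, not
skb_array or qdisc API, and the "lock" just counts acquisitions so the
difference between the two strategies is visible.

```c
#include <stddef.h>

#define MODEL_RING_SIZE 8

/* Hypothetical stand-in for a ring of packets guarded by one lock. */
struct model_ring {
	int slots[MODEL_RING_SIZE];
	size_t head;		/* next slot to consume */
	size_t tail;		/* next free slot */
	int lock_acquisitions;	/* how often the "lock" was taken */
};

static void ring_lock(struct model_ring *r) { r->lock_acquisitions++; }
static void ring_unlock(struct model_ring *r) { (void)r; }

static int ring_produce(struct model_ring *r, int v)
{
	if (r->tail - r->head == MODEL_RING_SIZE)
		return 0;	/* ring is full */
	r->slots[r->tail++ % MODEL_RING_SIZE] = v;
	return 1;
}

/* Consume without locking; the caller must already hold the lock. */
static int ring_consume_unlocked(struct model_ring *r, int *out)
{
	if (r->head == r->tail)
		return 0;	/* ring is empty */
	*out = r->slots[r->head++ % MODEL_RING_SIZE];
	return 1;
}

/* One lock round trip per packet. */
static size_t dequeue_lock_each(struct model_ring *r, int *out, size_t budget)
{
	size_t n = 0;

	while (n < budget) {
		int ok;

		ring_lock(r);
		ok = ring_consume_unlocked(r, &out[n]);
		ring_unlock(r);
		if (!ok)
			break;
		n++;
	}
	return n;
}

/* One lock round trip for the whole bulk. */
static size_t dequeue_lock_once(struct model_ring *r, int *out, size_t budget)
{
	size_t n = 0;

	ring_lock(r);
	while (n < budget && ring_consume_unlocked(r, &out[n]))
		n++;
	ring_unlock(r);
	return n;
}
```

With four packets queued, dequeue_lock_each() takes the lock four
times while dequeue_lock_once() takes it once; whether that translates
into pps in the real code is exactly what patches 16 and 17 try to
find out.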

Another potential issue, I think, is if we have multiple packets
on the bad_txq we should bulk out of the bad_txq and not jump to
bulking out of the "normal" dequeue qdisc op.

I've mostly tested with pktgen at this point but have done some basic
netperf tests and both seem to be working.

I am not going to be able to work on this for a few days so I figured
it might be worth getting some feedback if there is any. Any thoughts
on how to squeeze a few extra pps out of this would be very useful. I
would like to avoid a degradation in the micro-benchmark if possible,
and as best I can tell that should be achievable.

Thanks!
John

---

John Fastabend (17):
      net: sched: cleanup qdisc_run and __qdisc_run semantics
      net: sched: allow qdiscs to handle locking
      net: sched: remove remaining uses for qdisc_qlen in xmit path
      net: sched: provide per cpu qstat helpers
      net: sched: a dflt qdisc may be used with per cpu stats
      net: sched: explicit locking in gso_cpu fallback
      net: sched: drop qdisc_reset from dev_graft_qdisc
      net: sched: support skb_bad_tx with lockless qdisc
      net: sched: check for frozen queue before skb_bad_txq check
      net: sched: qdisc_qlen for per cpu logic
      net: sched: helper to sum qlen
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
      net: skb_array: expose peek API
      net: sched: pfifo_fast use skb_array
      net: skb_array additions for unlocked consumer
      net: sched: lock once per bulk dequeue

 include/linux/skb_array.h |   10 +
 include/net/gen_stats.h   |    3 
 include/net/pkt_sched.h   |   10 +
 include/net/sch_generic.h |   82 ++++++++-
 net/core/dev.c            |   31 +++
 net/core/gen_stats.c      |    9 +
 net/sched/sch_api.c       |    3 
 net/sched/sch_generic.c   |  400 +++++++++++++++++++++++++++++++++------------
 net/sched/sch_mq.c        |   25 ++-
 net/sched/sch_mqprio.c    |   61 ++++---
 10 files changed, 470 insertions(+), 164 deletions(-)


* [RFC PATCH v2 01/17] net: sched: cleanup qdisc_run and __qdisc_run semantics
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
@ 2017-05-02 15:36 ` John Fastabend
  2017-05-02 15:36 ` [RFC PATCH v2 02/17] net: sched: allow qdiscs to handle locking John Fastabend
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To make these code paths easier to read and the code simpler, this
patch moves the begin/end calls into qdisc_run().
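
A minimal userspace sketch of the pairing, assuming a plain bool
stands in for the qdisc running state. The toy_* names are
hypothetical and only mirror the shape of qdisc_run(); they are not
the kernel functions.

```c
#include <stdbool.h>

/* Hypothetical model of a qdisc; not the kernel's struct Qdisc. */
struct toy_qdisc {
	bool running;	/* stands in for the "running" state */
	int runs;	/* how many times the run body executed */
};

static bool toy_run_begin(struct toy_qdisc *q)
{
	if (q->running)
		return false;	/* somebody else owns the run */
	q->running = true;
	return true;
}

static void toy_run_end(struct toy_qdisc *q)
{
	q->running = false;
}

/* The run body no longer ends the run itself. */
static void toy_run_body(struct toy_qdisc *q)
{
	q->runs++;
}

/* begin and end are now visibly paired in a single function. */
static void toy_run(struct toy_qdisc *q)
{
	if (toy_run_begin(q)) {
		toy_run_body(q);
		toy_run_end(q);
	}
}
```

A caller that opens the run itself, as __dev_xmit_skb() does, has to
close it itself as well, which is why that function gains explicit
qdisc_run_end() calls in the diff.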

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/net/pkt_sched.h |    4 +++-
 net/core/dev.c          |    5 +++--
 net/sched/sch_generic.c |    2 --
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index bec46f6..b7f5776 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -109,8 +109,10 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	if (qdisc_run_begin(q))
+	if (qdisc_run_begin(q)) {
 		__qdisc_run(q);
+		qdisc_run_end(q);
+	}
 }
 
 int tc_classify(struct sk_buff *skb, const struct tcf_proto *tp,
diff --git a/net/core/dev.c b/net/core/dev.c
index db6e315..3169074 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3099,9 +3099,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				contended = false;
 			}
 			__qdisc_run(q);
-		} else
-			qdisc_run_end(q);
+		}
 
+		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
@@ -3111,6 +3111,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				contended = false;
 			}
 			__qdisc_run(q);
+			qdisc_run_end(q);
 		}
 	}
 	spin_unlock(root_lock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 52a2c55..ef33bf8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -262,8 +262,6 @@ void __qdisc_run(struct Qdisc *q)
 			break;
 		}
 	}
-
-	qdisc_run_end(q);
 }
 
 unsigned long dev_trans_start(struct net_device *dev)


* [RFC PATCH v2 02/17] net: sched: allow qdiscs to handle locking
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
  2017-05-02 15:36 ` [RFC PATCH v2 01/17] net: sched: cleanup qdisc_run and __qdisc_run semantics John Fastabend
@ 2017-05-02 15:36 ` John Fastabend
  2017-05-02 15:37 ` [RFC PATCH v2 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path John Fastabend
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improve
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.
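
The conditional locking pattern used throughout the patch can be
sketched in userspace as follows. This is an illustration under my
own assumptions: flag_qdisc, flag_lock and flag_xmit() are
hypothetical names, and the lock is a counter rather than a spinlock.
Only the 0x100 flag value is taken from the patch.

```c
#include <stddef.h>

#define MODEL_F_NOLOCK 0x100u	/* same bit value as TCQ_F_NOLOCK below */

/* Hypothetical lock that records how often it was taken. */
struct flag_lock {
	int held;
	int acquisitions;
};

struct flag_qdisc {
	unsigned int flags;
	struct flag_lock lock;
	int enqueued;
};

static void flag_lock_acquire(struct flag_lock *l)
{
	l->held = 1;
	l->acquisitions++;
}

static void flag_lock_release(struct flag_lock *l)
{
	l->held = 0;
}

/* root_lock stays NULL for lockless qdiscs, so every lock and unlock
 * site can be guarded with the same "if (root_lock)" test.
 */
static void flag_xmit(struct flag_qdisc *q)
{
	struct flag_lock *root_lock = NULL;

	if (!(q->flags & MODEL_F_NOLOCK)) {
		root_lock = &q->lock;
		flag_lock_acquire(root_lock);
	}

	q->enqueued++;	/* stands in for enqueue plus run */

	if (root_lock)
		flag_lock_release(root_lock);
}
```

Keeping the pointer NULL rather than re-testing the flag at each site
means the flag is read once and helpers such as sch_direct_xmit() only
need to check the pointer they are handed.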

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    1 +
 net/core/dev.c            |   26 ++++++++++++++++++++++----
 net/sched/sch_generic.c   |   30 ++++++++++++++++++++----------
 3 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 22e5209..f5c8613 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -67,6 +67,7 @@ struct Qdisc {
 				      * qdisc_tree_decrease_qlen() should stop.
 				      */
 #define TCQ_F_INVISIBLE		0x80 /* invisible by default in dump */
+#define TCQ_F_NOLOCK		0x100 /* qdisc does not require locking */
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
diff --git a/net/core/dev.c b/net/core/dev.c
index 3169074..6d2db37 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3069,6 +3069,21 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+			__qdisc_drop(skb, &to_free);
+			rc = NET_XMIT_DROP;
+		} else {
+			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+			__qdisc_run(q);
+		}
+
+		if (unlikely(to_free))
+			kfree_skb_list(to_free);
+		return rc;
+	}
+
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
@@ -3896,19 +3911,22 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 
 		while (head) {
 			struct Qdisc *q = head;
-			spinlock_t *root_lock;
+			spinlock_t *root_lock = NULL;
 
 			head = head->next_sched;
 
-			root_lock = qdisc_lock(q);
-			spin_lock(root_lock);
+			if (!(q->flags & TCQ_F_NOLOCK)) {
+				root_lock = qdisc_lock(q);
+				spin_lock(root_lock);
+			}
 			/* We need to make sure head->next_sched is read
 			 * before clearing __QDISC_STATE_SCHED
 			 */
 			smp_mb__before_atomic();
 			clear_bit(__QDISC_STATE_SCHED, &q->state);
 			qdisc_run(q);
-			spin_unlock(root_lock);
+			if (root_lock)
+				spin_unlock(root_lock);
 		}
 	}
 }
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ef33bf8..39653e8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -170,7 +170,8 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	int ret = NETDEV_TX_BUSY;
 
 	/* And release qdisc */
-	spin_unlock(root_lock);
+	if (root_lock)
+		spin_unlock(root_lock);
 
 	/* Note that we validate skb (GSO, checksum, ...) outside of locks */
 	if (validate)
@@ -183,10 +184,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
-		spin_lock(root_lock);
+		if (root_lock)
+			spin_lock(root_lock);
 		return qdisc_qlen(q);
 	}
-	spin_lock(root_lock);
+
+	if (root_lock)
+		spin_lock(root_lock);
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
@@ -227,9 +231,9 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  */
 static inline int qdisc_restart(struct Qdisc *q, int *packets)
 {
+	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
 	struct net_device *dev;
-	spinlock_t *root_lock;
 	struct sk_buff *skb;
 	bool validate;
 
@@ -238,7 +242,9 @@ static inline int qdisc_restart(struct Qdisc *q, int *packets)
 	if (unlikely(!skb))
 		return 0;
 
-	root_lock = qdisc_lock(q);
+	if (!(q->flags & TCQ_F_NOLOCK))
+		root_lock = qdisc_lock(q);
+
 	dev = qdisc_dev(q);
 	txq = skb_get_tx_queue(dev, skb);
 
@@ -875,14 +881,18 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 
 		dev_queue = netdev_get_tx_queue(dev, i);
 		q = dev_queue->qdisc_sleeping;
-		root_lock = qdisc_lock(q);
 
-		spin_lock_bh(root_lock);
+		if (q->flags & TCQ_F_NOLOCK) {
+			val = test_bit(__QDISC_STATE_SCHED, &q->state);
+		} else {
+			root_lock = qdisc_lock(q);
+			spin_lock_bh(root_lock);
 
-		val = (qdisc_is_running(q) ||
-		       test_bit(__QDISC_STATE_SCHED, &q->state));
+			val = (qdisc_is_running(q) ||
+			       test_bit(__QDISC_STATE_SCHED, &q->state));
 
-		spin_unlock_bh(root_lock);
+			spin_unlock_bh(root_lock);
+		}
 
 		if (val)
 			return true;


* [RFC PATCH v2 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
  2017-05-02 15:36 ` [RFC PATCH v2 01/17] net: sched: cleanup qdisc_run and __qdisc_run semantics John Fastabend
  2017-05-02 15:36 ` [RFC PATCH v2 02/17] net: sched: allow qdiscs to handle locking John Fastabend
@ 2017-05-02 15:37 ` John Fastabend
  2017-05-02 15:37 ` [RFC PATCH v2 04/17] net: sched: provide per cpu qstat helpers John Fastabend
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case where sch_direct_xmit would previously have
returned a qlen of zero but now returns true. However, in this case
all call sites of sch_direct_xmit will perform a dequeue(), get a
null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.
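
The trade-off can be seen in a small userspace model. It is a sketch
only: xq_queue and the xq_* helpers are hypothetical names, and the
transmit step is reduced to "always succeeds".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size queue of packet ids. */
struct xq_queue {
	int pkts[4];
	size_t head;		/* next packet to dequeue */
	size_t tail;		/* one past the last queued packet */
	int dequeue_calls;	/* counts every dequeue attempt */
};

/* Returns the next packet, or NULL when the queue is empty. */
static int *xq_dequeue(struct xq_queue *q)
{
	q->dequeue_calls++;
	if (q->head == q->tail)
		return NULL;
	return &q->pkts[q->head++];
}

/* Reports "feel free to send more" without consulting a queue length,
 * even if the packet just sent was the last one.
 */
static bool xq_direct_xmit(int *pkt)
{
	(void)pkt;
	return true;
}

static bool xq_restart(struct xq_queue *q)
{
	int *pkt = xq_dequeue(q);

	if (!pkt)
		return false;	/* the empty queue is detected here */
	return xq_direct_xmit(pkt);
}

/* Keeps restarting until xq_restart() says stop; returns packets sent. */
static int xq_run(struct xq_queue *q)
{
	int sent = 0;

	while (xq_restart(q))
		sent++;
	return sent;
}
```

With two packets queued the loop sends both and then makes one extra
dequeue attempt that returns NULL, which is the cost being traded
against keeping a shared qlen up to date.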

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/pkt_sched.h |    6 +++---
 net/sched/sch_generic.c |   23 ++++++++++++-----------
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b7f5776..69b90a3 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -101,9 +101,9 @@ struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
 void qdisc_put_rtab(struct qdisc_rate_table *tab);
 void qdisc_put_stab(struct qdisc_size_table *tab);
 void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
-int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
-		    struct net_device *dev, struct netdev_queue *txq,
-		    spinlock_t *root_lock, bool validate);
+bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
+		     struct net_device *dev, struct netdev_queue *txq,
+		     spinlock_t *root_lock, bool validate);
 
 void __qdisc_run(struct Qdisc *q);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 39653e8..2f66b1f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -160,12 +160,12 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
  * only one CPU can execute this function.
  *
  * Returns to the caller:
- *				0  - queue is empty or throttled.
- *				>0 - queue is not empty.
+ *				false  - hardware queue frozen backoff
+ *				true   - feel free to send more pkts
  */
-int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
-		    struct net_device *dev, struct netdev_queue *txq,
-		    spinlock_t *root_lock, bool validate)
+bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
+		     struct net_device *dev, struct netdev_queue *txq,
+		     spinlock_t *root_lock, bool validate)
 {
 	int ret = NETDEV_TX_BUSY;
 
@@ -186,7 +186,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	} else {
 		if (root_lock)
 			spin_lock(root_lock);
-		return qdisc_qlen(q);
+		return true;
 	}
 
 	if (root_lock)
@@ -194,18 +194,19 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
-		ret = qdisc_qlen(q);
+		ret = true;
 	} else {
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
 		if (unlikely(ret != NETDEV_TX_BUSY))
 			net_warn_ratelimited("BUG %s code %d qlen %d\n",
 					     dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, q);
+		dev_requeue_skb(skb, q);
+		ret = false;
 	}
 
 	if (ret && netif_xmit_frozen_or_stopped(txq))
-		ret = 0;
+		ret = false;
 
 	return ret;
 }
@@ -229,7 +230,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q, int *packets)
+static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 {
 	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
@@ -240,7 +241,7 @@ static inline int qdisc_restart(struct Qdisc *q, int *packets)
 	/* Dequeue packet */
 	skb = dequeue_skb(q, &validate, packets);
 	if (unlikely(!skb))
-		return 0;
+		return false;
 
 	if (!(q->flags & TCQ_F_NOLOCK))
 		root_lock = qdisc_lock(q);


* [RFC PATCH v2 04/17] net: sched: provide per cpu qstat helpers
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (2 preceding siblings ...)
  2017-05-02 15:37 ` [RFC PATCH v2 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path John Fastabend
@ 2017-05-02 15:37 ` John Fastabend
  2017-05-02 15:37 ` [RFC PATCH v2 05/17] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

The per cpu qstats support was added along with per cpu bstat support,
which is currently used by the ingress qdisc. This patch adds a set of
helpers needed to let other qdiscs use per cpu qstats as well.
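
A rough userspace model of what the helpers do, assuming a fixed
array indexed by CPU number instead of real per cpu storage. The
pcpu_* names and PCPU_NR_CPUS are hypothetical; the kernel helpers
below use this_cpu_inc() and friends instead of an explicit index.

```c
#include <stddef.h>

#define PCPU_NR_CPUS 4	/* hypothetical CPU count for the model */

struct pcpu_qstats {
	unsigned int qlen;
	unsigned int backlog;
};

/* One private counter block per CPU, so updates need no lock. */
struct pcpu_qdisc {
	struct pcpu_qstats cpu_qstats[PCPU_NR_CPUS];
};

static void pcpu_qlen_inc(struct pcpu_qdisc *q, int cpu)
{
	q->cpu_qstats[cpu].qlen++;
}

static void pcpu_qlen_dec(struct pcpu_qdisc *q, int cpu)
{
	q->cpu_qstats[cpu].qlen--;
}

static void pcpu_backlog_inc(struct pcpu_qdisc *q, int cpu, unsigned int len)
{
	q->cpu_qstats[cpu].backlog += len;
}

/* Readers sum every CPU's block. A packet may be enqueued on one CPU
 * and dequeued on another, so a single counter can wrap below zero;
 * unsigned arithmetic keeps the total correct regardless.
 */
static unsigned int pcpu_qlen_sum(const struct pcpu_qdisc *q)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < PCPU_NR_CPUS; i++)
		sum += q->cpu_qstats[i].qlen;
	return sum;
}

static unsigned int pcpu_backlog_sum(const struct pcpu_qdisc *q)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < PCPU_NR_CPUS; i++)
		sum += q->cpu_qstats[i].backlog;
	return sum;
}
```

The update side is cheap, while anything that needs the total has to
walk all CPUs, so the sum is best kept out of the hot path.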

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f5c8613..9531b81 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -590,12 +590,39 @@ static inline void qdisc_qstats_backlog_dec(struct Qdisc *sch,
 	sch->qstats.backlog -= qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_dec(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	this_cpu_sub(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
 static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch,
 					    const struct sk_buff *skb)
 {
 	sch->qstats.backlog += qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_inc(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	this_cpu_add(sch->cpu_qstats->backlog, qdisc_pkt_len(skb));
+}
+
+static inline void qdisc_qstats_cpu_qlen_inc(struct Qdisc *sch)
+{
+	this_cpu_inc(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_qlen_dec(struct Qdisc *sch)
+{
+	this_cpu_dec(sch->cpu_qstats->qlen);
+}
+
+static inline void qdisc_qstats_cpu_requeues_inc(struct Qdisc *sch)
+{
+	this_cpu_inc(sch->cpu_qstats->requeues);
+}
+
 static inline void __qdisc_qstats_drop(struct Qdisc *sch, int count)
 {
 	sch->qstats.drops += count;
@@ -800,6 +827,14 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
 	qdisc_qstats_drop(sch);
 }
 
+static inline int qdisc_drop_cpu(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	__qdisc_drop(skb, to_free);
+	qdisc_qstats_cpu_drop(sch);
+
+	return NET_XMIT_DROP;
+}
 
 static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
 			     struct sk_buff **to_free)


* [RFC PATCH v2 05/17] net: sched: a dflt qdisc may be used with per cpu stats
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (3 preceding siblings ...)
  2017-05-02 15:37 ` [RFC PATCH v2 04/17] net: sched: provide per cpu qstat helpers John Fastabend
@ 2017-05-02 15:37 ` John Fastabend
  2017-05-02 15:38 ` [RFC PATCH v2 06/17] net: sched: explicit locking in gso_cpu fallback John Fastabend
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Enable dflt qdisc support for per cpu stats. Before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.
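
The allocate-everything-up-front idea can be sketched in userspace
like this. It is a simplified model under my own assumptions: the
sf_* names and SF_F_CPUSTATS are hypothetical, and calloc() stands in
for the per cpu allocators used in the diff.

```c
#include <stddef.h>
#include <stdlib.h>

#define SF_F_CPUSTATS 0x20u	/* illustrative flag bit for the model */

struct sf_ops {
	unsigned int static_flags;	/* flags every instance starts with */
};

struct sf_qdisc {
	unsigned int flags;
	const struct sf_ops *ops;
	long *cpu_bstats;	/* stand-in for per cpu basic stats */
	long *cpu_qstats;	/* stand-in for per cpu queue stats */
};

/* Everything the ops' static flags require is allocated here, so a
 * failure unwinds in one place and init never sees a half-built object.
 */
static struct sf_qdisc *sf_alloc(const struct sf_ops *ops, size_t ncpus)
{
	struct sf_qdisc *sch = calloc(1, sizeof(*sch));

	if (!sch)
		return NULL;

	if (ops->static_flags & SF_F_CPUSTATS) {
		sch->cpu_bstats = calloc(ncpus, sizeof(*sch->cpu_bstats));
		if (!sch->cpu_bstats)
			goto err_free;

		sch->cpu_qstats = calloc(ncpus, sizeof(*sch->cpu_qstats));
		if (!sch->cpu_qstats) {
			free(sch->cpu_bstats);
			goto err_free;
		}
	}

	sch->ops = ops;
	sch->flags = ops->static_flags;
	return sch;

err_free:
	free(sch);
	return NULL;
}

static void sf_free(struct sf_qdisc *sch)
{
	if (!sch)
		return;
	free(sch->cpu_bstats);
	free(sch->cpu_qstats);
	free(sch);
}
```

A qdisc type that wants per cpu stats then only has to set the bit in
its ops; one that does not leaves static_flags at zero and gets the
global counters as before.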

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    1 +
 net/sched/sch_generic.c   |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9531b81..83952af 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -172,6 +172,7 @@ struct Qdisc_ops {
 	const struct Qdisc_class_ops	*cl_ops;
 	char			id[IFNAMSIZ];
 	int			priv_size;
+	unsigned int		static_flags;
 
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2f66b1f..8a03738 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -625,6 +625,19 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	qdisc_skb_head_init(&sch->q);
 	spin_lock_init(&sch->q.lock);
 
+	if (ops->static_flags & TCQ_F_CPUSTATS) {
+		sch->cpu_bstats =
+			netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
+		if (!sch->cpu_bstats)
+			goto errout1;
+
+		sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+		if (!sch->cpu_qstats) {
+			free_percpu(sch->cpu_bstats);
+			goto errout1;
+		}
+	}
+
 	spin_lock_init(&sch->busylock);
 	lockdep_set_class(&sch->busylock,
 			  dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
@@ -634,6 +647,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  dev->qdisc_running_key ?: &qdisc_running_key);
 
 	sch->ops = ops;
+	sch->flags = ops->static_flags;
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;
 	sch->dev_queue = dev_queue;
@@ -641,6 +655,8 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	atomic_set(&sch->refcnt, 1);
 
 	return sch;
+errout1:
+	kfree(p);
 errout:
 	return ERR_PTR(err);
 }


* [RFC PATCH v2 06/17] net: sched: explicit locking in gso_cpu fallback
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (4 preceding siblings ...)
  2017-05-02 15:37 ` [RFC PATCH v2 05/17] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
@ 2017-05-02 15:38 ` John Fastabend
  2017-05-02 15:38 ` [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc John Fastabend
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

This work prepares the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless and we overrun
the netdev, for whatever reason, the netdev returns a busy error code
and the skb is parked on the gso_skb pointer. With many cores all
hitting this case at once it is possible to have multiple sk_buffs
here, so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.
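
The requeue path can be modelled in userspace roughly as below. This
is a sketch under stated assumptions, not the kernel code: rq_skb,
rq_queue and the rq_* helpers are hypothetical names, a C11
atomic_flag spin loop stands in for the qdisc lock, and an atomic
counter stands in for the unlocked skb_queue_empty() test.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical packet and requeue list. */
struct rq_skb {
	struct rq_skb *next;
	int id;
};

struct rq_queue {
	atomic_flag lock;	/* toy spinlock */
	atomic_size_t count;	/* readable without the lock */
	struct rq_skb *head;
	struct rq_skb *tail;
};

static void rq_init(struct rq_queue *rq)
{
	atomic_flag_clear(&rq->lock);
	atomic_init(&rq->count, 0);
	rq->head = NULL;
	rq->tail = NULL;
}

static void rq_lock(struct rq_queue *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void rq_unlock(struct rq_queue *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Park a packet the driver refused; several CPUs may do this at once. */
static void rq_requeue_tail(struct rq_queue *rq, struct rq_skb *skb)
{
	rq_lock(rq);
	skb->next = NULL;
	if (rq->tail)
		rq->tail->next = skb;
	else
		rq->head = skb;
	rq->tail = skb;
	atomic_fetch_add_explicit(&rq->count, 1, memory_order_relaxed);
	rq_unlock(rq);
}

/* Cheap unlocked emptiness test first, then re-check under the lock
 * because another CPU may have drained the list in between.
 */
static struct rq_skb *rq_pop(struct rq_queue *rq)
{
	struct rq_skb *skb;

	if (atomic_load_explicit(&rq->count, memory_order_relaxed) == 0)
		return NULL;

	rq_lock(rq);
	skb = rq->head;
	if (skb) {
		rq->head = skb->next;
		if (!rq->head)
			rq->tail = NULL;
		skb->next = NULL;
		atomic_fetch_sub_explicit(&rq->count, 1,
					  memory_order_relaxed);
	}
	rq_unlock(rq);
	return skb;
}
```

The common case pays only for the unlocked check; the lock is taken
on the rare path where something was actually parked.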

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   20 +++++++----
 net/sched/sch_generic.c   |   81 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 83952af..3021bbb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -85,7 +85,7 @@ struct Qdisc {
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
-	struct sk_buff		*gso_skb ____cacheline_aligned_in_smp;
+	struct sk_buff_head	gso_skb ____cacheline_aligned_in_smp;
 	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_packed bstats;
 	seqcount_t		running;
@@ -754,26 +754,30 @@ static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
 /* generic pseudo peek method for non-work-conserving qdisc */
 static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
 {
+	struct sk_buff *skb = skb_peek(&sch->gso_skb);
+
 	/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
-	if (!sch->gso_skb) {
-		sch->gso_skb = sch->dequeue(sch);
-		if (sch->gso_skb) {
+	if (!skb) {
+		skb = sch->dequeue(sch);
+
+		if (skb) {
+			__skb_queue_head(&sch->gso_skb, skb);
 			/* it's still part of the queue */
-			qdisc_qstats_backlog_inc(sch, sch->gso_skb);
+			qdisc_qstats_backlog_inc(sch, skb);
 			sch->q.qlen++;
 		}
 	}
 
-	return sch->gso_skb;
+	return skb;
 }
 
 /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
 static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 {
-	struct sk_buff *skb = sch->gso_skb;
+	struct sk_buff *skb = skb_peek(&sch->gso_skb);
 
 	if (skb) {
-		sch->gso_skb = NULL;
+		skb = __skb_dequeue(&sch->gso_skb);
 		qdisc_qstats_backlog_dec(sch, skb);
 		sch->q.qlen--;
 	} else {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 8a03738..f179fc9 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,10 +44,9 @@
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
-
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
-	q->gso_skb = skb;
+	__skb_queue_head(&q->gso_skb, skb);
 	q->qstats.requeues++;
 	qdisc_qstats_backlog_inc(q, skb);
 	q->q.qlen++;	/* it's still part of the queue */
@@ -56,6 +55,30 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	return 0;
 }
 
+static inline int dev_requeue_skb_locked(struct sk_buff *skb, struct Qdisc *q)
+{
+	spinlock_t *lock = qdisc_lock(q);
+
+	spin_lock(lock);
+	__skb_queue_tail(&q->gso_skb, skb);
+	spin_unlock(lock);
+
+	qdisc_qstats_cpu_requeues_inc(q);
+	qdisc_qstats_cpu_backlog_inc(q, skb);
+	qdisc_qstats_cpu_qlen_inc(q);
+	__netif_schedule(q);
+
+	return 0;
+}
+
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+	if (q->flags & TCQ_F_NOLOCK)
+		return dev_requeue_skb_locked(skb, q);
+	else
+		return __dev_requeue_skb(skb, q);
+}
+
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
 				 const struct netdev_queue *txq,
@@ -111,23 +134,50 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 				   int *packets)
 {
-	struct sk_buff *skb = q->gso_skb;
 	const struct netdev_queue *txq = q->dev_queue;
+	struct sk_buff *skb;
 
 	*packets = 1;
-	if (unlikely(skb)) {
+	if (unlikely(!skb_queue_empty(&q->gso_skb))) {
+		spinlock_t *lock = NULL;
+
+		if (q->flags & TCQ_F_NOLOCK) {
+			lock = qdisc_lock(q);
+			spin_lock(lock);
+		}
+
+		skb = skb_peek(&q->gso_skb);
+
+		/* skb may be null if another cpu pulls gso_skb off in between
+		 * empty check and lock.
+		 */
+		if (!skb) {
+			if (lock)
+				spin_unlock(lock);
+			goto validate;
+		}
+
 		/* skb in gso_skb were already validated */
 		*validate = false;
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
-			q->gso_skb = NULL;
-			qdisc_qstats_backlog_dec(q, skb);
-			q->q.qlen--;
+			skb = __skb_dequeue(&q->gso_skb);
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_dec(q, skb);
+				qdisc_qstats_cpu_qlen_dec(q);
+			} else {
+				qdisc_qstats_backlog_dec(q, skb);
+				q->q.qlen--;
+			}
 		} else
 			skb = NULL;
+
+		if (lock)
+			spin_unlock(lock);
 		return skb;
 	}
+validate:
 	*validate = true;
 	skb = q->skb_bad_txq;
 	if (unlikely(skb)) {
@@ -622,6 +672,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 		sch = (struct Qdisc *) QDISC_ALIGN((unsigned long) p);
 		sch->padded = (char *) sch - (char *) p;
 	}
+	__skb_queue_head_init(&sch->gso_skb);
 	qdisc_skb_head_init(&sch->q);
 	spin_lock_init(&sch->q.lock);
 
@@ -690,6 +741,7 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 void qdisc_reset(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
+	struct sk_buff *skb, *tmp;
 
 	if (ops->reset)
 		ops->reset(qdisc);
@@ -697,10 +749,9 @@ void qdisc_reset(struct Qdisc *qdisc)
 	kfree_skb(qdisc->skb_bad_txq);
 	qdisc->skb_bad_txq = NULL;
 
-	if (qdisc->gso_skb) {
-		kfree_skb_list(qdisc->gso_skb);
-		qdisc->gso_skb = NULL;
-	}
+	skb_queue_walk_safe(&qdisc->gso_skb, skb, tmp)
+		kfree_skb_list(skb);
+
 	qdisc->q.qlen = 0;
 }
 EXPORT_SYMBOL(qdisc_reset);
@@ -720,6 +771,7 @@ static void qdisc_rcu_free(struct rcu_head *head)
 void qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
+	struct sk_buff *skb, *tmp;
 
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 	    !atomic_dec_and_test(&qdisc->refcnt))
@@ -739,7 +791,9 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	module_put(ops->owner);
 	dev_put(qdisc_dev(qdisc));
 
-	kfree_skb_list(qdisc->gso_skb);
+	skb_queue_walk_safe(&qdisc->gso_skb, skb, tmp)
+		kfree_skb_list(skb);
+
 	kfree_skb(qdisc->skb_bad_txq);
 	/*
 	 * gen_estimator est_timer() might access qdisc->q.lock,
@@ -971,6 +1025,7 @@ static void dev_init_scheduler_queue(struct net_device *dev,
 
 	rcu_assign_pointer(dev_queue->qdisc, qdisc);
 	dev_queue->qdisc_sleeping = qdisc;
+	__skb_queue_head_init(&qdisc->gso_skb);
 }
 
 void dev_init_scheduler(struct net_device *dev)


* [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (5 preceding siblings ...)
  2017-05-02 15:38 ` [RFC PATCH v2 06/17] net: sched: explicit locking in gso_cpu fallback John Fastabend
@ 2017-05-02 15:38 ` John Fastabend
  2017-05-02 20:00   ` Jesper Dangaard Brouer
  2017-05-02 15:38 ` [RFC PATCH v2 08/17] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy'
operation is called on the old qdisc. The destroy operation will wait
an rcu grace period and call qdisc_rcu_free(), at which point
gso_cpu_skb is freed along with all stats, so there is no need to zero
stats and gso_cpu_skb from the graft operation itself.

Further, after dropping the qdisc locks we can no longer call
qdisc_reset before waiting an rcu grace period, so that the qdisc is
detached from all cpus. By removing the qdisc_reset() here we get
the correct property of waiting an rcu grace period and letting the
qdisc_destroy operation clean up the qdisc correctly.

Note, a refcnt greater than 1 would cause the destroy operation to
be aborted. However, if this ever happened the reference to the qdisc
would be lost and we would have a memory leak.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f179fc9..37cd54e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -813,10 +813,6 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 	root_lock = qdisc_lock(oqdisc);
 	spin_lock_bh(root_lock);
 
-	/* Prune old scheduler */
-	if (oqdisc && atomic_read(&oqdisc->refcnt) <= 1)
-		qdisc_reset(oqdisc);
-
 	/* ... and graft new one */
 	if (qdisc == NULL)
 		qdisc = &noop_qdisc;
@@ -971,6 +967,16 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 	return false;
 }
 
+static void dev_qdisc_reset(struct net_device *dev,
+			    struct netdev_queue *dev_queue,
+			    void *none)
+{
+	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+
+	if (qdisc)
+		qdisc_reset(qdisc);
+}
+
 /**
  * 	dev_deactivate_many - deactivate transmissions on several devices
  * 	@head: list of devices to deactivate
@@ -981,7 +987,6 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 void dev_deactivate_many(struct list_head *head)
 {
 	struct net_device *dev;
-	bool sync_needed = false;
 
 	list_for_each_entry(dev, head, close_list) {
 		netdev_for_each_tx_queue(dev, dev_deactivate_queue,
@@ -991,20 +996,25 @@ void dev_deactivate_many(struct list_head *head)
 					     &noop_qdisc);
 
 		dev_watchdog_down(dev);
-		sync_needed |= !dev->dismantle;
 	}
 
 	/* Wait for outstanding qdisc-less dev_queue_xmit calls.
 	 * This is avoided if all devices are in dismantle phase :
 	 * Caller will call synchronize_net() for us
 	 */
-	if (sync_needed)
-		synchronize_net();
+	synchronize_net();
 
 	/* Wait for outstanding qdisc_run calls. */
-	list_for_each_entry(dev, head, close_list)
+	list_for_each_entry(dev, head, close_list) {
 		while (some_qdisc_is_busy(dev))
 			yield();
+		/* The new qdisc is assigned at this point so we can safely
+		 * unwind stale skb lists and qdisc statistics
+		 */
+		netdev_for_each_tx_queue(dev, dev_qdisc_reset, NULL);
+		if (dev_ingress_queue(dev))
+			dev_qdisc_reset(dev, dev_ingress_queue(dev), NULL);
+	}
 }
 
 void dev_deactivate(struct net_device *dev)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc
  2017-05-02 15:38 ` [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc John Fastabend
@ 2017-05-02 20:00   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-02 20:00 UTC (permalink / raw)
  To: John Fastabend; +Cc: brouer, eric.dumazet, netdev

On Tue, 02 May 2017 08:38:19 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> @@ -991,20 +996,25 @@ void dev_deactivate_many(struct list_head *head)
>  					     &noop_qdisc);
>  
>  		dev_watchdog_down(dev);
> -		sync_needed |= !dev->dismantle;
>  	}
>  
>  	/* Wait for outstanding qdisc-less dev_queue_xmit calls.
>  	 * This is avoided if all devices are in dismantle phase :
>  	 * Caller will call synchronize_net() for us
>  	 */

Is the comment still correct after this change?

> -	if (sync_needed)
> -		synchronize_net();
> +	synchronize_net();

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 08/17] net: sched: support skb_bad_tx with lockless qdisc
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (6 preceding siblings ...)
  2017-05-02 15:38 ` [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc John Fastabend
@ 2017-05-02 15:38 ` John Fastabend
  2017-05-02 15:39 ` [RFC PATCH v2 09/17] net: sched: check for frozen queue before skb_bad_txq check John Fastabend
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Similar to how gso is handled, skb_bad_txq needs to become a queue
guarded by the qdisc lock to handle a lockless qdisc with multiple
writers/producers.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    2 -
 net/sched/sch_generic.c   |  100 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3021bbb..4288222 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -92,7 +92,7 @@ struct Qdisc {
 	struct gnet_stats_queue	qstats;
 	unsigned long		state;
 	struct Qdisc            *next_sched;
-	struct sk_buff		*skb_bad_txq;
+	struct sk_buff_head	skb_bad_txq;
 	struct rcu_head		rcu_head;
 	int			padded;
 	atomic_t		refcnt;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37cd54e..d4194c2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,6 +44,68 @@
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
+
+static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
+{
+	const struct netdev_queue *txq = q->dev_queue;
+	spinlock_t *lock = NULL;
+	struct sk_buff *skb;
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		lock = qdisc_lock(q);
+		spin_lock(lock);
+	}
+
+	skb = skb_peek(&q->skb_bad_txq);
+	if (skb) {
+		/* check the reason of requeuing without tx lock first */
+		txq = skb_get_tx_queue(txq->dev, skb);
+		if (!netif_xmit_frozen_or_stopped(txq)) {
+			skb = __skb_dequeue(&q->skb_bad_txq);
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_dec(q, skb);
+				qdisc_qstats_cpu_qlen_dec(q);
+			} else {
+				qdisc_qstats_backlog_dec(q, skb);
+				q->q.qlen--;
+			}
+		} else {
+			skb = NULL;
+		}
+	}
+
+	if (lock)
+		spin_unlock(lock);
+
+	return skb;
+}
+
+static inline struct sk_buff *qdisc_dequeue_skb_bad_txq(struct Qdisc *q)
+{
+	struct sk_buff *skb = skb_peek(&q->skb_bad_txq);
+
+	if (unlikely(skb))
+		skb = __skb_dequeue_bad_txq(q);
+
+	return skb;
+}
+
+static inline void qdisc_enqueue_skb_bad_txq(struct Qdisc *q,
+					     struct sk_buff *skb)
+{
+	spinlock_t *lock = NULL;
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		lock = qdisc_lock(q);
+		spin_lock(lock);
+	}
+
+	__skb_queue_tail(&q->skb_bad_txq, skb);
+
+	if (lock)
+		spin_unlock(lock);
+}
+
 static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
 	__skb_queue_head(&q->gso_skb, skb);
@@ -116,9 +178,15 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 		if (!nskb)
 			break;
 		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
-			q->skb_bad_txq = nskb;
-			qdisc_qstats_backlog_inc(q, nskb);
-			q->q.qlen++;
+			qdisc_enqueue_skb_bad_txq(q, nskb);
+
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_inc(q, nskb);
+				qdisc_qstats_cpu_qlen_inc(q);
+			} else {
+				qdisc_qstats_backlog_inc(q, nskb);
+				q->q.qlen++;
+			}
 			break;
 		}
 		skb->next = nskb;
@@ -179,18 +247,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	}
 validate:
 	*validate = true;
-	skb = q->skb_bad_txq;
-	if (unlikely(skb)) {
-		/* check the reason of requeuing without tx lock first */
-		txq = skb_get_tx_queue(txq->dev, skb);
-		if (!netif_xmit_frozen_or_stopped(txq)) {
-			q->skb_bad_txq = NULL;
-			qdisc_qstats_backlog_dec(q, skb);
-			q->q.qlen--;
-			goto bulk;
-		}
-		return NULL;
-	}
+	skb = qdisc_dequeue_skb_bad_txq(q);
+	if (unlikely(skb))
+		goto bulk;
 	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
 	    !netif_xmit_frozen_or_stopped(txq))
 		skb = q->dequeue(q);
@@ -673,6 +732,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 		sch->padded = (char *) sch - (char *) p;
 	}
 	__skb_queue_head_init(&sch->gso_skb);
+	__skb_queue_head_init(&sch->skb_bad_txq);
 	qdisc_skb_head_init(&sch->q);
 	spin_lock_init(&sch->q.lock);
 
@@ -746,12 +806,12 @@ void qdisc_reset(struct Qdisc *qdisc)
 	if (ops->reset)
 		ops->reset(qdisc);
 
-	kfree_skb(qdisc->skb_bad_txq);
-	qdisc->skb_bad_txq = NULL;
-
 	skb_queue_walk_safe(&qdisc->gso_skb, skb, tmp)
 		kfree_skb_list(skb);
 
+	skb_queue_walk_safe(&qdisc->skb_bad_txq, skb, tmp)
+		kfree_skb_list(skb);
+
 	qdisc->q.qlen = 0;
 }
 EXPORT_SYMBOL(qdisc_reset);
@@ -793,8 +853,9 @@ void qdisc_destroy(struct Qdisc *qdisc)
 
 	skb_queue_walk_safe(&qdisc->gso_skb, skb, tmp)
 		kfree_skb_list(skb);
+	skb_queue_walk_safe(&qdisc->skb_bad_txq, skb, tmp)
+		kfree_skb_list(skb);
 
-	kfree_skb(qdisc->skb_bad_txq);
 	/*
 	 * gen_estimator est_timer() might access qdisc->q.lock,
 	 * wait a RCU grace period before freeing qdisc.
@@ -1036,6 +1097,7 @@ static void dev_init_scheduler_queue(struct net_device *dev,
 	rcu_assign_pointer(dev_queue->qdisc, qdisc);
 	dev_queue->qdisc_sleeping = qdisc;
 	__skb_queue_head_init(&qdisc->gso_skb);
+	__skb_queue_head_init(&qdisc->skb_bad_txq);
 }
 
 void dev_init_scheduler(struct net_device *dev)
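
To make the locking pattern in the bad_txq helpers above easier to follow,
here is a minimal user-space sketch. It is not kernel code: q_sketch,
pkt_sketch, F_NOLOCK_SKETCH and the spin lock built on a C11 atomic_flag
are all hypothetical stand-ins. The sketch assumes, as the patch appears
to, that a qdisc without the NOLOCK flag is already serialized by its
caller, so the helper only takes the lock itself in the lockless case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define F_NOLOCK_SKETCH 0x1	/* hypothetical stand-in for TCQ_F_NOLOCK */

struct pkt_sketch {
	struct pkt_sketch *next;
	int txq_stopped;	/* stand-in for netif_xmit_frozen_or_stopped() */
};

struct q_sketch {
	unsigned int flags;
	atomic_flag lock;	/* stand-in for the qdisc spinlock */
	struct pkt_sketch *head, *tail;
};

static void q_lock(struct q_sketch *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void q_unlock(struct q_sketch *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Park a packet at the tail; lock only if the qdisc itself is lockless. */
static void bad_txq_enqueue(struct q_sketch *q, struct pkt_sketch *p)
{
	int locked = q->flags & F_NOLOCK_SKETCH;

	if (locked)
		q_lock(q);
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	if (locked)
		q_unlock(q);
}

/* Return the head packet only if its tx queue can accept it again. */
static struct pkt_sketch *bad_txq_dequeue(struct q_sketch *q)
{
	struct pkt_sketch *p;
	int locked;

	/* Unlocked peek: the list is expected to be empty almost always.
	 * (A real concurrent version would need an atomic/READ_ONCE load.)
	 */
	if (!q->head)
		return NULL;

	locked = q->flags & F_NOLOCK_SKETCH;
	if (locked)
		q_lock(q);
	p = q->head;		/* re-read under the lock */
	if (p && !p->txq_stopped) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;
		p->next = NULL;
	} else {
		p = NULL;	/* leave it parked until the txq restarts */
	}
	if (locked)
		q_unlock(q);
	return p;
}
```

The point of the shape is that the common case (nothing parked) costs one
pointer read and no lock, while the rare parked packet goes through an
ordinary locked list.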

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 09/17] net: sched: check for frozen queue before skb_bad_txq check
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (7 preceding siblings ...)
  2017-05-02 15:38 ` [RFC PATCH v2 08/17] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
@ 2017-05-02 15:39 ` John Fastabend
  2017-05-02 15:39 ` [RFC PATCH v2 10/17] net: sched: qdisc_qlen for per cpu logic John Fastabend
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

I cannot think of any good reason to pull the bad txq skb off the
qdisc if the txq we plan to send it on is still frozen. So check
for a frozen queue first and abort before dequeuing either the
skb_bad_txq skb or the normal qdisc dequeue() skb.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d4194c2..db5f7a0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -203,7 +203,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 				   int *packets)
 {
 	const struct netdev_queue *txq = q->dev_queue;
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 
 	*packets = 1;
 	if (unlikely(!skb_queue_empty(&q->gso_skb))) {
@@ -247,12 +247,15 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	}
 validate:
 	*validate = true;
+
+	if ((q->flags & TCQ_F_ONETXQUEUE) &&
+	    netif_xmit_frozen_or_stopped(txq))
+		return skb;
+
 	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb))
 		goto bulk;
-	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
-	    !netif_xmit_frozen_or_stopped(txq))
-		skb = q->dequeue(q);
+	skb = q->dequeue(q);
 	if (skb) {
 bulk:
 		if (qdisc_may_bulk(q))
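
The reordering in this patch can be illustrated with a small user-space
model. Everything here is hypothetical (sched_sketch, the counters, and
F_ONETXQUEUE_SKETCH standing in for TCQ_F_ONETXQUEUE); it only shows the
order of the checks, not the real dequeue path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define F_ONETXQUEUE_SKETCH 0x2	/* hypothetical stand-in for TCQ_F_ONETXQUEUE */

struct sched_sketch {
	unsigned int flags;
	bool txq_stopped;	/* stand-in for netif_xmit_frozen_or_stopped(txq) */
	int bad_txq_pending;	/* packets parked on the bad-txq list */
	int normal_pending;	/* packets in the regular queue */
};

/* Returns 1 for a bad-txq packet, 2 for a regular one, 0 for nothing. */
static int dequeue_sketch(struct sched_sketch *s)
{
	/* Check the frozen/stopped state first so nothing is pulled off
	 * either list when it could not be transmitted anyway.
	 */
	if ((s->flags & F_ONETXQUEUE_SKETCH) && s->txq_stopped)
		return 0;

	if (s->bad_txq_pending) {
		s->bad_txq_pending--;
		return 1;
	}
	if (s->normal_pending) {
		s->normal_pending--;
		return 2;
	}
	return 0;
}
```

With the early return in place, a stopped queue leaves both counters
untouched, which is the property the commit message argues for.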

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 10/17] net: sched: qdisc_qlen for per cpu logic
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (8 preceding siblings ...)
  2017-05-02 15:39 ` [RFC PATCH v2 09/17] net: sched: check for frozen queue before skb_bad_txq check John Fastabend
@ 2017-05-02 15:39 ` John Fastabend
  2017-05-02 15:39 ` [RFC PATCH v2 11/17] net: sched: helper to sum qlen John Fastabend
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Add qdisc qlen helper routines for lockless qdiscs to use.

The qdisc qlen is no longer used in the hotpath but it is reported
via stats query on the qdisc so it still needs to be tracked. This
adds the per cpu operations needed.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4288222..4c446fe 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -257,8 +257,16 @@ static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 	BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }
 
+static inline int qdisc_qlen_cpu(const struct Qdisc *q)
+{
+	return this_cpu_ptr(q->cpu_qstats)->qlen;
+}
+
 static inline int qdisc_qlen(const struct Qdisc *q)
 {
+	if (q->flags & TCQ_F_NOLOCK)
+		return qdisc_qlen_cpu(q);
+
 	return q->q.qlen;
 }
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 11/17] net: sched: helper to sum qlen
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (9 preceding siblings ...)
  2017-05-02 15:39 ` [RFC PATCH v2 10/17] net: sched: qdisc_qlen for per cpu logic John Fastabend
@ 2017-05-02 15:39 ` John Fastabend
  2017-05-02 15:39 ` [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Reporting qlen when qlen is per cpu requires aggregating the per
cpu counters. This adds a helper routine for this.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   15 +++++++++++++++
 net/sched/sch_api.c       |    3 ++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4c446fe..5ae07dd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -270,6 +270,21 @@ static inline int qdisc_qlen(const struct Qdisc *q)
 	return q->q.qlen;
 }
 
+static inline int qdisc_qlen_sum(const struct Qdisc *q)
+{
+	__u32 qlen = 0;
+	int i;
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		for_each_possible_cpu(i)
+			qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
+	} else {
+		qlen = q->q.qlen;
+	}
+
+	return qlen;
+}
+
 static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
 	return (struct qdisc_skb_cb *)skb->cb;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index bbe57d5..1bb9c89 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1378,7 +1378,8 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 		goto nla_put_failure;
 	if (q->ops->dump && q->ops->dump(q, skb) < 0)
 		goto nla_put_failure;
-	qlen = q->q.qlen;
+
+	qlen = qdisc_qlen_sum(q);
 
 	stab = rtnl_dereference(q->stab);
 	if (stab && qdisc_dump_stab(skb, stab) < 0)
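
A short user-space sketch may help show why the reported qlen has to be
summed over all cpus rather than read from one of them. The names below
(qstats_sketch, NR_CPUS_SKETCH, the *_on helpers) are hypothetical; the
only thing borrowed from the patch is the shape of qdisc_qlen_sum() and
the assumption that the counter is an unsigned 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical cpu count for this sketch */

/* Stand-in for the per cpu qlen counter. */
struct qstats_sketch {
	uint32_t qlen;
};

static struct qstats_sketch cpu_qstats_sketch[NR_CPUS_SKETCH];

/* An enqueue touches only the counter of the cpu it runs on. */
static void qlen_inc_on(int cpu)
{
	cpu_qstats_sketch[cpu].qlen++;
}

/* A dequeue may run on a different cpu than the matching enqueue. */
static void qlen_dec_on(int cpu)
{
	cpu_qstats_sketch[cpu].qlen--;	/* may wrap below zero on this cpu */
}

/* Same shape as qdisc_qlen_sum(): accumulate modulo 2^32 over all cpus. */
static int qlen_sum_sketch(void)
{
	uint32_t qlen = 0;
	int i;

	for (i = 0; i < NR_CPUS_SKETCH; i++)
		qlen += cpu_qstats_sketch[i].qlen;

	return (int)qlen;
}
```

If two packets are enqueued on cpu 0 and one is dequeued on cpu 1, the
counter on cpu 1 wraps to UINT32_MAX on its own, yet the sum still comes
out as 1 because the additions wrap consistently.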

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (10 preceding siblings ...)
  2017-05-02 15:39 ` [RFC PATCH v2 11/17] net: sched: helper to sum qlen John Fastabend
@ 2017-05-02 15:39 ` John Fastabend
  2017-06-19  9:21   ` huaixin chang
  2017-05-02 15:40 ` [RFC PATCH v2 13/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio John Fastabend
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

The sch_mq qdisc creates a sub-qdisc per tx queue, each of which is
then called independently for enqueue and dequeue operations. However,
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case, add a check when calculating
stats and aggregate the per cpu stats if needed.

Also export __gnet_stats_copy_queue() to use as a helper function.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/gen_stats.h |    3 +++
 net/core/gen_stats.c    |    9 +++++----
 net/sched/sch_mq.c      |   25 ++++++++++++++++++-------
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 8b7aa37..2bf0ec2 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -48,6 +48,9 @@ int gnet_stats_copy_rate_est(struct gnet_dump *d,
 int gnet_stats_copy_queue(struct gnet_dump *d,
 			  struct gnet_stats_queue __percpu *cpu_q,
 			  struct gnet_stats_queue *q, __u32 qlen);
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu_q,
+			     const struct gnet_stats_queue *q, __u32 qlen);
 int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len);
 
 int gnet_stats_finish_copy(struct gnet_dump *d);
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 87f2855..b2b2323b 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -252,10 +252,10 @@
 	}
 }
 
-static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
-				    const struct gnet_stats_queue __percpu *cpu,
-				    const struct gnet_stats_queue *q,
-				    __u32 qlen)
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu,
+			     const struct gnet_stats_queue *q,
+			     __u32 qlen)
 {
 	if (cpu) {
 		__gnet_stats_copy_queue_cpu(qstats, cpu);
@@ -269,6 +269,7 @@ static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
 
 	qstats->qlen = qlen;
 }
+EXPORT_SYMBOL(__gnet_stats_copy_queue);
 
 /**
  * gnet_stats_copy_queue - copy queue statistics into statistics TLV
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index cadfdd4..7871db4 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/sch_generic.h>
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
@@ -103,15 +104,25 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
 	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+		struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+		__u32 qlen = 0;
+
 		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
-		sch->q.qlen		+= qdisc->q.qlen;
-		sch->bstats.bytes	+= qdisc->bstats.bytes;
-		sch->bstats.packets	+= qdisc->bstats.packets;
-		sch->qstats.backlog	+= qdisc->qstats.backlog;
-		sch->qstats.drops	+= qdisc->qstats.drops;
-		sch->qstats.requeues	+= qdisc->qstats.requeues;
-		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+
+		if (qdisc_is_percpu_stats(qdisc)) {
+			cpu_bstats = qdisc->cpu_bstats;
+			cpu_qstats = qdisc->cpu_qstats;
+		}
+
+		qlen = qdisc_qlen_sum(qdisc);
+
+		__gnet_stats_copy_basic(NULL, &sch->bstats,
+					cpu_bstats, &qdisc->bstats);
+		__gnet_stats_copy_queue(&sch->qstats,
+					cpu_qstats, &qdisc->qstats, qlen);
+
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
 	return 0;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
  2017-05-02 15:39 ` [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
@ 2017-06-19  9:21   ` huaixin chang
  0 siblings, 0 replies; 22+ messages in thread
From: huaixin chang @ 2017-06-19  9:21 UTC (permalink / raw)
  To: John Fastabend; +Cc: huaixin chang, eric.dumazet, netdev


> On May 2, 2017, at 11:39 PM, John Fastabend <john.fastabend@gmail.com> wrote:
> 
> The sch_mq qdisc creates a sub-qdisc per tx queue which are then
> called independently for enqueue and dequeue operations. However
> statistics are aggregated and pushed up to the "master" qdisc.
> 
> This patch adds support for any of the sub-qdiscs to be per cpu
> statistic qdiscs. To handle this case add a check when calculating
> stats and aggregate the per cpu stats if needed.
> 
> Also exports __gnet_stats_copy_queue() to use as a helper function.

Maybe mq_dump_class_stats() should handle per cpu statistic sub-qdiscs too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 13/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (11 preceding siblings ...)
  2017-05-02 15:39 ` [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
@ 2017-05-02 15:40 ` John Fastabend
  2017-05-02 15:40 ` [RFC PATCH v2 14/17] net: skb_array: expose peek API John Fastabend
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

The sch_mqprio qdisc creates a sub-qdisc per tx queue, each of which is
then called independently for enqueue and dequeue operations. However,
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case, add a check when calculating
stats and aggregate the per cpu stats if needed.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_mqprio.c |   61 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 0a4cf27..d2f33ac 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -233,22 +233,32 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	unsigned char *b = skb_tail_pointer(skb);
 	struct tc_mqprio_qopt opt = { 0 };
 	struct Qdisc *qdisc;
-	unsigned int i;
+	unsigned int ntx, tc;
 
 	sch->q.qlen = 0;
 	memset(&sch->bstats, 0, sizeof(sch->bstats));
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		qdisc = rtnl_dereference(netdev_get_tx_queue(dev, i)->qdisc);
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+		struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+		__u32 qlen = 0;
+
+		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
-		sch->q.qlen		+= qdisc->q.qlen;
-		sch->bstats.bytes	+= qdisc->bstats.bytes;
-		sch->bstats.packets	+= qdisc->bstats.packets;
-		sch->qstats.backlog	+= qdisc->qstats.backlog;
-		sch->qstats.drops	+= qdisc->qstats.drops;
-		sch->qstats.requeues	+= qdisc->qstats.requeues;
-		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+
+		if (qdisc_is_percpu_stats(qdisc)) {
+			cpu_bstats = qdisc->cpu_bstats;
+			cpu_qstats = qdisc->cpu_qstats;
+		}
+
+		qlen = qdisc_qlen_sum(qdisc);
+
+		__gnet_stats_copy_basic(NULL, &sch->bstats,
+					cpu_bstats, &qdisc->bstats);
+		__gnet_stats_copy_queue(&sch->qstats,
+					cpu_qstats, &qdisc->qstats, qlen);
+
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
 
@@ -256,9 +266,9 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
 	opt.hw = priv->hw_offload;
 
-	for (i = 0; i < netdev_get_num_tc(dev); i++) {
-		opt.count[i] = dev->tc_to_txq[i].count;
-		opt.offset[i] = dev->tc_to_txq[i].offset;
+	for (tc = 0; tc < netdev_get_num_tc(dev); tc++) {
+		opt.count[tc] = dev->tc_to_txq[tc].count;
+		opt.offset[tc] = dev->tc_to_txq[tc].offset;
 	}
 
 	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
@@ -336,7 +346,6 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 	if (cl <= netdev_get_num_tc(dev)) {
 		int i;
 		__u32 qlen = 0;
-		struct Qdisc *qdisc;
 		struct gnet_stats_queue qstats = {0};
 		struct gnet_stats_basic_packed bstats = {0};
 		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
@@ -351,18 +360,26 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 
 		for (i = tc.offset; i < tc.offset + tc.count; i++) {
 			struct netdev_queue *q = netdev_get_tx_queue(dev, i);
+			struct Qdisc *qdisc = rtnl_dereference(q->qdisc);
+			struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+			struct gnet_stats_queue __percpu *cpu_qstats = NULL;
 
-			qdisc = rtnl_dereference(q->qdisc);
 			spin_lock_bh(qdisc_lock(qdisc));
-			qlen		  += qdisc->q.qlen;
-			bstats.bytes      += qdisc->bstats.bytes;
-			bstats.packets    += qdisc->bstats.packets;
-			qstats.backlog    += qdisc->qstats.backlog;
-			qstats.drops      += qdisc->qstats.drops;
-			qstats.requeues   += qdisc->qstats.requeues;
-			qstats.overlimits += qdisc->qstats.overlimits;
+			if (qdisc_is_percpu_stats(qdisc)) {
+				cpu_bstats = qdisc->cpu_bstats;
+				cpu_qstats = qdisc->cpu_qstats;
+			}
+
+			qlen = qdisc_qlen_sum(qdisc);
+			__gnet_stats_copy_basic(NULL, &sch->bstats,
+						cpu_bstats, &qdisc->bstats);
+			__gnet_stats_copy_queue(&sch->qstats,
+						cpu_qstats,
+						&qdisc->qstats,
+						qlen);
 			spin_unlock_bh(qdisc_lock(qdisc));
 		}
+
 		/* Reclaim root sleeping lock before completing stats */
 		if (d->lock)
 			spin_lock_bh(d->lock);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 14/17] net: skb_array: expose peek API
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (12 preceding siblings ...)
  2017-05-02 15:40 ` [RFC PATCH v2 13/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio John Fastabend
@ 2017-05-02 15:40 ` John Fastabend
  2017-06-06  3:10   ` Michael S. Tsirkin
  2017-05-02 15:40 ` [RFC PATCH v2 15/17] net: sched: pfifo_fast use skb_array John Fastabend
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

This adds a peek routine to skb_array.h for use with qdisc.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/skb_array.h |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index f4dfade..33f1f0c 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -72,6 +72,11 @@ static inline bool __skb_array_empty(struct skb_array *a)
 	return !__ptr_ring_peek(&a->ring);
 }
 
+static inline struct sk_buff *skb_array_peek(struct skb_array *a)
+{
+	return __ptr_ring_peek(&a->ring);
+}
+
 static inline bool skb_array_empty(struct skb_array *a)
 {
 	return ptr_ring_empty(&a->ring);
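
For readers unfamiliar with the ring underneath skb_array, the following
user-space sketch shows what a peek means on a pointer ring where a NULL
slot marks "empty". It models the idea only; ring_sketch and its helpers
are hypothetical and none of the kernel's locking or memory barriers are
reproduced.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE_SKETCH 4	/* hypothetical ring size */

struct ring_sketch {
	void *slot[RING_SIZE_SKETCH];	/* NULL marks an empty slot */
	int producer, consumer;
};

/* Store a pointer at the producer index; fail if that slot is still in use. */
static int ring_produce(struct ring_sketch *r, void *p)
{
	if (r->slot[r->producer])
		return -1;	/* ring full */
	r->slot[r->producer] = p;
	r->producer = (r->producer + 1) % RING_SIZE_SKETCH;
	return 0;
}

/* Look at the next entry without removing it. */
static void *ring_peek(const struct ring_sketch *r)
{
	return r->slot[r->consumer];
}

/* Remove and return the next entry, or NULL if the ring is empty. */
static void *ring_consume(struct ring_sketch *r)
{
	void *p = r->slot[r->consumer];

	if (p) {
		r->slot[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % RING_SIZE_SKETCH;
	}
	return p;
}
```

A peeked pointer stays meaningful only as long as no other consumer
removes the entry in the meantime, so in a concurrent setting the caller
has to provide that exclusion itself.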

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 14/17] net: skb_array: expose peek API
  2017-05-02 15:40 ` [RFC PATCH v2 14/17] net: skb_array: expose peek API John Fastabend
@ 2017-06-06  3:10   ` Michael S. Tsirkin
  0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-06-06  3:10 UTC (permalink / raw)
  To: John Fastabend; +Cc: eric.dumazet, netdev

On Tue, May 02, 2017 at 08:40:32AM -0700, John Fastabend wrote:
> This adds a peek routine to skb_array.h for use with qdisc.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  include/linux/skb_array.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> index f4dfade..33f1f0c 100644
> --- a/include/linux/skb_array.h
> +++ b/include/linux/skb_array.h
> @@ -72,6 +72,11 @@ static inline bool __skb_array_empty(struct skb_array *a)
>  	return !__ptr_ring_peek(&a->ring);
>  }
>  
> +static inline struct sk_buff *skb_array_peek(struct skb_array *a)
> +{
> +	return __ptr_ring_peek(&a->ring);
> +}
> +
>  static inline bool skb_array_empty(struct skb_array *a)
>  {
>  	return ptr_ring_empty(&a->ring);

I think it's better to call this __skb_array_peek: the issue
is that callers must be careful with its use - see the comment
near __ptr_ring_peek.

-- 
MST

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 15/17] net: sched: pfifo_fast use skb_array
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (13 preceding siblings ...)
  2017-05-02 15:40 ` [RFC PATCH v2 14/17] net: skb_array: expose peek API John Fastabend
@ 2017-05-02 15:40 ` John Fastabend
  2017-05-02 15:41 ` [RFC PATCH v2 16/17] net: skb_array additions for unlocked consumer John Fastabend
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

This converts the pfifo_fast qdisc to use the skb_array data structure
and sets the lockless qdisc bit.

This also removes the logic used to pick the next band to dequeue from
and instead just checks a per priority array for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |  134 ++++++++++++++++++++++++++++-------------------
 1 file changed, 81 insertions(+), 53 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index db5f7a0..be5a201 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -26,6 +26,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/if_vlan.h>
+#include <linux/skb_array.h>
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -578,93 +579,95 @@ struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
 
 /*
  * Private data for a pfifo_fast scheduler containing:
- * 	- queues for the three band
- * 	- bitmap indicating which of the bands contain skbs
+ *	- rings for priority bands
  */
 struct pfifo_fast_priv {
-	u32 bitmap;
-	struct qdisc_skb_head q[PFIFO_FAST_BANDS];
+	struct skb_array q[PFIFO_FAST_BANDS];
 };
 
-/*
- * Convert a bitmap to the first band number where an skb is queued, where:
- * 	bitmap=0 means there are no skbs on any band.
- * 	bitmap=1 means there is an skb on band 0.
- *	bitmap=7 means there are skbs on all 3 bands, etc.
- */
-static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
-
-static inline struct qdisc_skb_head *band2list(struct pfifo_fast_priv *priv,
-					     int band)
+static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
+					  int band)
 {
-	return priv->q + band;
+	return &priv->q[band];
 }
 
 static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
 			      struct sk_buff **to_free)
 {
-	if (qdisc->q.qlen < qdisc_dev(qdisc)->tx_queue_len) {
-		int band = prio2band[skb->priority & TC_PRIO_MAX];
-		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-		struct qdisc_skb_head *list = band2list(priv, band);
-
-		priv->bitmap |= (1 << band);
-		qdisc->q.qlen++;
-		return __qdisc_enqueue_tail(skb, qdisc, list);
-	}
+	int band = prio2band[skb->priority & TC_PRIO_MAX];
+	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+	struct skb_array *q = band2list(priv, band);
+	int err;
 
-	return qdisc_drop(skb, qdisc, to_free);
+	err = skb_array_produce_bh(q, skb);
+
+	if (unlikely(err))
+		return qdisc_drop_cpu(skb, qdisc, to_free);
+
+	qdisc_qstats_cpu_qlen_inc(qdisc);
+	qdisc_qstats_cpu_backlog_inc(qdisc, skb);
+	return NET_XMIT_SUCCESS;
 }
 
 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
-
-	if (likely(band >= 0)) {
-		struct qdisc_skb_head *qh = band2list(priv, band);
-		struct sk_buff *skb = __qdisc_dequeue_head(qh);
+	struct sk_buff *skb = NULL;
+	int band;
 
-		if (likely(skb != NULL)) {
-			qdisc_qstats_backlog_dec(qdisc, skb);
-			qdisc_bstats_update(qdisc, skb);
-		}
+	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+		struct skb_array *q = band2list(priv, band);
 
-		qdisc->q.qlen--;
-		if (qh->qlen == 0)
-			priv->bitmap &= ~(1 << band);
+		if (__skb_array_empty(q))
+			continue;
 
-		return skb;
+		skb = skb_array_consume_bh(q);
+	}
+	if (likely(skb)) {
+		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
+		qdisc_bstats_cpu_update(qdisc, skb);
+		qdisc_qstats_cpu_qlen_dec(qdisc);
 	}
 
-	return NULL;
+	return skb;
 }
 
 static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
+	struct sk_buff *skb = NULL;
+	int band;
 
-	if (band >= 0) {
-		struct qdisc_skb_head *qh = band2list(priv, band);
+	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+		struct skb_array *q = band2list(priv, band);
 
-		return qh->head;
+		skb = skb_array_peek(q);
+		if (!skb)
+			continue;
 	}
 
-	return NULL;
+	return skb;
 }
 
 static void pfifo_fast_reset(struct Qdisc *qdisc)
 {
-	int prio;
+	int i, band;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		__qdisc_reset_queue(band2list(priv, prio));
+	for (band = 0; band < PFIFO_FAST_BANDS; band++) {
+		struct skb_array *q = band2list(priv, band);
+		struct sk_buff *skb;
 
-	priv->bitmap = 0;
-	qdisc->qstats.backlog = 0;
-	qdisc->q.qlen = 0;
+		while ((skb = skb_array_consume_bh(q)) != NULL)
+			__skb_array_destroy_skb(skb);
+	}
+
+	for_each_possible_cpu(i) {
+		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
+
+		q->backlog = 0;
+		q->qlen = 0;
+	}
 }
 
 static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
@@ -682,17 +685,40 @@ static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
 
 static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
 {
-	int prio;
+	unsigned int qlen = qdisc_dev(qdisc)->tx_queue_len;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+	int prio;
+
+	/* guard against zero length rings */
+	if (!qlen)
+		return -EINVAL;
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		qdisc_skb_head_init(band2list(priv, prio));
+	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+		struct skb_array *q = band2list(priv, prio);
+		int err;
+
+		err = skb_array_init(q, qlen, GFP_KERNEL);
+		if (err)
+			return -ENOMEM;
+	}
 
 	/* Can by-pass the queue discipline */
 	qdisc->flags |= TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
+static void pfifo_fast_destroy(struct Qdisc *sch)
+{
+	struct pfifo_fast_priv *priv = qdisc_priv(sch);
+	int prio;
+
+	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+		struct skb_array *q = band2list(priv, prio);
+
+		skb_array_cleanup(q);
+	}
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.id		=	"pfifo_fast",
 	.priv_size	=	sizeof(struct pfifo_fast_priv),
@@ -700,9 +726,11 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.dequeue	=	pfifo_fast_dequeue,
 	.peek		=	pfifo_fast_peek,
 	.init		=	pfifo_fast_init,
+	.destroy	=	pfifo_fast_destroy,
 	.reset		=	pfifo_fast_reset,
 	.dump		=	pfifo_fast_dump,
 	.owner		=	THIS_MODULE,
+	.static_flags	=	TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
 EXPORT_SYMBOL(pfifo_fast_ops);
 

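For readers following the pfifo_fast conversion above, here is a minimal
userspace sketch of the same strict-priority band scan. It is not kernel
code: `struct ring`, `ring_produce()`, `ring_consume()`, `bands_dequeue()`
and `RING_SIZE` are hypothetical stand-ins for skb_array and the pfifo_fast
helpers, and the sketch has no locking and no per-cpu statistics.

```c
#include <assert.h>
#include <stddef.h>

#define BANDS     3	/* mirrors PFIFO_FAST_BANDS */
#define RING_SIZE 4	/* hypothetical; the patch sizes rings from tx_queue_len */

/* Hypothetical stand-in for skb_array: a fixed-size ring of pointers
 * in which a NULL slot means "empty".
 */
struct ring {
	void *slot[RING_SIZE];
	unsigned int head;	/* next slot to produce into */
	unsigned int tail;	/* next slot to consume from */
};

/* Returns 0 on success, -1 when the ring is full (the caller would drop). */
static int ring_produce(struct ring *r, void *p)
{
	if (r->slot[r->head])
		return -1;
	r->slot[r->head] = p;
	r->head = (r->head + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest entry, or NULL when the ring is empty. */
static void *ring_consume(struct ring *r)
{
	void *p = r->slot[r->tail];

	if (p) {
		r->slot[r->tail] = NULL;
		r->tail = (r->tail + 1) % RING_SIZE;
	}
	return p;
}

/* Strict priority: scan from band 0 and return the first packet found. */
static void *bands_dequeue(struct ring bands[BANDS])
{
	void *p = NULL;
	int band;

	for (band = 0; band < BANDS && !p; band++)
		p = ring_consume(&bands[band]);
	return p;
}
```

A caller would zero-initialize an array of BANDS rings, produce into the
band chosen from the packet priority, and call bands_dequeue() to drain
band 0 before band 1 before band 2. No bitmap of non-empty bands is kept;
the dequeue side simply probes each ring in order, as the patch does.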

* [RFC PATCH v2 16/17] net: skb_array additions for unlocked consumer
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (14 preceding siblings ...)
  2017-05-02 15:40 ` [RFC PATCH v2 15/17] net: sched: pfifo_fast use skb_array John Fastabend
@ 2017-05-02 15:41 ` John Fastabend
  2017-05-02 15:41 ` [RFC PATCH v2 17/17] net: sched: lock once per bulk dequeue John Fastabend
  2017-06-06  4:30 ` [RFC PATCH v2 00/17] latest qdisc patch series Michael S. Tsirkin
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/skb_array.h |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 33f1f0c..b28db83 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -117,6 +117,11 @@ static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
 	return ptr_ring_consume_bh(&a->ring);
 }
 
+static inline struct sk_buff *__skb_array_consume(struct skb_array *a)
+{
+	return __ptr_ring_consume(&a->ring);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
 	if (likely(skb)) {


* [RFC PATCH v2 17/17] net: sched: lock once per bulk dequeue
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (15 preceding siblings ...)
  2017-05-02 15:41 ` [RFC PATCH v2 16/17] net: skb_array additions for unlocked consumer John Fastabend
@ 2017-05-02 15:41 ` John Fastabend
  2017-06-06  4:30 ` [RFC PATCH v2 00/17] latest qdisc patch series Michael S. Tsirkin
  17 siblings, 0 replies; 22+ messages in thread
From: John Fastabend @ 2017-05-02 15:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, john.fastabend


---
 net/sched/sch_generic.c |   53 +++++++++++++++++++++--------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index be5a201..0f0831a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,33 +205,22 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 {
 	const struct netdev_queue *txq = q->dev_queue;
 	struct sk_buff *skb = NULL;
+	spinlock_t *lock = NULL;
 
-	*packets = 1;
-	if (unlikely(!skb_queue_empty(&q->gso_skb))) {
-		spinlock_t *lock = NULL;
-
-		if (q->flags & TCQ_F_NOLOCK) {
-			lock = qdisc_lock(q);
-			spin_lock(lock);
-		}
-
-		skb = skb_peek(&q->gso_skb);
-
-		/* skb may be null if another cpu pulls gso_skb off in between
-		 * empty check and lock.
-		 */
-		if (!skb) {
-			if (lock)
-				spin_unlock(lock);
-			goto validate;
-		}
+	if (q->flags & TCQ_F_NOLOCK) {
+		lock = qdisc_lock(q);
+		spin_lock(lock);
+	}
 
+	*packets = 1;
+	skb = skb_peek(&q->gso_skb);
+	if (unlikely(skb)) {
 		/* skb in gso_skb were already validated */
 		*validate = false;
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
-			skb = __skb_dequeue(&q->gso_skb);
+			__skb_unlink(skb, &q->gso_skb);
 			if (qdisc_is_percpu_stats(q)) {
 				qdisc_qstats_cpu_backlog_dec(q, skb);
 				qdisc_qstats_cpu_qlen_dec(q);
@@ -246,17 +235,18 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			spin_unlock(lock);
 		return skb;
 	}
-validate:
-	*validate = true;
 
-	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
-		return skb;
+	*validate = true;
 
 	skb = qdisc_dequeue_skb_bad_txq(q);
-	if (unlikely(skb))
+	if (unlikely(skb)) {
 		goto bulk;
-	skb = q->dequeue(q);
+	}
+
+	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+	    !netif_xmit_frozen_or_stopped(txq))
+		skb = q->dequeue(q);
+
 	if (skb) {
 bulk:
 		if (qdisc_may_bulk(q))
@@ -264,6 +254,11 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 		else
 			try_bulk_dequeue_skb_slow(q, skb, packets);
 	}
+
+blocked:
+	if (lock)
+		spin_unlock(lock);
+
 	return skb;
 }
 
@@ -621,7 +616,7 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 		if (__skb_array_empty(q))
 			continue;
 
-		skb = skb_array_consume_bh(q);
+		skb = __skb_array_consume(q);
 	}
 	if (likely(skb)) {
 		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
@@ -658,7 +653,7 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
 		struct skb_array *q = band2list(priv, band);
 		struct sk_buff *skb;
 
-		while ((skb = skb_array_consume_bh(q)) != NULL)
+		while ((skb = __skb_array_consume(q)) != NULL)
 			__skb_array_destroy_skb(skb);
 	}
 

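Patches 16 and 17 carry little or no description, so here is a hedged
userspace sketch of the pattern they appear to rely on: an unlocked consume
primitive that is only safe when the caller already holds a lock, plus a
bulk path that takes that lock once. Everything here is hypothetical
(`struct ring`, `ring_consume_nolock()`, `ring_consume_bulk()`, the
`lock_acquisitions` counter), and for simplicity the lock lives in the ring
itself, whereas the patch takes the qdisc lock in dequeue_skb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical capacity */

struct ring {
	atomic_flag consumer_lock;
	unsigned long lock_acquisitions;	/* instrumentation for this sketch */
	void *slot[RING_SIZE];			/* NULL slot means "empty" */
	unsigned int head, tail;
};

static void ring_init(struct ring *r)
{
	size_t i;

	atomic_flag_clear(&r->consumer_lock);
	r->lock_acquisitions = 0;
	r->head = r->tail = 0;
	for (i = 0; i < RING_SIZE; i++)
		r->slot[i] = NULL;
}

static void ring_lock(struct ring *r)
{
	while (atomic_flag_test_and_set_explicit(&r->consumer_lock,
						 memory_order_acquire))
		;	/* spin */
	r->lock_acquisitions++;
}

static void ring_unlock(struct ring *r)
{
	atomic_flag_clear_explicit(&r->consumer_lock, memory_order_release);
}

/* Producer side left unlocked for brevity (single producer assumed). */
static int ring_produce(struct ring *r, void *p)
{
	if (r->slot[r->head])
		return -1;
	r->slot[r->head] = p;
	r->head = (r->head + 1) % RING_SIZE;
	return 0;
}

/* Unlocked variant, in the spirit of __skb_array_consume(): the caller
 * must already serialize consumers.
 */
static void *ring_consume_nolock(struct ring *r)
{
	void *p = r->slot[r->tail];

	if (p) {
		r->slot[r->tail] = NULL;
		r->tail = (r->tail + 1) % RING_SIZE;
	}
	return p;
}

/* Locked variant: one lock round trip per packet. */
static void *ring_consume(struct ring *r)
{
	void *p;

	ring_lock(r);
	p = ring_consume_nolock(r);
	ring_unlock(r);
	return p;
}

/* Bulk variant: one lock round trip for up to max packets. */
static size_t ring_consume_bulk(struct ring *r, void **out, size_t max)
{
	size_t n = 0;

	ring_lock(r);
	while (n < max && (out[n] = ring_consume_nolock(r)) != NULL)
		n++;
	ring_unlock(r);
	return n;
}
```

Draining N packets through ring_consume() bumps lock_acquisitions N times,
while ring_consume_bulk() bumps it once. That is the saving the bulk path
is after in the uncontended case.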

* Re: [RFC PATCH v2 00/17] latest qdisc patch series
  2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
                   ` (16 preceding siblings ...)
  2017-05-02 15:41 ` [RFC PATCH v2 17/17] net: sched: lock once per bulk dequeue John Fastabend
@ 2017-06-06  4:30 ` Michael S. Tsirkin
  17 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2017-06-06  4:30 UTC (permalink / raw)
  To: John Fastabend; +Cc: eric.dumazet, netdev, Jason Wang

On Tue, May 02, 2017 at 08:36:03AM -0700, John Fastabend wrote:
> I am not going to be able to work on this for a few days so I figured
> it might be worth getting some feedback if there is any. Any thoughts
> on how to squeeze a few extra pps out of this would be very useful.

Batching the dequeue into a temporary buffer, then scanning that
buffer and transmitting without locks, might be a good idea.
It certainly helped vhost. You might need to requeue
if you are unable to xmit; there's now support for this.
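
A rough userspace sketch of that idea, under stated assumptions and not
kernel code: `struct queue`, the `q_*` helpers, `fake_xmit()` and
`xmit_batch()` are all hypothetical names, packets are plain ints, and the
lock is a C11 atomic_flag spinlock. It pulls up to BATCH items under one
lock, transmits them with the lock dropped, and requeues whatever the
device refused.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QLEN  8		/* hypothetical queue capacity */
#define BATCH 4		/* hypothetical batch size */

/* Hypothetical locked FIFO standing in for the qdisc queue. */
struct queue {
	atomic_flag lock;
	int item[QLEN];
	size_t head;	/* index of the oldest item */
	size_t count;	/* number of queued items */
};

static void q_init(struct queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = 0;
	q->count = 0;
}

static void q_lock(struct queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void q_unlock(struct queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* The *_locked helpers expect the caller to hold q->lock. */
static int q_push_tail_locked(struct queue *q, int v)
{
	if (q->count == QLEN)
		return -1;
	q->item[(q->head + q->count) % QLEN] = v;
	q->count++;
	return 0;
}

static int q_push_head_locked(struct queue *q, int v)
{
	if (q->count == QLEN)
		return -1;
	q->head = (q->head + QLEN - 1) % QLEN;
	q->item[q->head] = v;
	q->count++;
	return 0;
}

/* Fake device: accepts packets while it has budget left. */
static int dev_budget;
static int dev_last = -1;

static int fake_xmit(int pkt)
{
	if (dev_budget <= 0)
		return -1;
	dev_budget--;
	dev_last = pkt;
	return 0;
}

/* Returns the number of packets handed to xmit successfully. */
static size_t xmit_batch(struct queue *q, int (*xmit)(int))
{
	int buf[BATCH];
	size_t n = 0, sent, i;

	q_lock(q);			/* one lock round trip per batch */
	while (n < BATCH && q->count > 0) {
		buf[n++] = q->item[q->head];
		q->head = (q->head + 1) % QLEN;
		q->count--;
	}
	q_unlock(q);

	for (sent = 0; sent < n; sent++)	/* transmit without the lock */
		if (xmit(buf[sent]))
			break;

	if (sent < n) {
		/* Requeue the unsent tail at the head, newest first, so the
		 * original order is preserved. A concurrent producer could
		 * have filled the queue meanwhile; this sketch ignores that.
		 */
		q_lock(q);
		for (i = n; i > sent; i--)
			(void)q_push_head_locked(q, buf[i - 1]);
		q_unlock(q);
	}
	return sent;
}
```

With six packets queued and a device budget of two, one xmit_batch() call
sends two packets and puts the other two dequeued packets back in front of
the remaining ones. The cost of the approach is the extra requeue pass on
the failure path.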

-- 
MST



Thread overview: 22+ messages
2017-05-02 15:36 [RFC PATCH v2 00/17] latest qdisc patch series John Fastabend
2017-05-02 15:36 ` [RFC PATCH v2 01/17] net: sched: cleanup qdisc_run and __qdisc_run semantics John Fastabend
2017-05-02 15:36 ` [RFC PATCH v2 02/17] net: sched: allow qdiscs to handle locking John Fastabend
2017-05-02 15:37 ` [RFC PATCH v2 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path John Fastabend
2017-05-02 15:37 ` [RFC PATCH v2 04/17] net: sched: provide per cpu qstat helpers John Fastabend
2017-05-02 15:37 ` [RFC PATCH v2 05/17] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
2017-05-02 15:38 ` [RFC PATCH v2 06/17] net: sched: explicit locking in gso_cpu fallback John Fastabend
2017-05-02 15:38 ` [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc John Fastabend
2017-05-02 20:00   ` Jesper Dangaard Brouer
2017-05-02 15:38 ` [RFC PATCH v2 08/17] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
2017-05-02 15:39 ` [RFC PATCH v2 09/17] net: sched: check for frozen queue before skb_bad_txq check John Fastabend
2017-05-02 15:39 ` [RFC PATCH v2 10/17] net: sched: qdisc_qlen for per cpu logic John Fastabend
2017-05-02 15:39 ` [RFC PATCH v2 11/17] net: sched: helper to sum qlen John Fastabend
2017-05-02 15:39 ` [RFC PATCH v2 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
2017-06-19  9:21   ` huaixin chang
2017-05-02 15:40 ` [RFC PATCH v2 13/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio John Fastabend
2017-05-02 15:40 ` [RFC PATCH v2 14/17] net: skb_array: expose peek API John Fastabend
2017-06-06  3:10   ` Michael S. Tsirkin
2017-05-02 15:40 ` [RFC PATCH v2 15/17] net: sched: pfifo_fast use skb_array John Fastabend
2017-05-02 15:41 ` [RFC PATCH v2 16/17] net: skb_array additions for unlocked consumer John Fastabend
2017-05-02 15:41 ` [RFC PATCH v2 17/17] net: sched: lock once per bulk dequeue John Fastabend
2017-06-06  4:30 ` [RFC PATCH v2 00/17] latest qdisc patch series Michael S. Tsirkin
