[PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
@ 2016-06-22  6:16 Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22  6:16 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet

First patch adds an additional parameter to ->enqueue() qdisc method
so that drops can be done outside of critical section
(after locks are released).

Then fq_codel can have a small optimization to reduce number of cache
lines misses during a drop event
(possibly accumulating hundreds of packets to be freed).

A small htb change exports the backlog in class dumps.

Final patch adds bulk dequeue to qdiscs that were lacking this feature.

This series brings a nice qdisc performance increase (more than 80 %
in some cases).

Eric Dumazet (4):
  net_sched: drop packets after root qdisc lock is released
  net_sched: fq_codel: cache skb->truesize into skb->cb
  net_sched: sch_htb: export class backlog in dumps
  net_sched: generalize bulk dequeue

 include/net/codel_qdisc.h |  1 +
 include/net/sch_generic.h | 48 ++++++++++++++++++++---------
 net/core/dev.c            |  7 +++--
 net/sched/sch_atm.c       |  9 +++---
 net/sched/sch_blackhole.c |  5 +--
 net/sched/sch_cbq.c       |  7 +++--
 net/sched/sch_choke.c     | 16 +++++-----
 net/sched/sch_codel.c     |  8 +++--
 net/sched/sch_drr.c       |  7 +++--
 net/sched/sch_dsmark.c    |  9 +++---
 net/sched/sch_fifo.c      | 15 +++++----
 net/sched/sch_fq.c        |  7 +++--
 net/sched/sch_fq_codel.c  | 22 +++++++------
 net/sched/sch_generic.c   | 78 ++++++++++++++++++++++++++++++++++++++---------
 net/sched/sch_gred.c      |  7 +++--
 net/sched/sch_hfsc.c      |  6 ++--
 net/sched/sch_hhf.c       | 10 +++---
 net/sched/sch_htb.c       | 24 ++++++++++-----
 net/sched/sch_multiq.c    |  7 +++--
 net/sched/sch_netem.c     | 25 +++++++++------
 net/sched/sch_pie.c       |  5 +--
 net/sched/sch_plug.c      |  5 +--
 net/sched/sch_prio.c      |  4 +--
 net/sched/sch_qfq.c       |  7 +++--
 net/sched/sch_red.c       |  7 +++--
 net/sched/sch_sfb.c       |  7 +++--
 net/sched/sch_sfq.c       |  8 ++---
 net/sched/sch_tbf.c       | 16 +++++-----
 net/sched/sch_teql.c      |  4 +--
 29 files changed, 247 insertions(+), 134 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
@ 2016-06-22  6:16 ` Eric Dumazet
  2016-06-22 15:14   ` Jesper Dangaard Brouer
  2016-06-22  6:16 ` [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb Eric Dumazet
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22  6:16 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet

Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sch_generic.h | 41 ++++++++++++++++++++++++++++++-----------
 net/core/dev.c            |  7 +++++--
 net/sched/sch_atm.c       |  9 +++++----
 net/sched/sch_blackhole.c |  5 +++--
 net/sched/sch_cbq.c       |  7 ++++---
 net/sched/sch_choke.c     | 16 +++++++++-------
 net/sched/sch_codel.c     |  8 +++++---
 net/sched/sch_drr.c       |  7 ++++---
 net/sched/sch_dsmark.c    |  9 +++++----
 net/sched/sch_fifo.c      | 15 +++++++++------
 net/sched/sch_fq.c        |  7 ++++---
 net/sched/sch_fq_codel.c  | 15 +++++++++------
 net/sched/sch_generic.c   | 10 ++++++----
 net/sched/sch_gred.c      |  7 ++++---
 net/sched/sch_hfsc.c      |  6 +++---
 net/sched/sch_hhf.c       | 10 +++++-----
 net/sched/sch_htb.c       | 10 ++++++----
 net/sched/sch_multiq.c    |  7 ++++---
 net/sched/sch_netem.c     | 25 +++++++++++++++----------
 net/sched/sch_pie.c       |  5 +++--
 net/sched/sch_plug.c      |  5 +++--
 net/sched/sch_prio.c      |  4 ++--
 net/sched/sch_qfq.c       |  7 ++++---
 net/sched/sch_red.c       |  7 ++++---
 net/sched/sch_sfb.c       |  7 ++++---
 net/sched/sch_sfq.c       |  8 ++++----
 net/sched/sch_tbf.c       | 16 +++++++++-------
 net/sched/sch_teql.c      |  4 ++--
 28 files changed, 170 insertions(+), 114 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4f7cee8344c4..04e84c07c94f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -37,8 +37,10 @@ struct qdisc_size_table {
 };
 
 struct Qdisc {
-	int 			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
-	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
+	int 			(*enqueue)(struct sk_buff *skb,
+					   struct Qdisc *sch,
+					   struct sk_buff **to_free);
+	struct sk_buff *	(*dequeue)(struct Qdisc *sch);
 	unsigned int		flags;
 #define TCQ_F_BUILTIN		1
 #define TCQ_F_INGRESS		2
@@ -160,7 +162,9 @@ struct Qdisc_ops {
 	char			id[IFNAMSIZ];
 	int			priv_size;
 
-	int 			(*enqueue)(struct sk_buff *, struct Qdisc *);
+	int 			(*enqueue)(struct sk_buff *skb,
+					   struct Qdisc *sch,
+					   struct sk_buff **to_free);
 	struct sk_buff *	(*dequeue)(struct Qdisc *);
 	struct sk_buff *	(*peek)(struct Qdisc *);
 
@@ -498,10 +502,11 @@ static inline void qdisc_calculate_pkt_len(struct sk_buff *skb,
 #endif
 }
 
-static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				struct sk_buff **to_free)
 {
 	qdisc_calculate_pkt_len(skb, sch);
-	return sch->enqueue(skb, sch);
+	return sch->enqueue(skb, sch, to_free);
 }
 
 static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
@@ -626,24 +631,36 @@ static inline struct sk_buff *qdisc_dequeue_head(struct Qdisc *sch)
 	return __qdisc_dequeue_head(sch, &sch->q);
 }
 
+/* Instead of calling kfree_skb() while root qdisc lock is held,
+ * queue the skb for future freeing at end of __dev_xmit_skb()
+ */
+static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
+{
+	skb->next = *to_free;
+	*to_free = skb;
+}
+
 static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
-					      struct sk_buff_head *list)
+						   struct sk_buff_head *list,
+						   struct sk_buff **to_free)
 {
 	struct sk_buff *skb = __skb_dequeue(list);
 
 	if (likely(skb != NULL)) {
 		unsigned int len = qdisc_pkt_len(skb);
+
 		qdisc_qstats_backlog_dec(sch, skb);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return len;
 	}
 
 	return 0;
 }
 
-static inline unsigned int qdisc_queue_drop_head(struct Qdisc *sch)
+static inline unsigned int qdisc_queue_drop_head(struct Qdisc *sch,
+						 struct sk_buff **to_free)
 {
-	return __qdisc_queue_drop_head(sch, &sch->q);
+	return __qdisc_queue_drop_head(sch, &sch->q, to_free);
 }
 
 static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
@@ -724,9 +741,11 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
 	qdisc_qstats_drop(sch);
 }
 
-static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
+
+static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	qdisc_qstats_drop(sch);
 
 	return NET_XMIT_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index d40593b3b9fb..aba10d2a8bc3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3070,6 +3070,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct netdev_queue *txq)
 {
 	spinlock_t *root_lock = qdisc_lock(q);
+	struct sk_buff *to_free = NULL;
 	bool contended;
 	int rc;
 
@@ -3086,7 +3087,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-		kfree_skb(skb);
+		__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
 		   qdisc_run_begin(q)) {
@@ -3109,7 +3110,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		rc = q->enqueue(skb, q) & NET_XMIT_MASK;
+		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
@@ -3119,6 +3120,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+	if (unlikely(to_free))
+		kfree_skb_list(to_free);
 	if (unlikely(contended))
 		spin_unlock(&q->busylock);
 	return rc;
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index e04ea6994d1c..481e4f12aeb4 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -357,7 +357,8 @@ static struct tcf_proto __rcu **atm_tc_find_tcf(struct Qdisc *sch,
 
 /* --------------------------- Qdisc operations ---------------------------- */
 
-static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
 {
 	struct atm_qdisc_data *p = qdisc_priv(sch);
 	struct atm_flow_data *flow;
@@ -398,10 +399,10 @@ done:
 		switch (result) {
 		case TC_ACT_QUEUED:
 		case TC_ACT_STOLEN:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
 		case TC_ACT_SHOT:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			goto drop;
 		case TC_ACT_RECLASSIFY:
 			if (flow->excess)
@@ -413,7 +414,7 @@ done:
 #endif
 	}
 
-	ret = qdisc_enqueue(skb, flow->q);
+	ret = qdisc_enqueue(skb, flow->q, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
 drop: __maybe_unused
 		if (net_xmit_drop_count(ret)) {
diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index 3fee70d9814f..c98a61e980ba 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -17,9 +17,10 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>
 
-static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_SUCCESS;
 }
 
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index a29fd811d7b9..beb554aa8cfb 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -358,7 +358,8 @@ cbq_mark_toplevel(struct cbq_sched_data *q, struct cbq_class *cl)
 }
 
 static int
-cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+	    struct sk_buff **to_free)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
 	int uninitialized_var(ret);
@@ -370,11 +371,11 @@ cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}
 
-	ret = qdisc_enqueue(skb, cl->q);
+	ret = qdisc_enqueue(skb, cl->q, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		cbq_mark_toplevel(q, cl);
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 789b69ee9e51..3b6d5bd69101 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -115,7 +115,8 @@ static void choke_zap_tail_holes(struct choke_sched_data *q)
 }
 
 /* Drop packet from queue array by creating a "hole" */
-static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
+static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx,
+			      struct sk_buff **to_free)
 {
 	struct choke_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb = q->tab[idx];
@@ -129,7 +130,7 @@ static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
 
 	qdisc_qstats_backlog_dec(sch, skb);
 	qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	--sch->q.qlen;
 }
 
@@ -261,7 +262,8 @@ static bool choke_match_random(const struct choke_sched_data *q,
 	return choke_match_flow(oskb, nskb);
 }
 
-static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	int ret = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	struct choke_sched_data *q = qdisc_priv(sch);
@@ -288,7 +290,7 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		/* Draw a packet at random from queue and compare flow */
 		if (choke_match_random(q, skb, &idx)) {
 			q->stats.matched++;
-			choke_drop_by_idx(sch, idx);
+			choke_drop_by_idx(sch, idx, to_free);
 			goto congestion_drop;
 		}
 
@@ -331,16 +333,16 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 
 	q->stats.pdrop++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 
 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 
 other_drop:
 	if (ret & __NET_XMIT_BYPASS)
 		qdisc_qstats_drop(sch);
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	return ret;
 }
 
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index c5bc424e3b3c..4002df3c7d9f 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -82,7 +82,8 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 {
 	struct Qdisc *sch = ctx;
 
-	qdisc_drop(skb, sch);
+	kfree_skb(skb);
+	qdisc_qstats_drop(sch);
 }
 
 static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
@@ -107,7 +108,8 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 	return skb;
 }
 
-static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			       struct sk_buff **to_free)
 {
 	struct codel_sched_data *q;
 
@@ -117,7 +119,7 @@ static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 	q = qdisc_priv(sch);
 	q->drop_overlimit++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
 static const struct nla_policy codel_policy[TCA_CODEL_MAX + 1] = {
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 22609e4e845f..8af5c59eef84 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -350,7 +350,8 @@ static struct drr_class *drr_classify(struct sk_buff *skb, struct Qdisc *sch,
 	return NULL;
 }
 
-static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct drr_sched *q = qdisc_priv(sch);
 	struct drr_class *cl;
@@ -360,11 +361,11 @@ static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (err & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return err;
 	}
 
-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index b9ba5f658528..1308bbf460f7 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -191,7 +191,8 @@ static inline struct tcf_proto __rcu **dsmark_find_tcf(struct Qdisc *sch,
 
 /* --------------------------- Qdisc operations ---------------------------- */
 
-static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
 {
 	struct dsmark_qdisc_data *p = qdisc_priv(sch);
 	int err;
@@ -234,7 +235,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 #ifdef CONFIG_NET_CLS_ACT
 		case TC_ACT_QUEUED:
 		case TC_ACT_STOLEN:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
 
 		case TC_ACT_SHOT:
@@ -251,7 +252,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		}
 	}
 
-	err = qdisc_enqueue(skb, p->q);
+	err = qdisc_enqueue(skb, p->q, to_free);
 	if (err != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(err))
 			qdisc_qstats_drop(sch);
@@ -264,7 +265,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;
 
 drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 }
 
diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index dea70e3ef0ba..6ea0db427f91 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -19,29 +19,32 @@
 
 /* 1 band FIFO pseudo-"scheduler" */
 
-static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= sch->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
-static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	if (likely(skb_queue_len(&sch->q) < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
-static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			      struct sk_buff **to_free)
 {
 	if (likely(skb_queue_len(&sch->q) < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);
 
 	/* queue full, remove one skb to fulfill the limit */
-	__qdisc_queue_drop_head(sch, &sch->q);
+	__qdisc_queue_drop_head(sch, &sch->q, to_free);
 	qdisc_qstats_drop(sch);
 	qdisc_enqueue_tail(skb, sch);
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 6eb06674f778..e5458b99e09c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -368,18 +368,19 @@ static void flow_queue_add(struct fq_flow *flow, struct sk_buff *skb)
 	}
 }
 
-static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		      struct sk_buff **to_free)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
 	struct fq_flow *f;
 
 	if (unlikely(sch->q.qlen >= sch->limit))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);
 
 	f = fq_classify(skb, q);
 	if (unlikely(f->qlen >= q->flow_plimit && f != &q->internal)) {
 		q->stat_flows_plimit++;
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);
 	}
 
 	f->qlen++;
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2dc0a849515a..f715195459c9 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -139,7 +139,8 @@ static inline void flow_queue_add(struct fq_codel_flow *flow,
 	skb->next = NULL;
 }
 
-static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
+static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
+				  struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
@@ -172,7 +173,7 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
 		skb = dequeue_head(flow);
 		len += qdisc_pkt_len(skb);
 		mem += skb->truesize;
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 	} while (++i < max_packets && len < threshold);
 
 	flow->dropped += i;
@@ -184,7 +185,8 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
 	return idx;
 }
 
-static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			    struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	unsigned int idx, prev_backlog, prev_qlen;
@@ -197,7 +199,7 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (idx == 0) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}
 	idx--;
@@ -229,7 +231,7 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	 * So instead of dropping a single packet, drop half of its backlog
 	 * with a 64 packets limit to not add a too big cpu spike here.
 	 */
-	ret = fq_codel_drop(sch, q->drop_batch_size);
+	ret = fq_codel_drop(sch, q->drop_batch_size, to_free);
 
 	prev_qlen -= sch->q.qlen;
 	prev_backlog -= sch->qstats.backlog;
@@ -276,7 +278,8 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 {
 	struct Qdisc *sch = ctx;
 
-	qdisc_drop(skb, sch);
+	kfree_skb(skb);
+	qdisc_qstats_drop(sch);
 }
 
 static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 773b632e1e33..ff86606954f2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -348,9 +348,10 @@ EXPORT_SYMBOL(netif_carrier_off);
    cheaper.
  */
 
-static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
+static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+			struct sk_buff **to_free)
 {
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	return NET_XMIT_CN;
 }
 
@@ -439,7 +440,8 @@ static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
 	return priv->q + band;
 }
 
-static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
+static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+			      struct sk_buff **to_free)
 {
 	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
 		int band = prio2band[skb->priority & TC_PRIO_MAX];
@@ -451,7 +453,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
 		return __qdisc_enqueue_tail(skb, qdisc, list);
 	}
 
-	return qdisc_drop(skb, qdisc);
+	return qdisc_drop(skb, qdisc, to_free);
 }
 
 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index b5fb63c7be02..c78a093c551a 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -149,7 +149,8 @@ static inline int gred_use_harddrop(struct gred_sched *t)
 	return t->red_flags & TC_RED_HARDDROP;
 }
 
-static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
 {
 	struct gred_sched_data *q = NULL;
 	struct gred_sched *t = qdisc_priv(sch);
@@ -237,10 +238,10 @@ static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 	q->stats.pdrop++;
 drop:
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 
 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 }
 
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index bd08c363a26d..8cb5eff7b79c 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1572,7 +1572,7 @@ hfsc_dump_qdisc(struct Qdisc *sch, struct sk_buff *skb)
 }
 
 static int
-hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct hfsc_class *cl;
 	int uninitialized_var(err);
@@ -1581,11 +1581,11 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (err & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return err;
 	}
 
-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index c44593b8e65a..e3d0458af17b 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -345,7 +345,7 @@ static void bucket_add(struct wdrr_bucket *bucket, struct sk_buff *skb)
 	skb->next = NULL;
 }
 
-static unsigned int hhf_drop(struct Qdisc *sch)
+static unsigned int hhf_drop(struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
 	struct wdrr_bucket *bucket;
@@ -359,16 +359,16 @@ static unsigned int hhf_drop(struct Qdisc *sch)
 		struct sk_buff *skb = dequeue_head(bucket);
 
 		sch->q.qlen--;
-		qdisc_qstats_drop(sch);
 		qdisc_qstats_backlog_dec(sch, skb);
-		kfree_skb(skb);
+		qdisc_drop(skb, sch, to_free);
 	}
 
 	/* Return id of the bucket from which the packet was dropped. */
 	return bucket - q->buckets;
 }
 
-static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
 	enum wdrr_bucket_idx idx;
@@ -406,7 +406,7 @@ static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	/* Return Congestion Notification only if we dropped a packet from this
 	 * bucket.
 	 */
-	if (hhf_drop(sch) == idx)
+	if (hhf_drop(sch, to_free) == idx)
 		return NET_XMIT_CN;
 
 	/* As we dropped a packet, better let upper stack know this. */
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index a454605ab5cb..f3882259c385 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -569,7 +569,8 @@ static inline void htb_deactivate(struct htb_sched *q, struct htb_class *cl)
 	list_del_init(&cl->un.leaf.drop_list);
 }
 
-static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	int uninitialized_var(ret);
 	struct htb_sched *q = qdisc_priv(sch);
@@ -581,16 +582,17 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 			__skb_queue_tail(&q->direct_queue, skb);
 			q->direct_pkts++;
 		} else {
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		}
 #ifdef CONFIG_NET_CLS_ACT
 	} else if (!cl) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 #endif
-	} else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q)) != NET_XMIT_SUCCESS) {
+	} else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q,
+					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
 			cl->qstats.drops++;
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 5ea93305d705..9ffbb025b37e 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -65,7 +65,8 @@ multiq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 }
 
 static int
-multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+	       struct sk_buff **to_free)
 {
 	struct Qdisc *qdisc;
 	int ret;
@@ -76,12 +77,12 @@ multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}
 #endif
 
-	ret = qdisc_enqueue(skb, qdisc);
+	ret = qdisc_enqueue(skb, qdisc, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index e271967439bf..ccca8ca4c722 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -397,7 +397,8 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
  * when we statistically choose to corrupt one, we instead segment it, returning
  * the first packet to be corrupted, and re-enqueue the remaining frames
  */
-static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
+static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
+				     struct sk_buff **to_free)
 {
 	struct sk_buff *segs;
 	netdev_features_t features = netif_skb_features(skb);
@@ -405,7 +406,7 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 
 	if (IS_ERR_OR_NULL(segs)) {
-		qdisc_drop(skb, sch);
+		qdisc_drop(skb, sch, to_free);
 		return NULL;
 	}
 	consume_skb(skb);
@@ -418,7 +419,8 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
  * 	NET_XMIT_DROP: queue length didn't change.
  *      NET_XMIT_SUCCESS: one skb was queued.
  */
-static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
 	/* We don't fill cb now as skb_unshare() may invalidate it */
@@ -443,7 +445,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 	if (count == 0) {
 		qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	}
 
@@ -463,7 +465,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
 
 		q->duplicate = 0;
-		rootq->enqueue(skb2, rootq);
+		rootq->enqueue(skb2, rootq, to_free);
 		q->duplicate = dupsave;
 	}
 
@@ -475,7 +477,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	 */
 	if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor)) {
 		if (skb_is_gso(skb)) {
-			segs = netem_segment(skb, sch);
+			segs = netem_segment(skb, sch, to_free);
 			if (!segs)
 				return NET_XMIT_DROP;
 		} else {
@@ -488,7 +490,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		if (!(skb = skb_unshare(skb, GFP_ATOMIC)) ||
 		    (skb->ip_summed == CHECKSUM_PARTIAL &&
 		     skb_checksum_help(skb))) {
-			rc = qdisc_drop(skb, sch);
+			rc = qdisc_drop(skb, sch, to_free);
 			goto finish_segs;
 		}
 
@@ -497,7 +499,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 
 	if (unlikely(skb_queue_len(&sch->q) >= sch->limit))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);
 
 	qdisc_qstats_backlog_inc(sch, skb);
 
@@ -557,7 +559,7 @@ finish_segs:
 			segs->next = NULL;
 			qdisc_skb_cb(segs)->pkt_len = segs->len;
 			last_len = segs->len;
-			rc = qdisc_enqueue(segs, sch);
+			rc = qdisc_enqueue(segs, sch, to_free);
 			if (rc != NET_XMIT_SUCCESS) {
 				if (net_xmit_drop_count(rc))
 					qdisc_qstats_drop(sch);
@@ -615,8 +617,11 @@ deliver:
 #endif
 
 			if (q->qdisc) {
-				int err = qdisc_enqueue(skb, q->qdisc);
+				struct sk_buff *to_free = NULL;
+				int err;
 
+				err = qdisc_enqueue(skb, q->qdisc, &to_free);
+				kfree_skb_list(to_free);
 				if (unlikely(err != NET_XMIT_SUCCESS)) {
 					if (net_xmit_drop_count(err)) {
 						qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 912a46a5d02e..a570b0bb254c 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -134,7 +134,8 @@ static bool drop_early(struct Qdisc *sch, u32 packet_size)
 	return false;
 }
 
-static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
 	struct pie_sched_data *q = qdisc_priv(sch);
 	bool enqueue = false;
@@ -166,7 +167,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 out:
 	q->stats.dropped++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
 static const struct nla_policy pie_policy[TCA_PIE_MAX + 1] = {
diff --git a/net/sched/sch_plug.c b/net/sched/sch_plug.c
index a12cd37680f8..1c6cbab3e7b9 100644
--- a/net/sched/sch_plug.c
+++ b/net/sched/sch_plug.c
@@ -88,7 +88,8 @@ struct plug_sched_data {
 	u32 pkts_to_release;
 };
 
-static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
 {
 	struct plug_sched_data *q = qdisc_priv(sch);
 
@@ -98,7 +99,7 @@ static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		return qdisc_enqueue_tail(skb, sch);
 	}
 
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
 static struct sk_buff *plug_dequeue(struct Qdisc *sch)
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index de492682caee..f4d443aeae54 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -67,7 +67,7 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 }
 
 static int
-prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct Qdisc *qdisc;
 	int ret;
@@ -83,7 +83,7 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 #endif
 
-	ret = qdisc_enqueue(skb, qdisc);
+	ret = qdisc_enqueue(skb, qdisc, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 0427fa8b23f2..f27ffee106f6 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1217,7 +1217,8 @@ static struct qfq_aggregate *qfq_choose_next_agg(struct qfq_sched *q)
 	return agg;
 }
 
-static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct qfq_sched *q = qdisc_priv(sch);
 	struct qfq_class *cl;
@@ -1240,11 +1241,11 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 				     qdisc_pkt_len(skb));
 		if (err) {
 			cl->qstats.drops++;
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		}
 	}
 
-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		pr_debug("qfq_enqueue: enqueue failed %d\n", err);
 		if (net_xmit_drop_count(err)) {
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index a0d57530335e..249b2a18acbd 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -56,7 +56,8 @@ static inline int red_use_harddrop(struct red_sched_data *q)
 	return q->flags & TC_RED_HARDDROP;
 }
 
-static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct red_sched_data *q = qdisc_priv(sch);
 	struct Qdisc *child = q->qdisc;
@@ -95,7 +96,7 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		break;
 	}
 
-	ret = qdisc_enqueue(skb, child);
+	ret = qdisc_enqueue(skb, child, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
@@ -106,7 +107,7 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return ret;
 
 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 }
 
diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index c69611640fa5..add3cc7d37ec 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -275,7 +275,8 @@ static bool sfb_classify(struct sk_buff *skb, struct tcf_proto *fl,
 	return false;
 }
 
-static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 
 	struct sfb_sched_data *q = qdisc_priv(sch);
@@ -397,7 +398,7 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 
 enqueue:
-	ret = qdisc_enqueue(skb, child);
+	ret = qdisc_enqueue(skb, child, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		sch->q.qlen++;
 		increment_qlen(skb, q);
@@ -408,7 +409,7 @@ enqueue:
 	return ret;
 
 drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 other_drop:
 	if (ret & __NET_XMIT_BYPASS)
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 57d118b41cad..7f195ed4d568 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -343,7 +343,7 @@ static int sfq_headdrop(const struct sfq_sched_data *q)
 }
 
 static int
-sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	unsigned int hash, dropped;
@@ -367,7 +367,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (x == SFQ_EMPTY_SLOT) {
 		x = q->dep[0].next; /* get a free slot */
 		if (x >= SFQ_MAX_FLOWS)
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		q->ht[hash] = x;
 		slot = &q->slots[x];
 		slot->hash = hash;
@@ -424,14 +424,14 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (slot->qlen >= q->maxdepth) {
 congestion_drop:
 		if (!sfq_headdrop(q))
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 
 		/* We know we have at least one packet in queue */
 		head = slot_dequeue_head(slot);
 		delta = qdisc_pkt_len(head) - qdisc_pkt_len(skb);
 		sch->qstats.backlog -= delta;
 		slot->backlog -= delta;
-		qdisc_drop(head, sch);
+		qdisc_drop(head, sch, to_free);
 
 		slot_queue_add(slot, skb);
 		return NET_XMIT_CN;
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index c12df84d1078..303355c449ab 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -155,7 +155,8 @@ static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
 /* GSO packet is too big, segment it so that tbf can transmit
  * each segment in time
  */
-static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
+static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *segs, *nskb;
@@ -166,7 +167,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 
 	if (IS_ERR_OR_NULL(segs))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);
 
 	nb = 0;
 	while (segs) {
@@ -174,7 +175,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 		segs->next = NULL;
 		qdisc_skb_cb(segs)->pkt_len = segs->len;
 		len += segs->len;
-		ret = qdisc_enqueue(segs, q->qdisc);
+		ret = qdisc_enqueue(segs, q->qdisc, to_free);
 		if (ret != NET_XMIT_SUCCESS) {
 			if (net_xmit_drop_count(ret))
 				qdisc_qstats_drop(sch);
@@ -190,17 +191,18 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 	return nb > 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
 }
 
-static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	int ret;
 
 	if (qdisc_pkt_len(skb) > q->max_size) {
 		if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size)
-			return tbf_segment(skb, sch);
-		return qdisc_drop(skb, sch);
+			return tbf_segment(skb, sch, to_free);
+		return qdisc_drop(skb, sch, to_free);
 	}
-	ret = qdisc_enqueue(skb, q->qdisc);
+	ret = qdisc_enqueue(skb, q->qdisc, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret))
 			qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index e02687185a59..2cd9b4478b92 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -77,7 +77,7 @@ struct teql_sched_data {
 /* "teql*" qdisc routines */
 
 static int
-teql_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+teql_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	struct teql_sched_data *q = qdisc_priv(sch);
@@ -87,7 +87,7 @@ teql_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		return NET_XMIT_SUCCESS;
 	}
 
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }
 
 static struct sk_buff *
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released
  2016-06-22  6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
@ 2016-06-22 15:14   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet, brouer

On Tue, 21 Jun 2016 23:16:49 -0700
Eric Dumazet <edumazet@google.com> wrote:

> Qdisc performance suffers when packets are dropped at enqueue()
> time because drops (kfree_skb()) are done while qdisc lock is held,
> delaying a dequeue() draining the queue.
> 
> Nominal throughput can be reduced by 50 % when this happens,
> at a time we would like the dequeue() to proceed as fast as possible.
> 
> Even FQ is vulnerable to this problem, while one of FQ goals was
> to provide some flow isolation.
> 
> This patch adds a 'struct sk_buff **to_free' parameter to all
> qdisc->enqueue(), and in qdisc_drop() helper.
> 
> I measured a performance increase of up to 12 %, but this patch
> is a prereq so that future batches in enqueue() can fly.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
[...]
> +/* Instead of calling kfree_skb() while root qdisc lock is held,
> + * queue the skb for future freeing at end of __dev_xmit_skb()
> + */
> +static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
> +{
> +	skb->next = *to_free;
> +	*to_free = skb;
> +}
> +
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d40593b3b9fb..aba10d2a8bc3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3070,6 +3070,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  				 struct netdev_queue *txq)
>  {
>  	spinlock_t *root_lock = qdisc_lock(q);
> +	struct sk_buff *to_free = NULL;
>  	bool contended;
>  	int rc;
>  
> @@ -3086,7 +3087,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> -		kfree_skb(skb);
> +		__qdisc_drop(skb, &to_free);
>  		rc = NET_XMIT_DROP;
>  	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>  		   qdisc_run_begin(q)) {
> @@ -3109,7 +3110,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  
>  		rc = NET_XMIT_SUCCESS;
>  	} else {
> -		rc = q->enqueue(skb, q) & NET_XMIT_MASK;
> +		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>  		if (qdisc_run_begin(q)) {
>  			if (unlikely(contended)) {
>  				spin_unlock(&q->busylock);
> @@ -3119,6 +3120,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		}
>  	}
>  	spin_unlock(root_lock);
> +	if (unlikely(to_free))
> +		kfree_skb_list(to_free);

Great, now there is a good argument for implementing kmem_cache bulk
freeing inside kfree_skb_list().  I did a ugly PoC implementation
once, but there was no use-case that really needed the performance
boost.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
@ 2016-06-22  6:16 ` Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps Eric Dumazet
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22  6:16 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet

Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/codel_qdisc.h | 1 +
 net/sched/sch_fq_codel.c  | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/net/codel_qdisc.h b/include/net/codel_qdisc.h
index 8144d9cd2908..098630f83a55 100644
--- a/include/net/codel_qdisc.h
+++ b/include/net/codel_qdisc.h
@@ -52,6 +52,7 @@
 /* Qdiscs using codel plugin must use codel_skb_cb in their own cb[] */
 struct codel_skb_cb {
 	codel_time_t enqueue_time;
+	unsigned int mem_usage;
 };
 
 static struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index f715195459c9..a5ea0e9b6be4 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -172,7 +172,7 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 	do {
 		skb = dequeue_head(flow);
 		len += qdisc_pkt_len(skb);
-		mem += skb->truesize;
+		mem += get_codel_cb(skb)->mem_usage;
 		__qdisc_drop(skb, to_free);
 	} while (++i < max_packets && len < threshold);
 
@@ -216,7 +216,8 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		flow->deficit = q->quantum;
 		flow->dropped = 0;
 	}
-	q->memory_usage += skb->truesize;
+	get_codel_cb(skb)->mem_usage = skb->truesize;
+	q->memory_usage += get_codel_cb(skb)->mem_usage;
 	memory_limited = q->memory_usage > q->memory_limit;
 	if (++sch->q.qlen <= sch->limit && !memory_limited)
 		return NET_XMIT_SUCCESS;
@@ -267,7 +268,7 @@ static struct sk_buff *dequeue_func(struct codel_vars *vars, void *ctx)
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
-		q->memory_usage -= skb->truesize;
+		q->memory_usage -= get_codel_cb(skb)->mem_usage;
 		sch->q.qlen--;
 		sch->qstats.backlog -= qdisc_pkt_len(skb);
 	}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb Eric Dumazet
@ 2016-06-22  6:16 ` Eric Dumazet
  2016-06-22  6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22  6:16 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet

We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.

Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.

Tested:

$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
 Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
 rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
 lended: 9021770 borrowed: 0 giants: 0
 tokens: 9 ctokens: 9

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_htb.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index f3882259c385..ba098f2654b4 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -117,7 +117,6 @@ struct htb_class {
 	 * Written often fields
 	 */
 	struct gnet_stats_basic_packed bstats;
-	struct gnet_stats_queue	qstats;
 	struct tc_htb_xstats	xstats;	/* our special stats */
 
 	/* token bucket parameters */
@@ -140,6 +139,8 @@ struct htb_class {
 	enum htb_cmode		cmode;		/* current mode of the class */
 	struct rb_node		pq_node;	/* node for event queue */
 	struct rb_node		node[TC_HTB_NUMPRIO];	/* node for self or feed tree */
+
+	unsigned int drops ____cacheline_aligned_in_smp;
 };
 
 struct htb_level {
@@ -595,7 +596,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
-			cl->qstats.drops++;
+			cl->drops++;
 		}
 		return ret;
 	} else {
@@ -1110,17 +1111,22 @@ static int
 htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
 {
 	struct htb_class *cl = (struct htb_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+	};
 	__u32 qlen = 0;
 
-	if (!cl->level && cl->un.leaf.q)
+	if (!cl->level && cl->un.leaf.q) {
 		qlen = cl->un.leaf.q->q.qlen;
+		qs.backlog = cl->un.leaf.q->qstats.backlog;
+	}
 	cl->xstats.tokens = PSCHED_NS2TICKS(cl->tokens);
 	cl->xstats.ctokens = PSCHED_NS2TICKS(cl->ctokens);
 
 	if (gnet_stats_copy_basic(qdisc_root_sleeping_running(sch),
 				  d, NULL, &cl->bstats) < 0 ||
 	    gnet_stats_copy_rate_est(d, NULL, &cl->rate_est) < 0 ||
-	    gnet_stats_copy_queue(d, NULL, &cl->qstats, qlen) < 0)
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
 		return -1;
 
 	return gnet_stats_copy_app(d, &cl->xstats, sizeof(cl->xstats));
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH net-next 4/4] net_sched: generalize bulk dequeue
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
                   ` (2 preceding siblings ...)
  2016-06-22  6:16 ` [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps Eric Dumazet
@ 2016-06-22  6:16 ` Eric Dumazet
  2016-06-22 15:03   ` Jesper Dangaard Brouer
  2016-06-23  7:26   ` Paolo Abeni
  2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
  2016-06-25 16:20 ` David Miller
  5 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22  6:16 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann

When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a3463b "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.

With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).

For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...

Idea is to bulk-dequeue packets mapping to the same transmit queue.

This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :

1) NUMA node contention :   610,000 pps -> 1,110,000 pps
2) No node contention   : 1,380,000 pps -> 1,930,000 pps

Now we should work to add batches on the enqueue() side ;)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/net/sch_generic.h |  7 ++---
 net/sched/sch_generic.c   | 68 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 04e84c07c94f..909aff2db2b3 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -75,13 +75,14 @@ struct Qdisc {
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
-	struct Qdisc		*next_sched ____cacheline_aligned_in_smp;
-	struct sk_buff		*gso_skb;
-	unsigned long		state;
+	struct sk_buff		*gso_skb ____cacheline_aligned_in_smp;
 	struct sk_buff_head	q;
 	struct gnet_stats_basic_packed bstats;
 	seqcount_t		running;
 	struct gnet_stats_queue	qstats;
+	unsigned long		state;
+	struct Qdisc            *next_sched;
+	struct sk_buff		*skb_bad_txq;
 	struct rcu_head		rcu_head;
 	int			padded;
 	atomic_t		refcnt;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff86606954f2..e95b67cd5718 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -77,6 +77,34 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 	skb->next = NULL;
 }
 
+/* This variant of try_bulk_dequeue_skb() makes sure
+ * all skbs in the chain are for the same txq
+ */
+static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
+				      struct sk_buff *skb,
+				      int *packets)
+{
+	int mapping = skb_get_queue_mapping(skb);
+	struct sk_buff *nskb;
+	int cnt = 0;
+
+	do {
+		nskb = q->dequeue(q);
+		if (!nskb)
+			break;
+		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
+			q->skb_bad_txq = nskb;
+			qdisc_qstats_backlog_inc(q, nskb);
+			q->q.qlen++;
+			break;
+		}
+		skb->next = nskb;
+		skb = nskb;
+	} while (++cnt < 8);
+	(*packets) += cnt;
+	skb->next = NULL;
+}
+
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
@@ -87,8 +115,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	const struct netdev_queue *txq = q->dev_queue;
 
 	*packets = 1;
-	*validate = true;
 	if (unlikely(skb)) {
+		/* skb in gso_skb were already validated */
+		*validate = false;
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
@@ -97,15 +126,30 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			q->q.qlen--;
 		} else
 			skb = NULL;
-		/* skb in gso_skb were already validated */
-		*validate = false;
-	} else {
-		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
-		    !netif_xmit_frozen_or_stopped(txq)) {
-			skb = q->dequeue(q);
-			if (skb && qdisc_may_bulk(q))
-				try_bulk_dequeue_skb(q, skb, txq, packets);
+		return skb;
+	}
+	*validate = true;
+	skb = q->skb_bad_txq;
+	if (unlikely(skb)) {
+		/* check the reason of requeuing without tx lock first */
+		txq = skb_get_tx_queue(txq->dev, skb);
+		if (!netif_xmit_frozen_or_stopped(txq)) {
+			q->skb_bad_txq = NULL;
+			qdisc_qstats_backlog_dec(q, skb);
+			q->q.qlen--;
+			goto bulk;
 		}
+		return NULL;
+	}
+	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+	    !netif_xmit_frozen_or_stopped(txq))
+		skb = q->dequeue(q);
+	if (skb) {
+bulk:
+		if (qdisc_may_bulk(q))
+			try_bulk_dequeue_skb(q, skb, txq, packets);
+		else
+			try_bulk_dequeue_skb_slow(q, skb, packets);
 	}
 	return skb;
 }
@@ -624,11 +668,14 @@ void qdisc_reset(struct Qdisc *qdisc)
 	if (ops->reset)
 		ops->reset(qdisc);
 
+	kfree_skb(qdisc->skb_bad_txq);
+	qdisc->skb_bad_txq = NULL;
+
 	if (qdisc->gso_skb) {
 		kfree_skb_list(qdisc->gso_skb);
 		qdisc->gso_skb = NULL;
-		qdisc->q.qlen = 0;
 	}
+	qdisc->q.qlen = 0;
 }
 EXPORT_SYMBOL(qdisc_reset);
 
@@ -667,6 +714,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	dev_put(qdisc_dev(qdisc));
 
 	kfree_skb_list(qdisc->gso_skb);
+	kfree_skb(qdisc->skb_bad_txq);
 	/*
 	 * gen_estimator est_timer() might access qdisc->q.lock,
 	 * wait a RCU grace period before freeing qdisc.
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 4/4] net_sched: generalize bulk dequeue
  2016-06-22  6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
@ 2016-06-22 15:03   ` Jesper Dangaard Brouer
  2016-06-23  7:26   ` Paolo Abeni
  1 sibling, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann, brouer

On Tue, 21 Jun 2016 23:16:52 -0700
Eric Dumazet <edumazet@google.com> wrote:

> When qdisc bulk dequeue was added in linux-3.18 (commit
> 5772e9a3463b "qdisc: bulk dequeue support for qdiscs
> with TCQ_F_ONETXQUEUE"), it was constrained to some
> specific qdiscs.

Thanks for extending this!

> With some extra care, we can extend this to all qdiscs,
> so that typical traffic shaping solutions can benefit from
> small batches (8 packets in this patch).

I'm fine with limiting this to 8 packets (xmit_more), as that seem to
be the minimum needed TX batch size (according to Luigi's original
netmap article[1] figure 7) for doing 10G wirespeed.

[1] http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf
 

> For example, HTB is often used on some multi queue device.
> And bonding/team are multi queue devices...
> 
> Idea is to bulk-dequeue packets mapping to the same transmit queue.
> 
> This brings between 35 and 80 % performance increase in HTB setup
> under pressure on a bonding setup :
> 
> 1) NUMA node contention :   610,000 pps -> 1,110,000 pps
> 2) No node contention   : 1,380,000 pps -> 1,930,000 pps
> 
> Now we should work to add batches on the enqueue() side ;)

Yes, please! :-))) That will be the next big step!


> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

This is great stuff!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 4/4] net_sched: generalize bulk dequeue
  2016-06-22  6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
  2016-06-22 15:03   ` Jesper Dangaard Brouer
@ 2016-06-23  7:26   ` Paolo Abeni
  1 sibling, 0 replies; 15+ messages in thread
From: Paolo Abeni @ 2016-06-23  7:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, John Fastabend, Jesper Dangaard Brouer,
	Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
	Daniel Borkmann

On Tue, 2016-06-21 at 23:16 -0700, Eric Dumazet wrote:
> When qdisc bulk dequeue was added in linux-3.18 (commit
> 5772e9a3463b "qdisc: bulk dequeue support for qdiscs
> with TCQ_F_ONETXQUEUE"), it was constrained to some
> specific qdiscs.
> 
> With some extra care, we can extend this to all qdiscs,
> so that typical traffic shaping solutions can benefit from
> small batches (8 packets in this patch).
> 
> For example, HTB is often used on some multi queue device.
> And bonding/team are multi queue devices...
> 
> Idea is to bulk-dequeue packets mapping to the same transmit queue.
> 
> This brings between 35 and 80 % performance increase in HTB setup
> under pressure on a bonding setup :
> 
> 1) NUMA node contention :   610,000 pps -> 1,110,000 pps
> 2) No node contention   : 1,380,000 pps -> 1,930,000 pps
> 
> Now we should work to add batches on the enqueue() side ;)
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/net/sch_generic.h |  7 ++---
>  net/sched/sch_generic.c   | 68 ++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 62 insertions(+), 13 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 04e84c07c94f..909aff2db2b3 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -75,13 +75,14 @@ struct Qdisc {
>  	/*
>  	 * For performance sake on SMP, we put highly modified fields at the end
>  	 */
> -	struct Qdisc		*next_sched ____cacheline_aligned_in_smp;
> -	struct sk_buff		*gso_skb;
> -	unsigned long		state;
> +	struct sk_buff		*gso_skb ____cacheline_aligned_in_smp;
>  	struct sk_buff_head	q;
>  	struct gnet_stats_basic_packed bstats;
>  	seqcount_t		running;
>  	struct gnet_stats_queue	qstats;
> +	unsigned long		state;
> +	struct Qdisc            *next_sched;
> +	struct sk_buff		*skb_bad_txq;
>  	struct rcu_head		rcu_head;
>  	int			padded;
>  	atomic_t		refcnt;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index ff86606954f2..e95b67cd5718 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -77,6 +77,34 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>  	skb->next = NULL;
>  }
>  
> +/* This variant of try_bulk_dequeue_skb() makes sure
> + * all skbs in the chain are for the same txq
> + */
> +static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
> +				      struct sk_buff *skb,
> +				      int *packets)
> +{
> +	int mapping = skb_get_queue_mapping(skb);
> +	struct sk_buff *nskb;
> +	int cnt = 0;
> +
> +	do {
> +		nskb = q->dequeue(q);
> +		if (!nskb)
> +			break;
> +		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
> +			q->skb_bad_txq = nskb;
> +			qdisc_qstats_backlog_inc(q, nskb);
> +			q->q.qlen++;
> +			break;
> +		}
> +		skb->next = nskb;
> +		skb = nskb;
> +	} while (++cnt < 8);
> +	(*packets) += cnt;
> +	skb->next = NULL;
> +}
> +
>  /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
>   * A requeued skb (via q->gso_skb) can also be a SKB list.
>   */
> @@ -87,8 +115,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>  	const struct netdev_queue *txq = q->dev_queue;
>  
>  	*packets = 1;
> -	*validate = true;
>  	if (unlikely(skb)) {
> +		/* skb in gso_skb were already validated */
> +		*validate = false;
>  		/* check the reason of requeuing without tx lock first */
>  		txq = skb_get_tx_queue(txq->dev, skb);
>  		if (!netif_xmit_frozen_or_stopped(txq)) {
> @@ -97,15 +126,30 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>  			q->q.qlen--;
>  		} else
>  			skb = NULL;
> -		/* skb in gso_skb were already validated */
> -		*validate = false;
> -	} else {
> -		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
> -		    !netif_xmit_frozen_or_stopped(txq)) {
> -			skb = q->dequeue(q);
> -			if (skb && qdisc_may_bulk(q))
> -				try_bulk_dequeue_skb(q, skb, txq, packets);
> +		return skb;
> +	}
> +	*validate = true;
> +	skb = q->skb_bad_txq;
> +	if (unlikely(skb)) {
> +		/* check the reason of requeuing without tx lock first */
> +		txq = skb_get_tx_queue(txq->dev, skb);
> +		if (!netif_xmit_frozen_or_stopped(txq)) {
> +			q->skb_bad_txq = NULL;
> +			qdisc_qstats_backlog_dec(q, skb);
> +			q->q.qlen--;
> +			goto bulk;
>  		}
> +		return NULL;
> +	}
> +	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
               
You can use qdisc_may_bulk() here, I guess. Not a functional change,
just to improve readability.

Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
                   ` (3 preceding siblings ...)
  2016-06-22  6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
@ 2016-06-22 14:47 ` Jesper Dangaard Brouer
  2016-06-22 14:55   ` Eric Dumazet
  2016-06-25 16:20 ` David Miller
  5 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 14:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet, brouer

On Tue, 21 Jun 2016 23:16:48 -0700
Eric Dumazet <edumazet@google.com> wrote:

> First patch adds an additional parameter to ->enqueue() qdisc method
> so that drops can be done outside of critical section
> (after locks are released).
> 
> Then fq_codel can have a small optimization to reduce number of cache
> lines misses during a drop event
> (possibly accumulating hundreds of packets to be freed).
> 
> A small htb change exports the backlog in class dumps.
> 
> Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> 
> This series brings a nice qdisc performance increase (more than 80 %
> in some cases).

Thanks for working on this Eric! this is great work! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
@ 2016-06-22 14:55   ` Eric Dumazet
  2016-06-22 15:44     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 14:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend

On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 21 Jun 2016 23:16:48 -0700
> Eric Dumazet <edumazet@google.com> wrote:
> 
> > First patch adds an additional parameter to ->enqueue() qdisc method
> > so that drops can be done outside of critical section
> > (after locks are released).
> > 
> > Then fq_codel can have a small optimization to reduce number of cache
> > lines misses during a drop event
> > (possibly accumulating hundreds of packets to be freed).
> > 
> > A small htb change exports the backlog in class dumps.
> > 
> > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > 
> > This series brings a nice qdisc performance increase (more than 80 %
> > in some cases).
> 
> Thanks for working on this Eric! this is great work! :-)

Thanks Jesper

I worked yesterday on bulk enqueues, but initial results are not that
great.

Here is my current patch, on top of my last series :

(A second patch would remove the busylock spinlock, of course)

 include/net/sch_generic.h |    9 ++
 net/core/dev.c            |  135 ++++++++++++++++++++++++++++--------
 2 files changed, 115 insertions(+), 29 deletions(-)


diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 909aff2db2b3..1975a6fab10f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -87,7 +87,14 @@ struct Qdisc {
 	int			padded;
 	atomic_t		refcnt;
 
-	spinlock_t		busylock ____cacheline_aligned_in_smp;
+	spinlock_t		busylock;
+
+#ifdef CONFIG_SMP
+	struct pcpu_skb_context *prequeue0 ____cacheline_aligned_in_smp;
+#ifdef CONFIG_NUMA
+	struct pcpu_skb_context *prequeue1 ____cacheline_aligned_in_smp;
+#endif
+#endif
 };
 
 static inline bool qdisc_is_running(const struct Qdisc *qdisc)
diff --git a/net/core/dev.c b/net/core/dev.c
index aba10d2a8bc3..5f0d3fe5b109 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3065,26 +3065,117 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
 	}
 }
 
-static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
-				 struct net_device *dev,
-				 struct netdev_queue *txq)
+
+#ifdef CONFIG_SMP
+
+struct pcpu_skb_context {
+	struct pcpu_skb_context		*next;
+	union {
+		struct sk_buff		*skb;
+		unsigned long		status;
+		struct pcpu_skb_context	*self;
+	};
+};
+static DEFINE_PER_CPU_ALIGNED(struct pcpu_skb_context, pcpu_skb_context);
+
+/* Provides a small queue before qdisc so that we can batch ->enqueue()
+ * under SMP stress.
+ */
+static noinline int dev_xmit_skb_slow(struct sk_buff *skb, struct Qdisc *q,
+				      struct pcpu_skb_context **prequeue)
+{
+	struct pcpu_skb_context *prev, *myptr;
+	struct sk_buff *to_free = NULL;
+	spinlock_t *root_lock;
+	void *status;
+	int i, rc;
+
+	myptr = this_cpu_ptr(&pcpu_skb_context);
+	myptr->skb = skb;
+	myptr->next = NULL;
+
+	/* Take our ticket in prequeue file, a la MCS lock */
+	prev = xchg(prequeue, myptr);
+	if (prev) {
+		/* Ok, another cpu got the lock and serves the prequeue.
+		 * Wait that it either processed our skb or it exhausted
+		 * its budget and told us to process a batch ourself.
+		 */
+		WRITE_ONCE(prev->next, myptr);
+
+		while ((status = READ_ONCE(myptr->skb)) == skb)
+			cpu_relax_lowlatency();
+
+		/* Nice ! Our skb was handled by another cpu */
+		if ((unsigned long)status < NET_XMIT_MASK)
+			return (int)(unsigned long)status;
+
+		/* Oh well, we got the responsability of next batch */
+		BUG_ON(myptr != status);
+	}
+	root_lock = qdisc_lock(q);
+	spin_lock(root_lock);
+
+	for (i = 0; i < 16; i++) {
+		bool may_release = true;
+
+		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+			__qdisc_drop(skb, &to_free);
+			rc = NET_XMIT_DROP;
+		} else {
+			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		}
+		while (!(prev = READ_ONCE(myptr->next))) {
+			if (may_release) {
+				if (cmpxchg_release(prequeue, myptr, NULL) == myptr)
+					break;
+				may_release = false;
+			}
+			cpu_relax_lowlatency();
+		}
+		smp_store_release(&myptr->status, (unsigned long)rc);
+		myptr = prev;
+		if (!myptr)
+			break;
+		skb = READ_ONCE(myptr->skb);
+	}
+
+	qdisc_run(q);
+	spin_unlock(root_lock);
+
+	/* Give control to another cpu for following batch */
+	if (myptr)
+		smp_store_release(&myptr->self, myptr);
+
+	if (unlikely(to_free))
+		kfree_skb_list(to_free);
+
+	return (int)this_cpu_read(pcpu_skb_context.status);
+}
+#endif
+
+static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
+			  struct net_device *dev,
+			  struct netdev_queue *txq)
 {
 	spinlock_t *root_lock = qdisc_lock(q);
 	struct sk_buff *to_free = NULL;
-	bool contended;
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
-	/*
-	 * Heuristic to force contended enqueues to serialize on a
-	 * separate lock before trying to get qdisc main lock.
-	 * This permits qdisc->running owner to get the lock more
-	 * often and dequeue packets faster.
-	 */
-	contended = qdisc_is_running(q);
-	if (unlikely(contended))
-		spin_lock(&q->busylock);
 
+#ifdef CONFIG_SMP
+	{
+	struct pcpu_skb_context **prequeue = &q->prequeue0;
+
+#ifdef CONFIG_NUMA
+	if (numa_node_id() & 1)
+		prequeue = &q->prequeue1;
+#endif
+	if (unlikely(*prequeue || qdisc_is_running(q)))
+		return dev_xmit_skb_slow(skb, q, prequeue);
+	}
+#endif
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
@@ -3099,31 +3190,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 		qdisc_bstats_update(q, skb);
 
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
-		} else
+		else
 			qdisc_run_end(q);
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		if (qdisc_run_begin(q)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
-			__qdisc_run(q);
-		}
+		qdisc_run(q);
 	}
 	spin_unlock(root_lock);
 	if (unlikely(to_free))
 		kfree_skb_list(to_free);
-	if (unlikely(contended))
-		spin_unlock(&q->busylock);
 	return rc;
 }
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22 14:55   ` Eric Dumazet
@ 2016-06-22 15:44     ` Jesper Dangaard Brouer
  2016-06-22 16:49       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend, brouer,
	Luigi Rizzo

On Wed, 22 Jun 2016 07:55:43 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> > On Tue, 21 Jun 2016 23:16:48 -0700
> > Eric Dumazet <edumazet@google.com> wrote:
> >   
> > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > so that drops can be done outside of critical section
> > > (after locks are released).
> > > 
> > > Then fq_codel can have a small optimization to reduce number of cache
> > > lines misses during a drop event
> > > (possibly accumulating hundreds of packets to be freed).
> > > 
> > > A small htb change exports the backlog in class dumps.
> > > 
> > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > > 
> > > This series brings a nice qdisc performance increase (more than 80 %
> > > in some cases).  
> > 
> > Thanks for working on this Eric! this is great work! :-)  
> 
> Thanks Jesper
> 
> I worked yesterday on bulk enqueues, but initial results are not that
> great.

Hi Eric,

This is interesting work! But I think you should read Luigi Rizzo's
(Cc'ed) paper on title "A Fast and Practical Software Packet Scheduling
Architecture"[1]

[1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf

Luigi will be at Netfilter Workshop next week, and will actually
present on topic/paper.... you two should talk ;-)

The article is not a 100% match for what we need, but there is some
good ideas.  The article also have a sort of "prequeue" that
enqueue'ing CPUs will place packets into.

My understanding of the article:

1. transmitters submit packets to an intermediate queue
   (replace q->enqueue call) lockless submit as queue per CPU
   (runs in parallel)

2. like we only have _one_ qdisc dequeue process, this process (called
   arbiter) empty the intermediate queues, and then invoke q->enqueue()
   and q->dequeue(). (in a locked session/region)

3. Packets returned from q->dequeue() is placed on an outgoing
   intermediate queue.

4. the transmitter then looks to see there are any packets to drain()
   from the outgoing queue.  This can run in parallel.

If the transmitter submitting a packet, detect no arbiter is running,
it can become the arbiter itself.  Like we do with qdisc_run_begin()
setting state __QDISC___STATE_RUNNING.

The problem with this scheme is push-back from qdisc->enqueue
(NET_XMIT_CN) does not "reach" us.  And push-back in-form of processes
blocking on qdisc root lock, but that could be handled by either
blocking in article's submit() or returning some congestion return code
from submit(). 


(left patch intact below for Luigi to see)

> Here is my current patch, on top of my last series :
> 
> (A second patch would remove the busylock spinlock, of course)
> 
>  include/net/sch_generic.h |    9 ++
>  net/core/dev.c            |  135 ++++++++++++++++++++++++++++--------
>  2 files changed, 115 insertions(+), 29 deletions(-)
> 
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 909aff2db2b3..1975a6fab10f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -87,7 +87,14 @@ struct Qdisc {
>  	int			padded;
>  	atomic_t		refcnt;
>  
> -	spinlock_t		busylock ____cacheline_aligned_in_smp;
> +	spinlock_t		busylock;
> +
> +#ifdef CONFIG_SMP
> +	struct pcpu_skb_context *prequeue0 ____cacheline_aligned_in_smp;
> +#ifdef CONFIG_NUMA
> +	struct pcpu_skb_context *prequeue1 ____cacheline_aligned_in_smp;
> +#endif
> +#endif
>  };
>  
>  static inline bool qdisc_is_running(const struct Qdisc *qdisc)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index aba10d2a8bc3..5f0d3fe5b109 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3065,26 +3065,117 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
>  	}
>  }
>  
> -static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> -				 struct net_device *dev,
> -				 struct netdev_queue *txq)
> +
> +#ifdef CONFIG_SMP
> +
> +struct pcpu_skb_context {
> +	struct pcpu_skb_context		*next;
> +	union {
> +		struct sk_buff		*skb;
> +		unsigned long		status;
> +		struct pcpu_skb_context	*self;
> +	};
> +};
> +static DEFINE_PER_CPU_ALIGNED(struct pcpu_skb_context, pcpu_skb_context);
> +
> +/* Provides a small queue before qdisc so that we can batch ->enqueue()
> + * under SMP stress.
> + */
> +static noinline int dev_xmit_skb_slow(struct sk_buff *skb, struct Qdisc *q,
> +				      struct pcpu_skb_context **prequeue)
> +{
> +	struct pcpu_skb_context *prev, *myptr;
> +	struct sk_buff *to_free = NULL;
> +	spinlock_t *root_lock;
> +	void *status;
> +	int i, rc;
> +
> +	myptr = this_cpu_ptr(&pcpu_skb_context);
> +	myptr->skb = skb;
> +	myptr->next = NULL;
> +
> +	/* Take our ticket in prequeue file, a la MCS lock */
> +	prev = xchg(prequeue, myptr);
> +	if (prev) {
> +		/* Ok, another cpu got the lock and serves the prequeue.
> +		 * Wait that it either processed our skb or it exhausted
> +		 * its budget and told us to process a batch ourself.
> +		 */
> +		WRITE_ONCE(prev->next, myptr);
> +
> +		while ((status = READ_ONCE(myptr->skb)) == skb)
> +			cpu_relax_lowlatency();
> +
> +		/* Nice ! Our skb was handled by another cpu */
> +		if ((unsigned long)status < NET_XMIT_MASK)
> +			return (int)(unsigned long)status;
> +
> +		/* Oh well, we got the responsability of next batch */
> +		BUG_ON(myptr != status);
> +	}
> +	root_lock = qdisc_lock(q);
> +	spin_lock(root_lock);
> +
> +	for (i = 0; i < 16; i++) {
> +		bool may_release = true;
> +
> +		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> +			__qdisc_drop(skb, &to_free);
> +			rc = NET_XMIT_DROP;
> +		} else {
> +			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +		}
> +		while (!(prev = READ_ONCE(myptr->next))) {
> +			if (may_release) {
> +				if (cmpxchg_release(prequeue, myptr, NULL) == myptr)
> +					break;
> +				may_release = false;
> +			}
> +			cpu_relax_lowlatency();
> +		}
> +		smp_store_release(&myptr->status, (unsigned long)rc);
> +		myptr = prev;
> +		if (!myptr)
> +			break;
> +		skb = READ_ONCE(myptr->skb);
> +	}
> +
> +	qdisc_run(q);
> +	spin_unlock(root_lock);
> +
> +	/* Give control to another cpu for following batch */
> +	if (myptr)
> +		smp_store_release(&myptr->self, myptr);
> +
> +	if (unlikely(to_free))
> +		kfree_skb_list(to_free);
> +
> +	return (int)this_cpu_read(pcpu_skb_context.status);
> +}
> +#endif
> +
> +static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> +			  struct net_device *dev,
> +			  struct netdev_queue *txq)
>  {
>  	spinlock_t *root_lock = qdisc_lock(q);
>  	struct sk_buff *to_free = NULL;
> -	bool contended;
>  	int rc;
>  
>  	qdisc_calculate_pkt_len(skb, q);
> -	/*
> -	 * Heuristic to force contended enqueues to serialize on a
> -	 * separate lock before trying to get qdisc main lock.
> -	 * This permits qdisc->running owner to get the lock more
> -	 * often and dequeue packets faster.
> -	 */
> -	contended = qdisc_is_running(q);
> -	if (unlikely(contended))
> -		spin_lock(&q->busylock);
>  
> +#ifdef CONFIG_SMP
> +	{
> +	struct pcpu_skb_context **prequeue = &q->prequeue0;
> +
> +#ifdef CONFIG_NUMA
> +	if (numa_node_id() & 1)
> +		prequeue = &q->prequeue1;
> +#endif
> +	if (unlikely(*prequeue || qdisc_is_running(q)))
> +		return dev_xmit_skb_slow(skb, q, prequeue);
> +	}
> +#endif
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>  		__qdisc_drop(skb, &to_free);
> @@ -3099,31 +3190,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  
>  		qdisc_bstats_update(q, skb);
>  
> -		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> -			if (unlikely(contended)) {
> -				spin_unlock(&q->busylock);
> -				contended = false;
> -			}
> +		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>  			__qdisc_run(q);
> -		} else
> +		else
>  			qdisc_run_end(q);
>  
>  		rc = NET_XMIT_SUCCESS;
>  	} else {
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -		if (qdisc_run_begin(q)) {
> -			if (unlikely(contended)) {
> -				spin_unlock(&q->busylock);
> -				contended = false;
> -			}
> -			__qdisc_run(q);
> -		}
> +		qdisc_run(q);
>  	}
>  	spin_unlock(root_lock);
>  	if (unlikely(to_free))
>  		kfree_skb_list(to_free);
> -	if (unlikely(contended))
> -		spin_unlock(&q->busylock);
>  	return rc;
>  }
>  
> 
> 
> 
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22 15:44     ` Jesper Dangaard Brouer
@ 2016-06-22 16:49       ` Eric Dumazet
  2016-06-23 14:22         ` Jesper Dangaard Brouer
  2016-06-23 16:21         ` Luigi Rizzo
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 16:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend,
	Luigi Rizzo

On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 22 Jun 2016 07:55:43 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 21 Jun 2016 23:16:48 -0700
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >   
> > > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > > so that drops can be done outside of critical section
> > > > (after locks are released).
> > > > 
> > > > Then fq_codel can have a small optimization to reduce number of cache
> > > > lines misses during a drop event
> > > > (possibly accumulating hundreds of packets to be freed).
> > > > 
> > > > A small htb change exports the backlog in class dumps.
> > > > 
> > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > > > 
> > > > This series brings a nice qdisc performance increase (more than 80 %
> > > > in some cases).  
> > > 
> > > Thanks for working on this Eric! this is great work! :-)  
> > 
> > Thanks Jesper
> > 
> > I worked yesterday on bulk enqueues, but initial results are not that
> > great.
> 
> Hi Eric,
> 
> This is interesting work! But I think you should read Luigi Rizzo's
> (Cc'ed) paper on title "A Fast and Practical Software Packet Scheduling
> Architecture"[1]
> 
> [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
> 
> Luigi will be at Netfilter Workshop next week, and will actually
> present on topic/paper.... you two should talk ;-)
> 
> The article is not a 100% match for what we need, but there is some
> good ideas.  The article also have a sort of "prequeue" that
> enqueue'ing CPUs will place packets into.
> 
> My understanding of the article:
> 
> 1. transmitters submit packets to an intermediate queue
>    (replace q->enqueue call) lockless submit as queue per CPU
>    (runs in parallel)
> 
> 2. like we only have _one_ qdisc dequeue process, this process (called
>    arbiter) empty the intermediate queues, and then invoke q->enqueue()
>    and q->dequeue(). (in a locked session/region)
> 
> 3. Packets returned from q->dequeue() is placed on an outgoing
>    intermediate queue.
> 
> 4. the transmitter then looks to see there are any packets to drain()
>    from the outgoing queue.  This can run in parallel.
> 
> If the transmitter submitting a packet, detect no arbiter is running,
> it can become the arbiter itself.  Like we do with qdisc_run_begin()
> setting state __QDISC___STATE_RUNNING.
> 
> The problem with this scheme is push-back from qdisc->enqueue
> (NET_XMIT_CN) does not "reach" us.  And push-back in-form of processes
> blocking on qdisc root lock, but that could be handled by either
> blocking in article's submit() or returning some congestion return code
> from submit(). 

Okay, I see that you prepare upcoming conference in Amsterdam,
but please keep this thread about existing kernel code, not the one that
eventually reach a new operating system in 5 years ;)

1) We _want_ the result of the sends, obviously.

2) We also want back pressure, without adding complex callbacks and
ref-counting.

3) We do not want to burn a cpu per TX queue (at least one per NUMA
node ???) only to send few packets per second,
Our model is still interrupt based, plus NAPI for interrupt mitigation.

4) I do not want to lock an innocent cpu to send packets from other
threads/cpu without a tight control.

In the patch I sent, I basically replaced a locked operation
(spin_lock(&q->busylock)) with another one (xchg()) , but I did not add
yet another queue before the qdisc ones, bufferbloat forbids.

The virtual queue here is one packet per cpu, which basically is the
same than before this patch, since each cpu spinning on busylock has one
skb to send anyway.

This is basically a simple extension of MCS locks, where the cpu at the
head of the queue can queue up to 16 packets, instead of queueing its
own packet only and give queue owner ship to the following cpu.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22 16:49       ` Eric Dumazet
@ 2016-06-23 14:22         ` Jesper Dangaard Brouer
  2016-06-23 16:21         ` Luigi Rizzo
  1 sibling, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-23 14:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend,
	Luigi Rizzo, brouer

On Wed, 22 Jun 2016 09:49:48 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 22 Jun 2016 07:55:43 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> > > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:  
> > > > On Tue, 21 Jun 2016 23:16:48 -0700
> > > > Eric Dumazet <edumazet@google.com> wrote:
> > > >     
> > > > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > > > so that drops can be done outside of critical section
> > > > > (after locks are released).
> > > > > 
> > > > > Then fq_codel can have a small optimization to reduce number of cache
> > > > > lines misses during a drop event
> > > > > (possibly accumulating hundreds of packets to be freed).
> > > > > 
> > > > > A small htb change exports the backlog in class dumps.
> > > > > 
> > > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > > > > 
> > > > > This series brings a nice qdisc performance increase (more than 80 %
> > > > > in some cases).    
> > > > 
> > > > Thanks for working on this Eric! this is great work! :-)    
> > > 
> > > Thanks Jesper
> > > 
> > > I worked yesterday on bulk enqueues, but initial results are not that
> > > great.  
> > 
> > Hi Eric,
> > 
> > This is interesting work! But I think you should read Luigi Rizzo's
> > (Cc'ed) paper on title "A Fast and Practical Software Packet Scheduling
> > Architecture"[1]
> > 
> > [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
> > 
> > Luigi will be at Netfilter Workshop next week, and will actually
> > present on topic/paper.... you two should talk ;-)
> > 
> > The article is not a 100% match for what we need, but there is some
> > good ideas.  The article also have a sort of "prequeue" that
> > enqueue'ing CPUs will place packets into.
> > 
> > My understanding of the article:
> > 
> > 1. transmitters submit packets to an intermediate queue
> >    (replace q->enqueue call) lockless submit as queue per CPU
> >    (runs in parallel)
> > 
> > 2. like we only have _one_ qdisc dequeue process, this process (called
> >    arbiter) empty the intermediate queues, and then invoke q->enqueue()
> >    and q->dequeue(). (in a locked session/region)
> > 
> > 3. Packets returned from q->dequeue() is placed on an outgoing
> >    intermediate queue.
> > 
> > 4. the transmitter then looks to see there are any packets to drain()
> >    from the outgoing queue.  This can run in parallel.
> > 
> > If the transmitter submitting a packet, detect no arbiter is running,
> > it can become the arbiter itself.  Like we do with qdisc_run_begin()
> > setting state __QDISC___STATE_RUNNING.
> > 
> > The problem with this scheme is push-back from qdisc->enqueue
> > (NET_XMIT_CN) does not "reach" us.  And push-back in-form of processes
> > blocking on qdisc root lock, but that could be handled by either
> > blocking in article's submit() or returning some congestion return code
> > from submit().   
> 
> Okay, I see that you prepare upcoming conference in Amsterdam,
> but please keep this thread about existing kernel code, not the one that
> eventually reach a new operating system in 5 years ;)
> 
> 1) We _want_ the result of the sends, obviously.

How dependent are we on the return codes?

E.g. the NET_XMIT_CN return is not that accurate, it does not mean this
packet was dropped, it could be from an unrelated flow.


> 2) We also want back pressure, without adding complex callbacks and
> ref-counting.
> 
> 3) We do not want to burn a cpu per TX queue (at least one per NUMA
> node ???) only to send few packets per second,
> Our model is still interrupt based, plus NAPI for interrupt mitigation.
>
> 4) I do not want to lock an innocent cpu to send packets from other
> threads/cpu without a tight control.

Article present two modes: 1) a dedicated CPU runs the "arbiter",
2) submitting CPU becomes the arbiter (iif not other CPU is the arbiter).

I imagine we use mode 2.  Which is almost what we already do now.
The qdisc layer only allow a single CPU to be dequeue'ing packets.  This
process can be seen as the "arbiter".  The only difference is that it
will pickup packets from an intermediate queue, and invoke q->enqueue().
(Still keeping the quota in __qdisc_run()).

 
> In the patch I sent, I basically replaced a locked operation
> (spin_lock(&q->busylock)) with another one (xchg()) , but I did not add
> yet another queue before the qdisc ones, bufferbloat forbids.

Is it really bufferbloat to introduce an intermidiate queue at this
point.  The enqueue/submit process, can see that qdisc_is_running, thus
it knows these packets will be picked up very shortly (within 200
cycles) and "arbiter" will invoke q->enqueue() allowing qdisc to react
to bufferbloat.


> The virtual queue here is one packet per cpu, which basically is the
> same than before this patch, since each cpu spinning on busylock has one
> skb to send anyway.
> 
> This is basically a simple extension of MCS locks, where the cpu at the
> head of the queue can queue up to 16 packets, instead of queueing its
> own packet only and give queue owner ship to the following cpu.

I do like MCS locks. You sort of open-coded it. I am impress by the
code, but it really takes some time to read and understand (not
necessarily a bad thing).  I am impress how you get the return code
back (from the remote sender).  I was a problem I've been struggling to
solve but I couldn't.

Thanks for working on this Eric!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22 16:49       ` Eric Dumazet
  2016-06-23 14:22         ` Jesper Dangaard Brouer
@ 2016-06-23 16:21         ` Luigi Rizzo
  1 sibling, 0 replies; 15+ messages in thread
From: Luigi Rizzo @ 2016-06-23 16:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
	John Fastabend

On Wed, Jun 22, 2016 at 6:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
>> On Wed, 22 Jun 2016 07:55:43 -0700
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
>> > > On Tue, 21 Jun 2016 23:16:48 -0700
>> > > Eric Dumazet <edumazet@google.com> wrote:
>> > >
>> > > > First patch adds an additional parameter to ->enqueue() qdisc method
>> > > > so that drops can be done outside of critical section
>> > > > (after locks are released).
>> > > >
>> > > > Then fq_codel can have a small optimization to reduce number of cache
>> > > > lines misses during a drop event
>> > > > (possibly accumulating hundreds of packets to be freed).
>> > > >
>> > > > A small htb change exports the backlog in class dumps.
>> > > >
>> > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
>> > > >
>> > > > This series brings a nice qdisc performance increase (more than 80 %
>> > > > in some cases).
>> > >
>> > > Thanks for working on this Eric! this is great work! :-)
>> >
>> > Thanks Jesper
>> >
>> > I worked yesterday on bulk enqueues, but initial results are not that
>> > great.
>>
>> Hi Eric,
>>
>> This is interesting work! But I think you should read Luigi Rizzo's
>> (Cc'ed) paper on title "A Fast and Practical Software Packet Scheduling
>> Architecture"[1]
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
>>
>> Luigi will be at Netfilter Workshop next week, and will actually
>> present on topic/paper.... you two should talk ;-)
>>
>> The article is not a 100% match for what we need, but there is some
>> good ideas.  The article also have a sort of "prequeue" that
>> enqueue'ing CPUs will place packets into.
>>
>> My understanding of the article:
>>
>> 1. transmitters submit packets to an intermediate queue
>>    (replace q->enqueue call) lockless submit as queue per CPU
>>    (runs in parallel)
>>
>> 2. like we only have _one_ qdisc dequeue process, this process (called
>>    arbiter) empty the intermediate queues, and then invoke q->enqueue()
>>    and q->dequeue(). (in a locked session/region)
>>
>> 3. Packets returned from q->dequeue() is placed on an outgoing
>>    intermediate queue.
>>
>> 4. the transmitter then looks to see there are any packets to drain()
>>    from the outgoing queue.  This can run in parallel.
>>
>> If the transmitter submitting a packet, detect no arbiter is running,
>> it can become the arbiter itself.  Like we do with qdisc_run_begin()
>> setting state __QDISC___STATE_RUNNING.
>>
>> The problem with this scheme is push-back from qdisc->enqueue
>> (NET_XMIT_CN) does not "reach" us.  And push-back in-form of processes
>> blocking on qdisc root lock, but that could be handled by either
>> blocking in article's submit() or returning some congestion return code
>> from submit().
>
> Okay, I see that you prepare upcoming conference in Amsterdam,
> but please keep this thread about existing kernel code, not the one that
> eventually reach a new operating system in 5 years ;)
>
> 1) We _want_ the result of the sends, obviously.
>
> 2) We also want back pressure, without adding complex callbacks and
> ref-counting.
>
> 3) We do not want to burn a cpu per TX queue (at least one per NUMA
> node ???) only to send few packets per second,
> Our model is still interrupt based, plus NAPI for interrupt mitigation.
>
> 4) I do not want to lock an innocent cpu to send packets from other
> threads/cpu without a tight control.
>
> In the patch I sent, I basically replaced a locked operation
> (spin_lock(&q->busylock)) with another one (xchg()) , but I did not add
> yet another queue before the qdisc ones, bufferbloat forbids.
>
> The virtual queue here is one packet per cpu, which basically is the
> same than before this patch, since each cpu spinning on busylock has one
> skb to send anyway.
>
> This is basically a simple extension of MCS locks, where the cpu at the
> head of the queue can queue up to 16 packets, instead of queueing its
> own packet only and give queue owner ship to the following cpu.

Hi Eric (and others),

don't worry, my proposal (PSPAT) is not specifically addressing/targeting
the linux qdisc now, but at the same time it does have any of the
faults you are worried about.

My target, at a high level, is a VM hosting node where the guest VMs
may create large amounts of traffic, maybe most of it doomed to be dropped,
but still consuming theirs and system's resources by creating the
packets and pounding on the xmit calls.

The goal of PSPAT is to let those clients know very early (possibly even
before doing lookups or encapsulation) when the underlying path
to the NIC will be essentially free for transmission, at which
point the sender can complete building the packet and push it out.

To comment on your observations, PSPAT has the following features:

1) it does return the result of the send, which is run by the individual
   thread who submitted the packet when it gets grant to
   transmit (this also covers your point #4). Note that by the time
   you can do a send, the underlying path is free so it should
   not fail unless the link goes down. Conversely, if the user-visible
   queue is full or almost full, you'll get a drop or congestion
   notification even before talking to the arbiter.

2) it does support backpressure, because said thread has direct access
   to the queue's state when it submits the request.
   The queue visible to the thread is the same per-flow queue used
   by the scheduler, so it does not create additional queueing,
   and the qdisc->enqueue() will never fail
   (see #A below for more details on queues)

3) it does not burn a core per tx-queue to run the arbiter.
   The thread that in PSPAT runs the arbiter is essentially
   the NAPI thread, and under normal conditions it is asleep.
   It wakes up only on tx-complete interrupts (or timers)
   and does even less than the NAPI thread, as it does not
   even have to (though it can) run the device_xmit code.

   The only case in which the arbiter runs "continuously"
   (with many options to throttle it) is when the load is so
   high that even the NAPI thread would have to run
   continuously.

A) additional remarks on queues:

the model of operation of PSPAT is that
each flow (however you define it) has its own queue,
and the scheduler handles the collection of queues with some
discipline (round robin, fair queueing, priorities...).
When the arbiter marks a packet as "good to send", the underlying
path to the NIC is virtually free for you to use, so you can expect
zero extra delay (other than your own processing).
Individual queue management policies (codel, RED, ECN,
whatever looks at the individual queue) can be done locally.

If you want a single global queue where everybody can write to without
individual limitations (e.g. a bounded size FIFO with K slots), keep in
mind that this model gives no service guarantees other than

   "if you manage to get in, you'll never wait more than K packets"

But, you may never get in so your delay may be infinite
(as it is the case for me now on the beach with flakey cell coverage).

The obvious way to implement this FIFO service is no scheduler,
one TX ring with K slots and let clients contend on it.

If you expect to use M tx rings and still want the 'K packets delay'
guarantee, then PSPAT may still help because you can create a queue
with one or a few slots per client, and when the arbiter
runs (which is more frequently than K packets' time) it may tell you
"you were lucky" or "sorry no room for your packets, drop them".

Finally:

The paper has a ton more details, but it targets an academic audience
which is generally not very interested in implementation issues.
This is why (besides lack of space and some ignorance on my side)
I tried to avoid suggesting how PSPAT could be implemented in
Linux or FreeBSD or userspace software routers or other architectures.

I am happy to hear comments, criticism and suggestions and discuss
details with you or other interested, however I would still suggest
to have first a look at the paper (perhaps skipping the boring section
with formal proofs of the service guarantees) because the description
that we can give on this mailing list is necessarily incomplete
due to space limitations.

Hope to see many of you next week in Amsterdam.

cheers
luigi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
  2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
                   ` (4 preceding siblings ...)
  2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
@ 2016-06-25 16:20 ` David Miller
  5 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2016-06-25 16:20 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, john.r.fastabend, brouer, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 21 Jun 2016 23:16:48 -0700

> First patch adds an additional parameter to ->enqueue() qdisc method
> so that drops can be done outside of critical section
> (after locks are released).
> 
> Then fq_codel can have a small optimization to reduce number of cache
> lines misses during a drop event
> (possibly accumulating hundreds of packets to be freed).
> 
> A small htb change exports the backlog in class dumps.
> 
> Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> 
> This series brings a nice qdisc performance increase (more than 80 %
> in some cases).

Series applied, thanks Eric.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-06-25 16:20 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-06-22  6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
2016-06-22  6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
2016-06-22 15:14   ` Jesper Dangaard Brouer
2016-06-22  6:16 ` [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb Eric Dumazet
2016-06-22  6:16 ` [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps Eric Dumazet
2016-06-22  6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
2016-06-22 15:03   ` Jesper Dangaard Brouer
2016-06-23  7:26   ` Paolo Abeni
2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
2016-06-22 14:55   ` Eric Dumazet
2016-06-22 15:44     ` Jesper Dangaard Brouer
2016-06-22 16:49       ` Eric Dumazet
2016-06-23 14:22         ` Jesper Dangaard Brouer
2016-06-23 16:21         ` Luigi Rizzo
2016-06-25 16:20 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).