* [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
From: Eric Dumazet @ 2016-06-22 6:16 UTC
To: David S . Miller
Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
Eric Dumazet
The first patch adds an additional parameter to the ->enqueue() qdisc
method so that drops can be done outside of the critical section
(after locks are released).
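To illustrate the deferred-drop idea, here is a minimal userspace sketch. It is not kernel code: struct pkt, struct toy_queue, toy_enqueue(), free_deferred() and toy_xmit() are hypothetical names made up for this example, and a plain boolean stands in for the root qdisc spinlock. Only the list push in defer_drop() follows the __qdisc_drop() helper from patch 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: only the link pointer matters. */
struct pkt {
	struct pkt *next;
};

/* Hypothetical stand-in for a qdisc; 'locked' models the root qdisc lock. */
struct toy_queue {
	bool locked;
	struct pkt *head, *tail;
	unsigned int qlen, limit, drops;
};

/* Same idea as __qdisc_drop() in the patch: chain the packet, free later. */
static void defer_drop(struct pkt *p, struct pkt **to_free)
{
	p->next = *to_free;
	*to_free = p;
}

/* Tail-drop enqueue; must be called with the lock held. */
static int toy_enqueue(struct toy_queue *q, struct pkt *p, struct pkt **to_free)
{
	assert(q->locked);
	if (q->qlen >= q->limit) {
		q->drops++;
		defer_drop(p, to_free);
		return -1;
	}
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->qlen++;
	return 0;
}

/* Free a deferred list; returns how many packets were released. */
static unsigned int free_deferred(const struct toy_queue *q, struct pkt *list)
{
	unsigned int n = 0;

	assert(!q->locked);	/* the point of the exercise: never free under the lock */
	while (list) {
		struct pkt *next = list->next;

		free(list);
		list = next;
		n++;
	}
	return n;
}

/* Same shape as the __dev_xmit_skb() change: lock, enqueue, unlock, then free. */
static int toy_xmit(struct toy_queue *q, struct pkt *p, unsigned int *freed)
{
	struct pkt *to_free = NULL;
	int rc;

	q->locked = true;
	rc = toy_enqueue(q, p, &to_free);
	q->locked = false;

	*freed = free_deferred(q, to_free);
	return rc;
}
```

A caller hands toy_xmit() one heap-allocated packet at a time. Once the queue is at its limit the packet is chained onto to_free inside the locked region and released only after the lock is dropped, so the lock hold time no longer includes the cost of freeing.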
Then fq_codel can have a small optimization to reduce the number of
cache line misses during a drop event
(possibly accumulating hundreds of packets to be freed).
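As a rough illustration of the caching idea named in the title of patch 2, the userspace sketch below copies a size field into a scratch area at the front of a packet structure when the packet is queued, so that a later batch drop reads only memory near the link pointer. Everything here is an assumption made for the example: struct pkt, its layout, cache_mem_usage(), cached_mem_usage() and drop_batch() are hypothetical and are not the names or layout used by the real patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packet layout: 'cb' sits next to 'next', 'truesize' far away. */
struct pkt {
	struct pkt *next;
	char cb[48];		/* scratch area, stand-in for skb->cb */
	char other[192];	/* stand-in for the fields in between */
	unsigned int truesize;
};

/* At enqueue time the packet is hot anyway, so copy truesize into cb. */
static void cache_mem_usage(struct pkt *p)
{
	memcpy(p->cb, &p->truesize, sizeof(p->truesize));
}

/* At drop time, read the cached copy instead of touching p->truesize. */
static unsigned int cached_mem_usage(const struct pkt *p)
{
	unsigned int v;

	memcpy(&v, p->cb, sizeof(v));
	return v;
}

/*
 * Unlink up to 'max' packets from *head, chain them on *to_free for a
 * later free, and return the memory accounted to them.
 */
static unsigned int drop_batch(struct pkt **head, unsigned int max,
			       struct pkt **to_free)
{
	unsigned int mem = 0, i = 0;

	while (*head && i < max) {
		struct pkt *p = *head;

		*head = p->next;
		mem += cached_mem_usage(p);
		p->next = *to_free;
		*to_free = p;
		i++;
	}
	return mem;
}
```

The drop loop touches only next and cb. Whether that saves a cache miss on a real sk_buff depends on the actual structure layout, which this sketch does not model.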
A small htb change exports the backlog in class dumps.
Final patch adds bulk dequeue to qdiscs that were lacking this feature.
This series brings a nice qdisc performance increase (more than 80 %
in some cases).
Eric Dumazet (4):
net_sched: drop packets after root qdisc lock is released
net_sched: fq_codel: cache skb->truesize into skb->cb
net_sched: sch_htb: export class backlog in dumps
net_sched: generalize bulk dequeue
include/net/codel_qdisc.h | 1 +
include/net/sch_generic.h | 48 ++++++++++++++++++++---------
net/core/dev.c | 7 +++--
net/sched/sch_atm.c | 9 +++---
net/sched/sch_blackhole.c | 5 +--
net/sched/sch_cbq.c | 7 +++--
net/sched/sch_choke.c | 16 +++++-----
net/sched/sch_codel.c | 8 +++--
net/sched/sch_drr.c | 7 +++--
net/sched/sch_dsmark.c | 9 +++---
net/sched/sch_fifo.c | 15 +++++----
net/sched/sch_fq.c | 7 +++--
net/sched/sch_fq_codel.c | 22 +++++++------
net/sched/sch_generic.c | 78 ++++++++++++++++++++++++++++++++++++++---------
net/sched/sch_gred.c | 7 +++--
net/sched/sch_hfsc.c | 6 ++--
net/sched/sch_hhf.c | 10 +++---
net/sched/sch_htb.c | 24 ++++++++++-----
net/sched/sch_multiq.c | 7 +++--
net/sched/sch_netem.c | 25 +++++++++------
net/sched/sch_pie.c | 5 +--
net/sched/sch_plug.c | 5 +--
net/sched/sch_prio.c | 4 +--
net/sched/sch_qfq.c | 7 +++--
net/sched/sch_red.c | 7 +++--
net/sched/sch_sfb.c | 7 +++--
net/sched/sch_sfq.c | 8 ++---
net/sched/sch_tbf.c | 16 +++++-----
net/sched/sch_teql.c | 4 +--
29 files changed, 247 insertions(+), 134 deletions(-)
--
2.8.0.rc3.226.g39d4020
* [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released
From: Eric Dumazet @ 2016-06-22 6:16 UTC
To: David S . Miller
Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
Eric Dumazet
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while the qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ's goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue() methods, and to the qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 41 ++++++++++++++++++++++++++++++-----------
net/core/dev.c | 7 +++++--
net/sched/sch_atm.c | 9 +++++----
net/sched/sch_blackhole.c | 5 +++--
net/sched/sch_cbq.c | 7 ++++---
net/sched/sch_choke.c | 16 +++++++++-------
net/sched/sch_codel.c | 8 +++++---
net/sched/sch_drr.c | 7 ++++---
net/sched/sch_dsmark.c | 9 +++++----
net/sched/sch_fifo.c | 15 +++++++++------
net/sched/sch_fq.c | 7 ++++---
net/sched/sch_fq_codel.c | 15 +++++++++------
net/sched/sch_generic.c | 10 ++++++----
net/sched/sch_gred.c | 7 ++++---
net/sched/sch_hfsc.c | 6 +++---
net/sched/sch_hhf.c | 10 +++++-----
net/sched/sch_htb.c | 10 ++++++----
net/sched/sch_multiq.c | 7 ++++---
net/sched/sch_netem.c | 25 +++++++++++++++----------
net/sched/sch_pie.c | 5 +++--
net/sched/sch_plug.c | 5 +++--
net/sched/sch_prio.c | 4 ++--
net/sched/sch_qfq.c | 7 ++++---
net/sched/sch_red.c | 7 ++++---
net/sched/sch_sfb.c | 7 ++++---
net/sched/sch_sfq.c | 8 ++++----
net/sched/sch_tbf.c | 16 +++++++++-------
net/sched/sch_teql.c | 4 ++--
28 files changed, 170 insertions(+), 114 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4f7cee8344c4..04e84c07c94f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -37,8 +37,10 @@ struct qdisc_size_table {
 };

 struct Qdisc {
-	int			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
-	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
+	int			(*enqueue)(struct sk_buff *skb,
+					   struct Qdisc *sch,
+					   struct sk_buff **to_free);
+	struct sk_buff *	(*dequeue)(struct Qdisc *sch);
 	unsigned int		flags;
 #define TCQ_F_BUILTIN		1
 #define TCQ_F_INGRESS		2
@@ -160,7 +162,9 @@ struct Qdisc_ops {
 	char			id[IFNAMSIZ];
 	int			priv_size;

-	int			(*enqueue)(struct sk_buff *, struct Qdisc *);
+	int			(*enqueue)(struct sk_buff *skb,
+					   struct Qdisc *sch,
+					   struct sk_buff **to_free);
 	struct sk_buff *	(*dequeue)(struct Qdisc *);
 	struct sk_buff *	(*peek)(struct Qdisc *);

@@ -498,10 +502,11 @@ static inline void qdisc_calculate_pkt_len(struct sk_buff *skb,
 #endif
 }

-static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				struct sk_buff **to_free)
 {
 	qdisc_calculate_pkt_len(skb, sch);
-	return sch->enqueue(skb, sch);
+	return sch->enqueue(skb, sch, to_free);
 }

 static inline bool qdisc_is_percpu_stats(const struct Qdisc *q)
@@ -626,24 +631,36 @@ static inline struct sk_buff *qdisc_dequeue_head(struct Qdisc *sch)
 	return __qdisc_dequeue_head(sch, &sch->q);
 }

+/* Instead of calling kfree_skb() while root qdisc lock is held,
+ * queue the skb for future freeing at end of __dev_xmit_skb()
+ */
+static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
+{
+	skb->next = *to_free;
+	*to_free = skb;
+}
+
 static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
-						   struct sk_buff_head *list)
+						   struct sk_buff_head *list,
+						   struct sk_buff **to_free)
 {
 	struct sk_buff *skb = __skb_dequeue(list);

 	if (likely(skb != NULL)) {
 		unsigned int len = qdisc_pkt_len(skb);
+
 		qdisc_qstats_backlog_dec(sch, skb);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return len;
 	}

 	return 0;
 }

-static inline unsigned int qdisc_queue_drop_head(struct Qdisc *sch)
+static inline unsigned int qdisc_queue_drop_head(struct Qdisc *sch,
+						 struct sk_buff **to_free)
 {
-	return __qdisc_queue_drop_head(sch, &sch->q);
+	return __qdisc_queue_drop_head(sch, &sch->q, to_free);
 }

 static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
@@ -724,9 +741,11 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
 	qdisc_qstats_drop(sch);
 }

-static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
+
+static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	qdisc_qstats_drop(sch);

 	return NET_XMIT_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index d40593b3b9fb..aba10d2a8bc3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3070,6 +3070,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct netdev_queue *txq)
 {
 	spinlock_t *root_lock = qdisc_lock(q);
+	struct sk_buff *to_free = NULL;
 	bool contended;
 	int rc;

@@ -3086,7 +3087,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,

 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-		kfree_skb(skb);
+		__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
 	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
 		   qdisc_run_begin(q)) {
@@ -3109,7 +3110,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,

 		rc = NET_XMIT_SUCCESS;
 	} else {
-		rc = q->enqueue(skb, q) & NET_XMIT_MASK;
+		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
@@ -3119,6 +3120,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+	if (unlikely(to_free))
+		kfree_skb_list(to_free);
 	if (unlikely(contended))
 		spin_unlock(&q->busylock);
 	return rc;
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index e04ea6994d1c..481e4f12aeb4 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -357,7 +357,8 @@ static struct tcf_proto __rcu **atm_tc_find_tcf(struct Qdisc *sch,

 /* --------------------------- Qdisc operations ---------------------------- */

-static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
 {
 	struct atm_qdisc_data *p = qdisc_priv(sch);
 	struct atm_flow_data *flow;
@@ -398,10 +399,10 @@ done:
 		switch (result) {
 		case TC_ACT_QUEUED:
 		case TC_ACT_STOLEN:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
 		case TC_ACT_SHOT:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			goto drop;
 		case TC_ACT_RECLASSIFY:
 			if (flow->excess)
@@ -413,7 +414,7 @@ done:
 #endif
 	}

-	ret = qdisc_enqueue(skb, flow->q);
+	ret = qdisc_enqueue(skb, flow->q, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
 drop: __maybe_unused
 		if (net_xmit_drop_count(ret)) {
diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index 3fee70d9814f..c98a61e980ba 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -17,9 +17,10 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>

-static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_SUCCESS;
 }

diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index a29fd811d7b9..beb554aa8cfb 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -358,7 +358,8 @@ cbq_mark_toplevel(struct cbq_sched_data *q, struct cbq_class *cl)
 }

 static int
-cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+	    struct sk_buff **to_free)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
 	int uninitialized_var(ret);
@@ -370,11 +371,11 @@ cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}

-	ret = qdisc_enqueue(skb, cl->q);
+	ret = qdisc_enqueue(skb, cl->q, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		cbq_mark_toplevel(q, cl);
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 789b69ee9e51..3b6d5bd69101 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -115,7 +115,8 @@ static void choke_zap_tail_holes(struct choke_sched_data *q)
 }

 /* Drop packet from queue array by creating a "hole" */
-static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
+static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx,
+			      struct sk_buff **to_free)
 {
 	struct choke_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb = q->tab[idx];
@@ -129,7 +130,7 @@ static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)

 	qdisc_qstats_backlog_dec(sch, skb);
 	qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	--sch->q.qlen;
 }

@@ -261,7 +262,8 @@ static bool choke_match_random(const struct choke_sched_data *q,
 	return choke_match_flow(oskb, nskb);
 }

-static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	int ret = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	struct choke_sched_data *q = qdisc_priv(sch);
@@ -288,7 +290,7 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		/* Draw a packet at random from queue and compare flow */
 		if (choke_match_random(q, skb, &idx)) {
 			q->stats.matched++;
-			choke_drop_by_idx(sch, idx);
+			choke_drop_by_idx(sch, idx, to_free);
 			goto congestion_drop;
 		}

@@ -331,16 +333,16 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}

 	q->stats.pdrop++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);

 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;

 other_drop:
 	if (ret & __NET_XMIT_BYPASS)
 		qdisc_qstats_drop(sch);
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	return ret;
 }

diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index c5bc424e3b3c..4002df3c7d9f 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -82,7 +82,8 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 {
 	struct Qdisc *sch = ctx;

-	qdisc_drop(skb, sch);
+	kfree_skb(skb);
+	qdisc_qstats_drop(sch);
 }

 static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
@@ -107,7 +108,8 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 	return skb;
 }

-static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			       struct sk_buff **to_free)
 {
 	struct codel_sched_data *q;

@@ -117,7 +119,7 @@ static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 	q = qdisc_priv(sch);
 	q->drop_overlimit++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }

 static const struct nla_policy codel_policy[TCA_CODEL_MAX + 1] = {
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 22609e4e845f..8af5c59eef84 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -350,7 +350,8 @@ static struct drr_class *drr_classify(struct sk_buff *skb, struct Qdisc *sch,
 	return NULL;
 }

-static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct drr_sched *q = qdisc_priv(sch);
 	struct drr_class *cl;
@@ -360,11 +361,11 @@ static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (err & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return err;
 	}

-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index b9ba5f658528..1308bbf460f7 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -191,7 +191,8 @@ static inline struct tcf_proto __rcu **dsmark_find_tcf(struct Qdisc *sch,

 /* --------------------------- Qdisc operations ---------------------------- */

-static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			  struct sk_buff **to_free)
 {
 	struct dsmark_qdisc_data *p = qdisc_priv(sch);
 	int err;
@@ -234,7 +235,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 #ifdef CONFIG_NET_CLS_ACT
 		case TC_ACT_QUEUED:
 		case TC_ACT_STOLEN:
-			kfree_skb(skb);
+			__qdisc_drop(skb, to_free);
 			return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;

 		case TC_ACT_SHOT:
@@ -251,7 +252,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		}
 	}

-	err = qdisc_enqueue(skb, p->q);
+	err = qdisc_enqueue(skb, p->q, to_free);
 	if (err != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(err))
 			qdisc_qstats_drop(sch);
@@ -264,7 +265,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_SUCCESS;

 drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 }

diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index dea70e3ef0ba..6ea0db427f91 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -19,29 +19,32 @@

 /* 1 band FIFO pseudo-"scheduler" */

-static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= sch->limit))
 		return qdisc_enqueue_tail(skb, sch);

-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }

-static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	if (likely(skb_queue_len(&sch->q) < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);

-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }

-static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			      struct sk_buff **to_free)
 {
 	if (likely(skb_queue_len(&sch->q) < sch->limit))
 		return qdisc_enqueue_tail(skb, sch);

 	/* queue full, remove one skb to fulfill the limit */
-	__qdisc_queue_drop_head(sch, &sch->q);
+	__qdisc_queue_drop_head(sch, &sch->q, to_free);
 	qdisc_qstats_drop(sch);
 	qdisc_enqueue_tail(skb, sch);

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 6eb06674f778..e5458b99e09c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -368,18 +368,19 @@ static void flow_queue_add(struct fq_flow *flow, struct sk_buff *skb)
 	}
 }

-static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		      struct sk_buff **to_free)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
 	struct fq_flow *f;

 	if (unlikely(sch->q.qlen >= sch->limit))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);

 	f = fq_classify(skb, q);
 	if (unlikely(f->qlen >= q->flow_plimit && f != &q->internal)) {
 		q->stat_flows_plimit++;
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);
 	}

 	f->qlen++;
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2dc0a849515a..f715195459c9 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -139,7 +139,8 @@ static inline void flow_queue_add(struct fq_codel_flow *flow,
 	skb->next = NULL;
 }

-static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
+static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
+				  struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
@@ -172,7 +173,7 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
 		skb = dequeue_head(flow);
 		len += qdisc_pkt_len(skb);
 		mem += skb->truesize;
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 	} while (++i < max_packets && len < threshold);

 	flow->dropped += i;
@@ -184,7 +185,8 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets)
 	return idx;
 }

-static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			    struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	unsigned int idx, prev_backlog, prev_qlen;
@@ -197,7 +199,7 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (idx == 0) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}
 	idx--;
@@ -229,7 +231,7 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	 * So instead of dropping a single packet, drop half of its backlog
 	 * with a 64 packets limit to not add a too big cpu spike here.
 	 */
-	ret = fq_codel_drop(sch, q->drop_batch_size);
+	ret = fq_codel_drop(sch, q->drop_batch_size, to_free);

 	prev_qlen -= sch->q.qlen;
 	prev_backlog -= sch->qstats.backlog;
@@ -276,7 +278,8 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 {
 	struct Qdisc *sch = ctx;

-	qdisc_drop(skb, sch);
+	kfree_skb(skb);
+	qdisc_qstats_drop(sch);
 }

 static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 773b632e1e33..ff86606954f2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -348,9 +348,10 @@ EXPORT_SYMBOL(netif_carrier_off);
    cheaper.
  */

-static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
+static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+			struct sk_buff **to_free)
 {
-	kfree_skb(skb);
+	__qdisc_drop(skb, to_free);
 	return NET_XMIT_CN;
 }

@@ -439,7 +440,8 @@ static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
 	return priv->q + band;
 }

-static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
+static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+			      struct sk_buff **to_free)
 {
 	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
 		int band = prio2band[skb->priority & TC_PRIO_MAX];
@@ -451,7 +453,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
 		return __qdisc_enqueue_tail(skb, qdisc, list);
 	}

-	return qdisc_drop(skb, qdisc);
+	return qdisc_drop(skb, qdisc, to_free);
 }

 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index b5fb63c7be02..c78a093c551a 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -149,7 +149,8 @@ static inline int gred_use_harddrop(struct gred_sched *t)
 	return t->red_flags & TC_RED_HARDDROP;
 }

-static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
 {
 	struct gred_sched_data *q = NULL;
 	struct gred_sched *t = qdisc_priv(sch);
@@ -237,10 +238,10 @@ static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch)

 	q->stats.pdrop++;
 drop:
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);

 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 }

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index bd08c363a26d..8cb5eff7b79c 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1572,7 +1572,7 @@ hfsc_dump_qdisc(struct Qdisc *sch, struct sk_buff *skb)
 }

 static int
-hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct hfsc_class *cl;
 	int uninitialized_var(err);
@@ -1581,11 +1581,11 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (cl == NULL) {
 		if (err & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return err;
 	}

-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index c44593b8e65a..e3d0458af17b 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -345,7 +345,7 @@ static void bucket_add(struct wdrr_bucket *bucket, struct sk_buff *skb)
 	skb->next = NULL;
 }

-static unsigned int hhf_drop(struct Qdisc *sch)
+static unsigned int hhf_drop(struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
 	struct wdrr_bucket *bucket;
@@ -359,16 +359,16 @@ static unsigned int hhf_drop(struct Qdisc *sch)
 		struct sk_buff *skb = dequeue_head(bucket);

 		sch->q.qlen--;
-		qdisc_qstats_drop(sch);
 		qdisc_qstats_backlog_dec(sch, skb);
-		kfree_skb(skb);
+		qdisc_drop(skb, sch, to_free);
 	}

 	/* Return id of the bucket from which the packet was dropped. */
 	return bucket - q->buckets;
 }

-static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
 	enum wdrr_bucket_idx idx;
@@ -406,7 +406,7 @@ static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	/* Return Congestion Notification only if we dropped a packet from this
 	 * bucket.
 	 */
-	if (hhf_drop(sch) == idx)
+	if (hhf_drop(sch, to_free) == idx)
 		return NET_XMIT_CN;

 	/* As we dropped a packet, better let upper stack know this. */
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index a454605ab5cb..f3882259c385 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -569,7 +569,8 @@ static inline void htb_deactivate(struct htb_sched *q, struct htb_class *cl)
 	list_del_init(&cl->un.leaf.drop_list);
 }

-static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	int uninitialized_var(ret);
 	struct htb_sched *q = qdisc_priv(sch);
@@ -581,16 +582,17 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 			__skb_queue_tail(&q->direct_queue, skb);
 			q->direct_pkts++;
 		} else {
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		}
 #ifdef CONFIG_NET_CLS_ACT
 	} else if (!cl) {
 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 #endif
-	} else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q)) != NET_XMIT_SUCCESS) {
+	} else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q,
+					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
 			cl->qstats.drops++;
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 5ea93305d705..9ffbb025b37e 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -65,7 +65,8 @@ multiq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 }

 static int
-multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+	       struct sk_buff **to_free)
 {
 	struct Qdisc *qdisc;
 	int ret;
@@ -76,12 +77,12 @@ multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch)

 		if (ret & __NET_XMIT_BYPASS)
 			qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return ret;
 	}
 #endif

-	ret = qdisc_enqueue(skb, qdisc);
+	ret = qdisc_enqueue(skb, qdisc, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index e271967439bf..ccca8ca4c722 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -397,7 +397,8 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
  * when we statistically choose to corrupt one, we instead segment it, returning
  * the first packet to be corrupted, and re-enqueue the remaining frames
  */
-static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
+static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
+				     struct sk_buff **to_free)
 {
 	struct sk_buff *segs;
 	netdev_features_t features = netif_skb_features(skb);
@@ -405,7 +406,7 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);

 	if (IS_ERR_OR_NULL(segs)) {
-		qdisc_drop(skb, sch);
+		qdisc_drop(skb, sch, to_free);
 		return NULL;
 	}
 	consume_skb(skb);
@@ -418,7 +419,8 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch)
  * NET_XMIT_DROP: queue length didn't change.
  * NET_XMIT_SUCCESS: one skb was queued.
  */
-static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			 struct sk_buff **to_free)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
 	/* We don't fill cb now as skb_unshare() may invalidate it */
@@ -443,7 +445,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 	if (count == 0) {
 		qdisc_qstats_drop(sch);
-		kfree_skb(skb);
+		__qdisc_drop(skb, to_free);
 		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	}

@@ -463,7 +465,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		u32 dupsave = q->duplicate; /* prevent duplicating a dup... */

 		q->duplicate = 0;
-		rootq->enqueue(skb2, rootq);
+		rootq->enqueue(skb2, rootq, to_free);
 		q->duplicate = dupsave;
 	}

@@ -475,7 +477,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	 */
 	if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor)) {
 		if (skb_is_gso(skb)) {
-			segs = netem_segment(skb, sch);
+			segs = netem_segment(skb, sch, to_free);
 			if (!segs)
 				return NET_XMIT_DROP;
 		} else {
@@ -488,7 +490,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		if (!(skb = skb_unshare(skb, GFP_ATOMIC)) ||
 		    (skb->ip_summed == CHECKSUM_PARTIAL &&
 		     skb_checksum_help(skb))) {
-			rc = qdisc_drop(skb, sch);
+			rc = qdisc_drop(skb, sch, to_free);
 			goto finish_segs;
 		}

@@ -497,7 +499,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}

 	if (unlikely(skb_queue_len(&sch->q) >= sch->limit))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);

 	qdisc_qstats_backlog_inc(sch, skb);

@@ -557,7 +559,7 @@ finish_segs:
 			segs->next = NULL;
 			qdisc_skb_cb(segs)->pkt_len = segs->len;
 			last_len = segs->len;
-			rc = qdisc_enqueue(segs, sch);
+			rc = qdisc_enqueue(segs, sch, to_free);
 			if (rc != NET_XMIT_SUCCESS) {
 				if (net_xmit_drop_count(rc))
 					qdisc_qstats_drop(sch);
@@ -615,8 +617,11 @@ deliver:
 #endif

 			if (q->qdisc) {
-				int err = qdisc_enqueue(skb, q->qdisc);
+				struct sk_buff *to_free = NULL;
+				int err;

+				err = qdisc_enqueue(skb, q->qdisc, &to_free);
+				kfree_skb_list(to_free);
 				if (unlikely(err != NET_XMIT_SUCCESS)) {
 					if (net_xmit_drop_count(err)) {
 						qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 912a46a5d02e..a570b0bb254c 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -134,7 +134,8 @@ static bool drop_early(struct Qdisc *sch, u32 packet_size)
 	return false;
 }

-static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
 {
 	struct pie_sched_data *q = qdisc_priv(sch);
 	bool enqueue = false;
@@ -166,7 +167,7 @@ static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)

 out:
 	q->stats.dropped++;
-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }

 static const struct nla_policy pie_policy[TCA_PIE_MAX + 1] = {
diff --git a/net/sched/sch_plug.c b/net/sched/sch_plug.c
index a12cd37680f8..1c6cbab3e7b9 100644
--- a/net/sched/sch_plug.c
+++ b/net/sched/sch_plug.c
@@ -88,7 +88,8 @@ struct plug_sched_data {
 	u32 pkts_to_release;
 };

-static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
 {
 	struct plug_sched_data *q = qdisc_priv(sch);

@@ -98,7 +99,7 @@ static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		return qdisc_enqueue_tail(skb, sch);
 	}

-	return qdisc_drop(skb, sch);
+	return qdisc_drop(skb, sch, to_free);
 }

 static struct sk_buff *plug_dequeue(struct Qdisc *sch)
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index de492682caee..f4d443aeae54 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -67,7 +67,7 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 }

 static int
-prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct Qdisc *qdisc;
 	int ret;
@@ -83,7 +83,7 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 #endif

-	ret = qdisc_enqueue(skb, qdisc);
+	ret = qdisc_enqueue(skb, qdisc, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 0427fa8b23f2..f27ffee106f6 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1217,7 +1217,8 @@ static struct qfq_aggregate *qfq_choose_next_agg(struct qfq_sched *q)
 	return agg;
 }

-static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct qfq_sched *q = qdisc_priv(sch);
 	struct qfq_class *cl;
@@ -1240,11 +1241,11 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 				     qdisc_pkt_len(skb));
 		if (err) {
 			cl->qstats.drops++;
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		}
 	}

-	err = qdisc_enqueue(skb, cl->qdisc);
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		pr_debug("qfq_enqueue: enqueue failed %d\n", err);
 		if (net_xmit_drop_count(err)) {
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index a0d57530335e..249b2a18acbd 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -56,7 +56,8 @@ static inline int red_use_harddrop(struct red_sched_data *q)
 	return q->flags & TC_RED_HARDDROP;
 }

-static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct red_sched_data *q = qdisc_priv(sch);
 	struct Qdisc *child = q->qdisc;
@@ -95,7 +96,7 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		break;
 	}

-	ret = qdisc_enqueue(skb, child);
+	ret = qdisc_enqueue(skb, child, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
@@ -106,7 +107,7 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return ret;

 congestion_drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 }

diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index c69611640fa5..add3cc7d37ec 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -275,7 +275,8 @@ static bool sfb_classify(struct sk_buff *skb, struct tcf_proto *fl,
 	return false;
 }

-static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {

 	struct sfb_sched_data *q = qdisc_priv(sch);
@@ -397,7 +398,7 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}

 enqueue:
-	ret = qdisc_enqueue(skb, child);
+	ret = qdisc_enqueue(skb, child, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		sch->q.qlen++;
 		increment_qlen(skb, q);
@@ -408,7 +409,7 @@ enqueue:
 	return ret;

 drop:
-	qdisc_drop(skb, sch);
+	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 other_drop:
 	if (ret & __NET_XMIT_BYPASS)
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 57d118b41cad..7f195ed4d568 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -343,7 +343,7 @@ static int sfq_headdrop(const struct sfq_sched_data *q)
 }

 static int
-sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	unsigned int hash, dropped;
@@ -367,7 +367,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (x == SFQ_EMPTY_SLOT) {
 		x = q->dep[0].next; /* get a free slot */
 		if (x >= SFQ_MAX_FLOWS)
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);
 		q->ht[hash] = x;
 		slot = &q->slots[x];
 		slot->hash = hash;
@@ -424,14 +424,14 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	if (slot->qlen >= q->maxdepth) {
 congestion_drop:
 		if (!sfq_headdrop(q))
-			return qdisc_drop(skb, sch);
+			return qdisc_drop(skb, sch, to_free);

 		/* We know we have at least one packet in queue */
 		head = slot_dequeue_head(slot);
 		delta = qdisc_pkt_len(head) - qdisc_pkt_len(skb);
 		sch->qstats.backlog -= delta;
 		slot->backlog -= delta;
-		qdisc_drop(head, sch);
+		qdisc_drop(head, sch, to_free);

 		slot_queue_add(slot, skb);
 		return NET_XMIT_CN;
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index c12df84d1078..303355c449ab 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -155,7 +155,8 @@ static unsigned int skb_gso_mac_seglen(const struct sk_buff *skb)
 /* GSO packet is too big, segment it so that tbf can transmit
  * each segment in time
  */
-static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
+static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *segs, *nskb;
@@ -166,7 +167,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);

 	if (IS_ERR_OR_NULL(segs))
-		return qdisc_drop(skb, sch);
+		return qdisc_drop(skb, sch, to_free);

 	nb = 0;
 	while (segs) {
@@ -174,7 +175,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 		segs->next = NULL;
 		qdisc_skb_cb(segs)->pkt_len = segs->len;
 		len += segs->len;
-		ret = qdisc_enqueue(segs, q->qdisc);
+		ret = qdisc_enqueue(segs, q->qdisc, to_free);
 		if (ret != NET_XMIT_SUCCESS) {
 			if (net_xmit_drop_count(ret))
 				qdisc_qstats_drop(sch);
@@ -190,17 +191,18 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch)
 	return nb > 0 ?
NET_XMIT_SUCCESS : NET_XMIT_DROP; } -static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch) +static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch, + struct sk_buff **to_free) { struct tbf_sched_data *q = qdisc_priv(sch); int ret; if (qdisc_pkt_len(skb) > q->max_size) { if (skb_is_gso(skb) && skb_gso_mac_seglen(skb) <= q->max_size) - return tbf_segment(skb, sch); - return qdisc_drop(skb, sch); + return tbf_segment(skb, sch, to_free); + return qdisc_drop(skb, sch, to_free); } - ret = qdisc_enqueue(skb, q->qdisc); + ret = qdisc_enqueue(skb, q->qdisc, to_free); if (ret != NET_XMIT_SUCCESS) { if (net_xmit_drop_count(ret)) qdisc_qstats_drop(sch); diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c index e02687185a59..2cd9b4478b92 100644 --- a/net/sched/sch_teql.c +++ b/net/sched/sch_teql.c @@ -77,7 +77,7 @@ struct teql_sched_data { /* "teql*" qdisc routines */ static int -teql_enqueue(struct sk_buff *skb, struct Qdisc *sch) +teql_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { struct net_device *dev = qdisc_dev(sch); struct teql_sched_data *q = qdisc_priv(sch); @@ -87,7 +87,7 @@ teql_enqueue(struct sk_buff *skb, struct Qdisc *sch) return NET_XMIT_SUCCESS; } - return qdisc_drop(skb, sch); + return qdisc_drop(skb, sch, to_free); } static struct sk_buff * -- 2.8.0.rc3.226.g39d4020 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released
2016-06-22 6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
@ 2016-06-22 15:14 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet, brouer

On Tue, 21 Jun 2016 23:16:49 -0700
Eric Dumazet <edumazet@google.com> wrote:

> Qdisc performance suffers when packets are dropped at enqueue()
> time because drops (kfree_skb()) are done while qdisc lock is held,
> delaying a dequeue() draining the queue.
>
> Nominal throughput can be reduced by 50 % when this happens,
> at a time we would like the dequeue() to proceed as fast as possible.
>
> Even FQ is vulnerable to this problem, while one of FQ goals was
> to provide some flow isolation.
>
> This patch adds a 'struct sk_buff **to_free' parameter to all
> qdisc->enqueue(), and in qdisc_drop() helper.
>
> I measured a performance increase of up to 12 %, but this patch
> is a prereq so that future batches in enqueue() can fly.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
[...]
> +/* Instead of calling kfree_skb() while root qdisc lock is held,
> + * queue the skb for future freeing at end of __dev_xmit_skb()
> + */
> +static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
> +{
> +	skb->next = *to_free;
> +	*to_free = skb;
> +}
> +
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d40593b3b9fb..aba10d2a8bc3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3070,6 +3070,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  				 struct netdev_queue *txq)
>  {
>  	spinlock_t *root_lock = qdisc_lock(q);
> +	struct sk_buff *to_free = NULL;
>  	bool contended;
>  	int rc;
>
> @@ -3086,7 +3087,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> -		kfree_skb(skb);
> +		__qdisc_drop(skb, &to_free);
>  		rc = NET_XMIT_DROP;
>  	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>  		   qdisc_run_begin(q)) {
> @@ -3109,7 +3110,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>
>  		rc = NET_XMIT_SUCCESS;
>  	} else {
> -		rc = q->enqueue(skb, q) & NET_XMIT_MASK;
> +		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>  		if (qdisc_run_begin(q)) {
>  			if (unlikely(contended)) {
>  				spin_unlock(&q->busylock);
> @@ -3119,6 +3120,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		}
>  	}
>  	spin_unlock(root_lock);
> +	if (unlikely(to_free))
> +		kfree_skb_list(to_free);

Great, now there is a good argument for implementing kmem_cache bulk
freeing inside kfree_skb_list(). I did an ugly PoC implementation
once, but there was no use-case that really needed the performance
boost.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 15+ messages in thread
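[Editorial sketch, not part of the posted series.] The deferred-drop
pattern discussed above can be illustrated in plain userspace C. This is
a minimal sketch under stated assumptions: toy_pkt, toy_queue, toy_drop,
toy_enqueue and toy_xmit are hypothetical names invented here, the
`locked` flag only models the root qdisc lock, and free() stands in for
kfree_skb_list(). It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: only the list link matters. */
struct toy_pkt {
	struct toy_pkt *next;
};

/* Hypothetical stand-in for a qdisc; `locked` models the root qdisc lock. */
struct toy_queue {
	bool locked;
	struct toy_pkt *head, *tail;
	unsigned int qlen, limit;
};

/* Same shape as __qdisc_drop() in the patch: an O(1) push, no freeing. */
static void toy_drop(struct toy_pkt *p, struct toy_pkt **to_free)
{
	p->next = *to_free;
	*to_free = p;
}

/* Tail-drop enqueue. Caller must hold the lock.
 * Returns 0 if the packet was queued, 1 if it was dropped.
 */
static int toy_enqueue(struct toy_queue *q, struct toy_pkt *p,
		       struct toy_pkt **to_free)
{
	assert(q->locked);
	if (q->qlen >= q->limit) {
		toy_drop(p, to_free);
		return 1;
	}
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->qlen++;
	return 0;
}

/* Mirrors the flow of the patched __dev_xmit_skb(): lock, enqueue,
 * unlock, and only then free whatever was dropped.
 * Returns the number of packets freed.
 */
static unsigned int toy_xmit(struct toy_queue *q, struct toy_pkt *p, int *rc)
{
	struct toy_pkt *to_free = NULL;
	unsigned int freed = 0;

	q->locked = true;		/* models spin_lock(root_lock) */
	*rc = toy_enqueue(q, p, &to_free);
	q->locked = false;		/* models spin_unlock(root_lock) */

	/* The expensive part runs only after the lock is released. */
	while (to_free) {
		struct toy_pkt *next = to_free->next;

		assert(!q->locked);
		free(to_free);
		to_free = next;
		freed++;
	}
	return freed;
}
```

With a limit of one packet, a first toy_xmit() queues its packet and
frees nothing, while a second one reports a drop and frees that packet
after the lock has been released. The point of the design is that the
critical section contains only pointer updates, so a concurrent dequeue
is not delayed by the cost of freeing.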
* [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb
2016-06-22 6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
2016-06-22 6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
@ 2016-06-22 6:16 ` Eric Dumazet
2016-06-22 6:16 ` [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps Eric Dumazet
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 6:16 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
Eric Dumazet

Now we defer skb drops, it makes sense to keep a copy of
skb->truesize in struct codel_skb_cb to avoid one cache line
miss per dropped skb in fq_codel_drop(), to reduce latencies
a bit further.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/codel_qdisc.h | 1 +
 net/sched/sch_fq_codel.c  | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/net/codel_qdisc.h b/include/net/codel_qdisc.h
index 8144d9cd2908..098630f83a55 100644
--- a/include/net/codel_qdisc.h
+++ b/include/net/codel_qdisc.h
@@ -52,6 +52,7 @@
 /* Qdiscs using codel plugin must use codel_skb_cb in their own cb[] */
 struct codel_skb_cb {
 	codel_time_t enqueue_time;
+	unsigned int mem_usage;
 };
 
 static struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index f715195459c9..a5ea0e9b6be4 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -172,7 +172,7 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 	do {
 		skb = dequeue_head(flow);
 		len += qdisc_pkt_len(skb);
-		mem += skb->truesize;
+		mem += get_codel_cb(skb)->mem_usage;
 		__qdisc_drop(skb, to_free);
 	} while (++i < max_packets && len < threshold);
 
@@ -216,7 +216,8 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		flow->deficit = q->quantum;
 		flow->dropped = 0;
 	}
-	q->memory_usage += skb->truesize;
+	get_codel_cb(skb)->mem_usage = skb->truesize;
+	q->memory_usage += get_codel_cb(skb)->mem_usage;
 	memory_limited = q->memory_usage > q->memory_limit;
 	if (++sch->q.qlen <= sch->limit && !memory_limited)
 		return NET_XMIT_SUCCESS;
@@ -267,7 +268,7 @@ static struct sk_buff *dequeue_func(struct codel_vars *vars, void *ctx)
 	if (flow->head) {
 		skb = dequeue_head(flow);
 		q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
-		q->memory_usage -= skb->truesize;
+		q->memory_usage -= get_codel_cb(skb)->mem_usage;
 		sch->q.qlen--;
 		sch->qstats.backlog -= qdisc_pkt_len(skb);
 	}
--
2.8.0.rc3.226.g39d4020
^ permalink raw reply related [flat|nested] 15+ messages in thread
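[Editorial sketch, not part of the posted series.] The idea of the patch
above, copying a size field into the per-packet control block so that
the drop path never touches a second cache line, can be sketched in
userspace C. Everything here is an assumption made for illustration:
toy_cb, toy_pkt, toy_flow and the helpers are hypothetical, the 192-byte
pad merely stands in for the fields that separate cb[] from truesize in
a real sk_buff, and 64 bytes is assumed as the cache line size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control block, loosely modelled on struct codel_skb_cb. */
struct toy_cb {
	unsigned int enqueue_time;
	unsigned int mem_usage;		/* copy of truesize, taken at enqueue */
};

/* Hypothetical packet: cb sits next to the list link, truesize far away. */
struct toy_pkt {
	struct toy_pkt *next;
	struct toy_cb cb;
	unsigned char pad[192];		/* stands in for the fields in between */
	unsigned int truesize;
};

/* Assumption: 64-byte cache lines. */
_Static_assert(offsetof(struct toy_pkt, truesize) -
	       offsetof(struct toy_pkt, cb) >= 64,
	       "truesize should be at least one cache line away from cb");

struct toy_flow {
	struct toy_pkt *head, *tail;
	unsigned int memory_usage;
};

/* Enqueue reads truesize once, while the packet is hot anyway. */
static void toy_enqueue(struct toy_flow *f, struct toy_pkt *p)
{
	p->cb.mem_usage = p->truesize;
	f->memory_usage += p->cb.mem_usage;
	p->next = NULL;
	if (f->tail)
		f->tail->next = p;
	else
		f->head = p;
	f->tail = p;
}

/* Drop up to max_packets from the head of the flow, chaining them on
 * *to_free. Only `next` and `cb` are read, never `truesize`.
 * Returns the amount of memory accounted to the dropped packets.
 */
static unsigned int toy_drop_head(struct toy_flow *f, unsigned int max_packets,
				  struct toy_pkt **to_free)
{
	unsigned int mem = 0, i = 0;

	while (f->head && i < max_packets) {
		struct toy_pkt *p = f->head;

		f->head = p->next;
		if (!f->head)
			f->tail = NULL;
		mem += p->cb.mem_usage;
		p->next = *to_free;
		*to_free = p;
		i++;
	}
	f->memory_usage -= mem;
	return mem;
}
```

The accounting stays consistent because the same cached value is used on
both the add and the subtract side, which is also what the patch does in
fq_codel_enqueue(), fq_codel_drop() and dequeue_func().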
* [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps
2016-06-22 6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
2016-06-22 6:16 ` [PATCH net-next 1/4] net_sched: drop packets after root qdisc lock is released Eric Dumazet
2016-06-22 6:16 ` [PATCH net-next 2/4] net_sched: fq_codel: cache skb->truesize into skb->cb Eric Dumazet
@ 2016-06-22 6:16 ` Eric Dumazet
2016-06-22 6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
` (2 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 6:16 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
Eric Dumazet

We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.

Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.

Tested:

$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
 Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
 rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
 lended: 9021770 borrowed: 0 giants: 0
 tokens: 9 ctokens: 9

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_htb.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index f3882259c385..ba098f2654b4 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -117,7 +117,6 @@ struct htb_class {
 	 * Written often fields
 	 */
 	struct gnet_stats_basic_packed bstats;
-	struct gnet_stats_queue	qstats;
 	struct tc_htb_xstats	xstats;	/* our special stats */
 
 	/* token bucket parameters */
@@ -140,6 +139,8 @@ struct htb_class {
 	enum htb_cmode		cmode;		/* current mode of the class */
 	struct rb_node		pq_node;	/* node for event queue */
 	struct rb_node		node[TC_HTB_NUMPRIO];	/* node for self or feed tree */
+
+	unsigned int drops ____cacheline_aligned_in_smp;
 };
 
 struct htb_level {
@@ -595,7 +596,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
-			cl->qstats.drops++;
+			cl->drops++;
 		}
 		return ret;
 	} else {
@@ -1110,17 +1111,22 @@ static int
 htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
 {
 	struct htb_class *cl = (struct htb_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+	};
 	__u32 qlen = 0;
 
-	if (!cl->level && cl->un.leaf.q)
+	if (!cl->level && cl->un.leaf.q) {
 		qlen = cl->un.leaf.q->q.qlen;
+		qs.backlog = cl->un.leaf.q->qstats.backlog;
+	}
 	cl->xstats.tokens = PSCHED_NS2TICKS(cl->tokens);
 	cl->xstats.ctokens = PSCHED_NS2TICKS(cl->ctokens);
 
 	if (gnet_stats_copy_basic(qdisc_root_sleeping_running(sch),
 				  d, NULL, &cl->bstats) < 0 ||
 	    gnet_stats_copy_rate_est(d, NULL, &cl->rate_est) < 0 ||
-	    gnet_stats_copy_queue(d, NULL, &cl->qstats, qlen) < 0)
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
 		return -1;
 
 	return gnet_stats_copy_app(d, &cl->xstats, sizeof(cl->xstats));
--
2.8.0.rc3.226.g39d4020
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH net-next 4/4] net_sched: generalize bulk dequeue
2016-06-22 6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
` (2 preceding siblings ...)
2016-06-22 6:16 ` [PATCH net-next 3/4] net_sched: sch_htb: export class backlog in dumps Eric Dumazet
@ 2016-06-22 6:16 ` Eric Dumazet
2016-06-22 15:03 ` Jesper Dangaard Brouer
2016-06-23 7:26 ` Paolo Abeni
2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
2016-06-25 16:20 ` David Miller
5 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 6:16 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, John Fastabend, Jesper Dangaard Brouer,
Eric Dumazet, Hannes Frederic Sowa, Florian Westphal,
Daniel Borkmann

When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a3463b "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.

With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).

For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...

Idea is to bulk-dequeue packets mapping to the same transmit queue.

This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :

1) NUMA node contention :   610,000 pps -> 1,110,000 pps
2) No node contention   : 1,380,000 pps -> 1,930,000 pps

Now we should work to add batches on the enqueue() side ;)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/net/sch_generic.h |  7 ++---
 net/sched/sch_generic.c   | 68 ++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 04e84c07c94f..909aff2db2b3 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -75,13 +75,14 @@ struct Qdisc {
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
-	struct Qdisc		*next_sched ____cacheline_aligned_in_smp;
-	struct sk_buff		*gso_skb;
-	unsigned long		state;
+	struct sk_buff		*gso_skb ____cacheline_aligned_in_smp;
 	struct sk_buff_head	q;
 	struct gnet_stats_basic_packed bstats;
 	seqcount_t		running;
 	struct gnet_stats_queue	qstats;
+	unsigned long		state;
+	struct Qdisc		*next_sched;
+	struct sk_buff		*skb_bad_txq;
 	struct rcu_head		rcu_head;
 	int			padded;
 	atomic_t		refcnt;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff86606954f2..e95b67cd5718 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -77,6 +77,34 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 	skb->next = NULL;
 }
 
+/* This variant of try_bulk_dequeue_skb() makes sure
+ * all skbs in the chain are for the same txq
+ */
+static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
+				      struct sk_buff *skb,
+				      int *packets)
+{
+	int mapping = skb_get_queue_mapping(skb);
+	struct sk_buff *nskb;
+	int cnt = 0;
+
+	do {
+		nskb = q->dequeue(q);
+		if (!nskb)
+			break;
+		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
+			q->skb_bad_txq = nskb;
+			qdisc_qstats_backlog_inc(q, nskb);
+			q->q.qlen++;
+			break;
+		}
+		skb->next = nskb;
+		skb = nskb;
+	} while (++cnt < 8);
+	(*packets) += cnt;
+	skb->next = NULL;
+}
+
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
@@ -87,8 +115,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	const struct netdev_queue *txq = q->dev_queue;
 
 	*packets = 1;
-	*validate = true;
 	if (unlikely(skb)) {
+		/* skb in gso_skb were already validated */
+		*validate = false;
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
@@ -97,15 +126,30 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			q->q.qlen--;
 		} else
 			skb = NULL;
-		/* skb in gso_skb were already validated */
-		*validate = false;
-	} else {
-		if (!(q->flags & TCQ_F_ONETXQUEUE) ||
-		    !netif_xmit_frozen_or_stopped(txq)) {
-			skb = q->dequeue(q);
-			if (skb && qdisc_may_bulk(q))
-				try_bulk_dequeue_skb(q, skb, txq, packets);
+		return skb;
+	}
+	*validate = true;
+	skb = q->skb_bad_txq;
+	if (unlikely(skb)) {
+		/* check the reason of requeuing without tx lock first */
+		txq = skb_get_tx_queue(txq->dev, skb);
+		if (!netif_xmit_frozen_or_stopped(txq)) {
+			q->skb_bad_txq = NULL;
+			qdisc_qstats_backlog_dec(q, skb);
+			q->q.qlen--;
+			goto bulk;
 		}
+		return NULL;
+	}
+	if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+	    !netif_xmit_frozen_or_stopped(txq))
+		skb = q->dequeue(q);
+	if (skb) {
+bulk:
+		if (qdisc_may_bulk(q))
+			try_bulk_dequeue_skb(q, skb, txq, packets);
+		else
+			try_bulk_dequeue_skb_slow(q, skb, packets);
 	}
 	return skb;
 }
@@ -624,11 +668,14 @@ void qdisc_reset(struct Qdisc *qdisc)
 	if (ops->reset)
 		ops->reset(qdisc);
 
+	kfree_skb(qdisc->skb_bad_txq);
+	qdisc->skb_bad_txq = NULL;
+
 	if (qdisc->gso_skb) {
 		kfree_skb_list(qdisc->gso_skb);
 		qdisc->gso_skb = NULL;
-		qdisc->q.qlen = 0;
 	}
+	qdisc->q.qlen = 0;
 }
 EXPORT_SYMBOL(qdisc_reset);
 
@@ -667,6 +714,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	dev_put(qdisc_dev(qdisc));
 
 	kfree_skb_list(qdisc->gso_skb);
+	kfree_skb(qdisc->skb_bad_txq);
 	/*
 	 * gen_estimator est_timer() might access qdisc->q.lock,
 	 * wait a RCU grace period before freeing qdisc.
--
2.8.0.rc3.226.g39d4020
^ permalink raw reply related [flat|nested] 15+ messages in thread
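[Editorial sketch, not part of the posted series.] The same-txq batching
in try_bulk_dequeue_skb_slow() can be modelled in userspace C. This is a
sketch under assumptions: toy_pkt, toy_qdisc, TOY_BULK_LIMIT and the
helpers are hypothetical names, a plain FIFO stands in for q->dequeue(),
and an integer `mapping` stands in for skb_get_queue_mapping(). The
byte-backlog accounting of the real helper is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BULK_LIMIT 8	/* same batch limit as the patch */

/* Hypothetical stand-in for struct sk_buff. */
struct toy_pkt {
	struct toy_pkt *next;
	int mapping;		/* models the transmit queue index */
};

/* Hypothetical stand-in for struct Qdisc, backed by a plain FIFO. */
struct toy_qdisc {
	struct toy_pkt *head, *tail;
	struct toy_pkt *bad_txq;	/* parked packet for another txq */
	unsigned int qlen;
};

static void toy_enqueue(struct toy_qdisc *q, struct toy_pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->qlen++;
}

static struct toy_pkt *toy_dequeue(struct toy_qdisc *q)
{
	struct toy_pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	if (!q->head)
		q->tail = NULL;
	p->next = NULL;
	q->qlen--;
	return p;
}

/* Chain further packets behind `p` while they map to the same txq.
 * The first packet that maps elsewhere is parked in q->bad_txq and
 * counted as queued again, so the next dequeue round can start from it.
 */
static void toy_bulk_dequeue_slow(struct toy_qdisc *q, struct toy_pkt *p,
				  int *packets)
{
	int mapping = p->mapping;
	struct toy_pkt *np;
	int cnt = 0;

	do {
		np = toy_dequeue(q);
		if (!np)
			break;
		if (np->mapping != mapping) {
			assert(!q->bad_txq);	/* slot is served first */
			q->bad_txq = np;
			q->qlen++;
			break;
		}
		p->next = np;
		p = np;
	} while (++cnt < TOY_BULK_LIMIT);
	*packets += cnt;
	p->next = NULL;
}
```

For a FIFO holding packets mapped to queues 0, 0, 0, 1, 0, dequeuing the
first packet and calling toy_bulk_dequeue_slow() yields a chain of three
packets for queue 0, leaves the queue-1 packet parked in bad_txq, and
leaves the last packet in the FIFO. Parking a single packet, instead of
pushing it back into the qdisc, avoids needing a requeue operation that
many qdiscs do not provide.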
* Re: [PATCH net-next 4/4] net_sched: generalize bulk dequeue
2016-06-22 6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
@ 2016-06-22 15:03 ` Jesper Dangaard Brouer
2016-06-23 7:26 ` Paolo Abeni
1 sibling, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet,
Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann, brouer

On Tue, 21 Jun 2016 23:16:52 -0700
Eric Dumazet <edumazet@google.com> wrote:

> When qdisc bulk dequeue was added in linux-3.18 (commit
> 5772e9a3463b "qdisc: bulk dequeue support for qdiscs
> with TCQ_F_ONETXQUEUE"), it was constrained to some
> specific qdiscs.

Thanks for extending this!

> With some extra care, we can extend this to all qdiscs,
> so that typical traffic shaping solutions can benefit from
> small batches (8 packets in this patch).

I'm fine with limiting this to 8 packets (xmit_more), as that seems to
be the minimum needed TX batch size (according to Luigi's original
netmap article[1] figure 7) for doing 10G wirespeed.

[1] http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf

> For example, HTB is often used on some multi queue device.
> And bonding/team are multi queue devices...
>
> Idea is to bulk-dequeue packets mapping to the same transmit queue.
>
> This brings between 35 and 80 % performance increase in HTB setup
> under pressure on a bonding setup :
>
> 1) NUMA node contention :   610,000 pps -> 1,110,000 pps
> 2) No node contention   : 1,380,000 pps -> 1,930,000 pps
>
> Now we should work to add batches on the enqueue() side ;)

Yes, please! :-))) That will be the next big step!

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

This is great stuff!

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/4] net_sched: generalize bulk dequeue 2016-06-22 6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet 2016-06-22 15:03 ` Jesper Dangaard Brouer @ 2016-06-23 7:26 ` Paolo Abeni 1 sibling, 0 replies; 15+ messages in thread From: Paolo Abeni @ 2016-06-23 7:26 UTC (permalink / raw) To: Eric Dumazet Cc: David S . Miller, netdev, John Fastabend, Jesper Dangaard Brouer, Eric Dumazet, Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann On Tue, 2016-06-21 at 23:16 -0700, Eric Dumazet wrote: > When qdisc bulk dequeue was added in linux-3.18 (commit > 5772e9a3463b "qdisc: bulk dequeue support for qdiscs > with TCQ_F_ONETXQUEUE"), it was constrained to some > specific qdiscs. > > With some extra care, we can extend this to all qdiscs, > so that typical traffic shaping solutions can benefit from > small batches (8 packets in this patch). > > For example, HTB is often used on some multi queue device. > And bonding/team are multi queue devices... > > Idea is to bulk-dequeue packets mapping to the same transmit queue. 
> > This brings between 35 and 80 % performance increase in HTB setup > under pressure on a bonding setup : > > 1) NUMA node contention : 610,000 pps -> 1,110,000 pps > 2) No node contention : 1,380,000 pps -> 1,930,000 pps > > Now we should work to add batches on the enqueue() side ;) > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Cc: John Fastabend <john.r.fastabend@intel.com> > Cc: Jesper Dangaard Brouer <brouer@redhat.com> > Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> > Cc: Florian Westphal <fw@strlen.de> > Cc: Daniel Borkmann <daniel@iogearbox.net> > --- > include/net/sch_generic.h | 7 ++--- > net/sched/sch_generic.c | 68 ++++++++++++++++++++++++++++++++++++++++------- > 2 files changed, 62 insertions(+), 13 deletions(-) > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > index 04e84c07c94f..909aff2db2b3 100644 > --- a/include/net/sch_generic.h > +++ b/include/net/sch_generic.h > @@ -75,13 +75,14 @@ struct Qdisc { > /* > * For performance sake on SMP, we put highly modified fields at the end > */ > - struct Qdisc *next_sched ____cacheline_aligned_in_smp; > - struct sk_buff *gso_skb; > - unsigned long state; > + struct sk_buff *gso_skb ____cacheline_aligned_in_smp; > struct sk_buff_head q; > struct gnet_stats_basic_packed bstats; > seqcount_t running; > struct gnet_stats_queue qstats; > + unsigned long state; > + struct Qdisc *next_sched; > + struct sk_buff *skb_bad_txq; > struct rcu_head rcu_head; > int padded; > atomic_t refcnt; > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index ff86606954f2..e95b67cd5718 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -77,6 +77,34 @@ static void try_bulk_dequeue_skb(struct Qdisc *q, > skb->next = NULL; > } > > +/* This variant of try_bulk_dequeue_skb() makes sure > + * all skbs in the chain are for the same txq > + */ > +static void try_bulk_dequeue_skb_slow(struct Qdisc *q, > + struct sk_buff *skb, > + int *packets) > +{ > + int 
mapping = skb_get_queue_mapping(skb); > + struct sk_buff *nskb; > + int cnt = 0; > + > + do { > + nskb = q->dequeue(q); > + if (!nskb) > + break; > + if (unlikely(skb_get_queue_mapping(nskb) != mapping)) { > + q->skb_bad_txq = nskb; > + qdisc_qstats_backlog_inc(q, nskb); > + q->q.qlen++; > + break; > + } > + skb->next = nskb; > + skb = nskb; > + } while (++cnt < 8); > + (*packets) += cnt; > + skb->next = NULL; > +} > + > /* Note that dequeue_skb can possibly return a SKB list (via skb->next). > * A requeued skb (via q->gso_skb) can also be a SKB list. > */ > @@ -87,8 +115,9 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, > const struct netdev_queue *txq = q->dev_queue; > > *packets = 1; > - *validate = true; > if (unlikely(skb)) { > + /* skb in gso_skb were already validated */ > + *validate = false; > /* check the reason of requeuing without tx lock first */ > txq = skb_get_tx_queue(txq->dev, skb); > if (!netif_xmit_frozen_or_stopped(txq)) { > @@ -97,15 +126,30 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, > q->q.qlen--; > } else > skb = NULL; > - /* skb in gso_skb were already validated */ > - *validate = false; > - } else { > - if (!(q->flags & TCQ_F_ONETXQUEUE) || > - !netif_xmit_frozen_or_stopped(txq)) { > - skb = q->dequeue(q); > - if (skb && qdisc_may_bulk(q)) > - try_bulk_dequeue_skb(q, skb, txq, packets); > + return skb; > + } > + *validate = true; > + skb = q->skb_bad_txq; > + if (unlikely(skb)) { > + /* check the reason of requeuing without tx lock first */ > + txq = skb_get_tx_queue(txq->dev, skb); > + if (!netif_xmit_frozen_or_stopped(txq)) { > + q->skb_bad_txq = NULL; > + qdisc_qstats_backlog_dec(q, skb); > + q->q.qlen--; > + goto bulk; > } > + return NULL; > + } > + if (!(q->flags & TCQ_F_ONETXQUEUE) || You can use qdisc_may_bulk() here, I guess. Not a functional change, just to improve readability. Paolo ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
` (3 preceding siblings ...)
2016-06-22 6:16 ` [PATCH net-next 4/4] net_sched: generalize bulk dequeue Eric Dumazet
@ 2016-06-22 14:47 ` Jesper Dangaard Brouer
2016-06-22 14:55 ` Eric Dumazet
2016-06-25 16:20 ` David Miller
5 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 14:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, John Fastabend, Eric Dumazet, brouer

On Tue, 21 Jun 2016 23:16:48 -0700
Eric Dumazet <edumazet@google.com> wrote:

> First patch adds an additional parameter to ->enqueue() qdisc method
> so that drops can be done outside of critical section
> (after locks are released).
>
> Then fq_codel can have a small optimization to reduce number of cache
> lines misses during a drop event
> (possibly accumulating hundreds of packets to be freed).
>
> A small htb change exports the backlog in class dumps.
>
> Final patch adds bulk dequeue to qdiscs that were lacking this feature.
>
> This series brings a nice qdisc performance increase (more than 80 %
> in some cases).

Thanks for working on this Eric! this is great work! :-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops 2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer @ 2016-06-22 14:55 ` Eric Dumazet 2016-06-22 15:44 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-06-22 14:55 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote: > On Tue, 21 Jun 2016 23:16:48 -0700 > Eric Dumazet <edumazet@google.com> wrote: > > > First patch adds an additional parameter to ->enqueue() qdisc method > > so that drops can be done outside of critical section > > (after locks are released). > > > > Then fq_codel can have a small optimization to reduce number of cache > > lines misses during a drop event > > (possibly accumulating hundreds of packets to be freed). > > > > A small htb change exports the backlog in class dumps. > > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature. > > > > This series brings a nice qdisc performance increase (more than 80 % > > in some cases). > > Thanks for working on this Eric! this is great work! :-) Thanks Jesper I worked yesterday on bulk enqueues, but initial results are not that great. 
Here is my current patch, on top of my last series :

(A second patch would remove the busylock spinlock, of course)

 include/net/sch_generic.h |    9 ++
 net/core/dev.c            |  135 ++++++++++++++++++++++++++++--------
 2 files changed, 115 insertions(+), 29 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 909aff2db2b3..1975a6fab10f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -87,7 +87,14 @@ struct Qdisc {
 	int			padded;
 	atomic_t		refcnt;
 
-	spinlock_t		busylock ____cacheline_aligned_in_smp;
+	spinlock_t		busylock;
+
+#ifdef CONFIG_SMP
+	struct pcpu_skb_context *prequeue0 ____cacheline_aligned_in_smp;
+#ifdef CONFIG_NUMA
+	struct pcpu_skb_context *prequeue1 ____cacheline_aligned_in_smp;
+#endif
+#endif
 };
 
 static inline bool qdisc_is_running(const struct Qdisc *qdisc)
diff --git a/net/core/dev.c b/net/core/dev.c
index aba10d2a8bc3..5f0d3fe5b109 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3065,26 +3065,117 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
 	}
 }
 
-static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
-				 struct net_device *dev,
-				 struct netdev_queue *txq)
+
+#ifdef CONFIG_SMP
+
+struct pcpu_skb_context {
+	struct pcpu_skb_context *next;
+	union {
+		struct sk_buff *skb;
+		unsigned long status;
+		struct pcpu_skb_context *self;
+	};
+};
+static DEFINE_PER_CPU_ALIGNED(struct pcpu_skb_context, pcpu_skb_context);
+
+/* Provides a small queue before qdisc so that we can batch ->enqueue()
+ * under SMP stress.
+ */
+static noinline int dev_xmit_skb_slow(struct sk_buff *skb, struct Qdisc *q,
+				      struct pcpu_skb_context **prequeue)
+{
+	struct pcpu_skb_context *prev, *myptr;
+	struct sk_buff *to_free = NULL;
+	spinlock_t *root_lock;
+	void *status;
+	int i, rc;
+
+	myptr = this_cpu_ptr(&pcpu_skb_context);
+	myptr->skb = skb;
+	myptr->next = NULL;
+
+	/* Take our ticket in prequeue file, a la MCS lock */
+	prev = xchg(prequeue, myptr);
+	if (prev) {
+		/* Ok, another cpu got the lock and serves the prequeue.
+		 * Wait that it either processed our skb or it exhausted
+		 * its budget and told us to process a batch ourself.
+		 */
+		WRITE_ONCE(prev->next, myptr);
+
+		while ((status = READ_ONCE(myptr->skb)) == skb)
+			cpu_relax_lowlatency();
+
+		/* Nice ! Our skb was handled by another cpu */
+		if ((unsigned long)status < NET_XMIT_MASK)
+			return (int)(unsigned long)status;
+
+		/* Oh well, we got the responsability of next batch */
+		BUG_ON(myptr != status);
+	}
+	root_lock = qdisc_lock(q);
+	spin_lock(root_lock);
+
+	for (i = 0; i < 16; i++) {
+		bool may_release = true;
+
+		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+			__qdisc_drop(skb, &to_free);
+			rc = NET_XMIT_DROP;
+		} else {
+			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		}
+		while (!(prev = READ_ONCE(myptr->next))) {
+			if (may_release) {
+				if (cmpxchg_release(prequeue, myptr, NULL) == myptr)
+					break;
+				may_release = false;
+			}
+			cpu_relax_lowlatency();
+		}
+		smp_store_release(&myptr->status, (unsigned long)rc);
+		myptr = prev;
+		if (!myptr)
+			break;
+		skb = READ_ONCE(myptr->skb);
+	}
+
+	qdisc_run(q);
+	spin_unlock(root_lock);
+
+	/* Give control to another cpu for following batch */
+	if (myptr)
+		smp_store_release(&myptr->self, myptr);
+
+	if (unlikely(to_free))
+		kfree_skb_list(to_free);
+
+	return (int)this_cpu_read(pcpu_skb_context.status);
+}
+#endif
+
+static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
+			  struct net_device *dev,
+			  struct netdev_queue *txq)
 {
 	spinlock_t *root_lock = qdisc_lock(q);
 	struct sk_buff *to_free = NULL;
-	bool contended;
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
-	/*
-	 * Heuristic to force contended enqueues to serialize on a
-	 * separate lock before trying to get qdisc main lock.
-	 * This permits qdisc->running owner to get the lock more
-	 * often and dequeue packets faster.
-	 */
-	contended = qdisc_is_running(q);
-	if (unlikely(contended))
-		spin_lock(&q->busylock);
 
+#ifdef CONFIG_SMP
+	{
+		struct pcpu_skb_context **prequeue = &q->prequeue0;
+
+#ifdef CONFIG_NUMA
+		if (numa_node_id() & 1)
+			prequeue = &q->prequeue1;
+#endif
+		if (unlikely(*prequeue || qdisc_is_running(q)))
+			return dev_xmit_skb_slow(skb, q, prequeue);
+	}
+#endif
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
@@ -3099,31 +3190,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 		qdisc_bstats_update(q, skb);
 
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
-		} else
+		else
 			qdisc_run_end(q);
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		if (qdisc_run_begin(q)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
-			__qdisc_run(q);
-		}
+		qdisc_run(q);
 	}
 	spin_unlock(root_lock);
 	if (unlikely(to_free))
 		kfree_skb_list(to_free);
-	if (unlikely(contended))
-		spin_unlock(&q->busylock);
 	return rc;
 }
 
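
The union in struct pcpu_skb_context lets one word carry three meanings in turn: the waiter's own skb pointer, a small enqueue status, or the context's own address (meaning "you serve the next batch"). A minimal user-space sketch of that decoding follows. It is an illustration under assumptions, not part of the patch: `struct ctx`, `enum slot_state` and `classify()` are hypothetical names, `uintptr_t` replaces the patch's `unsigned long`, and the `NET_XMIT_*` values are assumed to match the kernel headers of that period.

```c
#include <stdint.h>

/* Assumed values, mirroring the kernel's netdevice.h of that period. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01
#define NET_XMIT_CN		0x02
#define NET_XMIT_MASK		0x0f

/* Hypothetical user-space stand-in for struct pcpu_skb_context. */
struct ctx {
	struct ctx *next;
	union {
		void *skb;		/* still waiting: our own packet */
		uintptr_t status;	/* served: enqueue result, below NET_XMIT_MASK */
		struct ctx *self;	/* handed over: our own address */
	};
};

enum slot_state { SLOT_WAITING, SLOT_DONE, SLOT_OWNER };

/* Decode the shared word the way the waiter in dev_xmit_skb_slow() does. */
static enum slot_state classify(const struct ctx *c, const void *skb)
{
	if (c->skb == skb)
		return SLOT_WAITING;	/* nobody touched our slot yet */
	if (c->status < NET_XMIT_MASK)
		return SLOT_DONE;	/* a small integer can only be a status */
	return SLOT_OWNER;		/* anything else should be our own address */
}
```

The scheme relies on valid pointers never being smaller than NET_XMIT_MASK, which is why the patch can tell a status from a pointer without an extra flag.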
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 14:55 ` Eric Dumazet
@ 2016-06-22 15:44 ` Jesper Dangaard Brouer
2016-06-22 16:49 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-22 15:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend, brouer,
Luigi Rizzo

On Wed, 22 Jun 2016 07:55:43 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> > On Tue, 21 Jun 2016 23:16:48 -0700
> > Eric Dumazet <edumazet@google.com> wrote:
> >
> > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > so that drops can be done outside of critical section
> > > (after locks are released).
> > >
> > > Then fq_codel can have a small optimization to reduce number of cache
> > > lines misses during a drop event
> > > (possibly accumulating hundreds of packets to be freed).
> > >
> > > A small htb change exports the backlog in class dumps.
> > >
> > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > >
> > > This series brings a nice qdisc performance increase (more than 80 %
> > > in some cases).
> >
> > Thanks for working on this Eric! this is great work! :-)
>
> Thanks Jesper
>
> I worked yesterday on bulk enqueues, but initial results are not that
> great.

Hi Eric,

This is interesting work! But I think you should read Luigi Rizzo's
(Cc'ed) paper titled "A Fast and Practical Software Packet Scheduling
Architecture"[1]

[1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf

Luigi will be at Netfilter Workshop next week, and will actually
present on this topic/paper.... you two should talk ;-)

The article is not a 100% match for what we need, but there are some
good ideas.  The article also has a sort of "prequeue" that
enqueue'ing CPUs will place packets into.

My understanding of the article:

1. transmitters submit packets to an intermediate queue
   (replacing the q->enqueue call); lockless submit, as the queue is
   per CPU (runs in parallel)

2. like we only have _one_ qdisc dequeue process, this process (called
   arbiter) empties the intermediate queues, and then invokes q->enqueue()
   and q->dequeue(). (in a locked session/region)

3. Packets returned from q->dequeue() are placed on an outgoing
   intermediate queue.

4. the transmitter then looks to see if there are any packets to drain()
   from the outgoing queue.  This can run in parallel.

If the transmitter submitting a packet detects that no arbiter is
running, it can become the arbiter itself.  Like we do with
qdisc_run_begin() setting state __QDISC___STATE_RUNNING.

The problem with this scheme is that push-back from qdisc->enqueue
(NET_XMIT_CN) does not "reach" us.  The same goes for push-back in the
form of processes blocking on the qdisc root lock, but that could be
handled by either blocking in the article's submit() or returning some
congestion return code from submit().

(left patch intact below for Luigi to see)

> Here is my current patch, on top of my last series :
>
> (A second patch would remove the busylock spinlock, of course)
>
>  include/net/sch_generic.h |    9 ++
>  net/core/dev.c            |  135 ++++++++++++++++++++++++++++--------
>  2 files changed, 115 insertions(+), 29 deletions(-)
>
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 909aff2db2b3..1975a6fab10f 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -87,7 +87,14 @@ struct Qdisc {
>  	int			padded;
>  	atomic_t		refcnt;
>  
> -	spinlock_t		busylock ____cacheline_aligned_in_smp;
> +	spinlock_t		busylock;
> +
> +#ifdef CONFIG_SMP
> +	struct pcpu_skb_context *prequeue0 ____cacheline_aligned_in_smp;
> +#ifdef CONFIG_NUMA
> +	struct pcpu_skb_context *prequeue1 ____cacheline_aligned_in_smp;
> +#endif
> +#endif
>  };
>  
>  static inline bool qdisc_is_running(const struct Qdisc *qdisc)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index aba10d2a8bc3..5f0d3fe5b109 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3065,26 +3065,117 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
>  	}
>  }
>  
> -static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> -				 struct net_device *dev,
> -				 struct netdev_queue *txq)
> +
> +#ifdef CONFIG_SMP
> +
> +struct pcpu_skb_context {
> +	struct pcpu_skb_context *next;
> +	union {
> +		struct sk_buff *skb;
> +		unsigned long status;
> +		struct pcpu_skb_context *self;
> +	};
> +};
> +static DEFINE_PER_CPU_ALIGNED(struct pcpu_skb_context, pcpu_skb_context);
> +
> +/* Provides a small queue before qdisc so that we can batch ->enqueue()
> + * under SMP stress.
> + */
> +static noinline int dev_xmit_skb_slow(struct sk_buff *skb, struct Qdisc *q,
> +				      struct pcpu_skb_context **prequeue)
> +{
> +	struct pcpu_skb_context *prev, *myptr;
> +	struct sk_buff *to_free = NULL;
> +	spinlock_t *root_lock;
> +	void *status;
> +	int i, rc;
> +
> +	myptr = this_cpu_ptr(&pcpu_skb_context);
> +	myptr->skb = skb;
> +	myptr->next = NULL;
> +
> +	/* Take our ticket in prequeue file, a la MCS lock */
> +	prev = xchg(prequeue, myptr);
> +	if (prev) {
> +		/* Ok, another cpu got the lock and serves the prequeue.
> +		 * Wait that it either processed our skb or it exhausted
> +		 * its budget and told us to process a batch ourself.
> +		 */
> +		WRITE_ONCE(prev->next, myptr);
> +
> +		while ((status = READ_ONCE(myptr->skb)) == skb)
> +			cpu_relax_lowlatency();
> +
> +		/* Nice ! Our skb was handled by another cpu */
> +		if ((unsigned long)status < NET_XMIT_MASK)
> +			return (int)(unsigned long)status;
> +
> +		/* Oh well, we got the responsability of next batch */
> +		BUG_ON(myptr != status);
> +	}
> +	root_lock = qdisc_lock(q);
> +	spin_lock(root_lock);
> +
> +	for (i = 0; i < 16; i++) {
> +		bool may_release = true;
> +
> +		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> +			__qdisc_drop(skb, &to_free);
> +			rc = NET_XMIT_DROP;
> +		} else {
> +			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +		}
> +		while (!(prev = READ_ONCE(myptr->next))) {
> +			if (may_release) {
> +				if (cmpxchg_release(prequeue, myptr, NULL) == myptr)
> +					break;
> +				may_release = false;
> +			}
> +			cpu_relax_lowlatency();
> +		}
> +		smp_store_release(&myptr->status, (unsigned long)rc);
> +		myptr = prev;
> +		if (!myptr)
> +			break;
> +		skb = READ_ONCE(myptr->skb);
> +	}
> +
> +	qdisc_run(q);
> +	spin_unlock(root_lock);
> +
> +	/* Give control to another cpu for following batch */
> +	if (myptr)
> +		smp_store_release(&myptr->self, myptr);
> +
> +	if (unlikely(to_free))
> +		kfree_skb_list(to_free);
> +
> +	return (int)this_cpu_read(pcpu_skb_context.status);
> +}
> +#endif
> +
> +static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> +			  struct net_device *dev,
> +			  struct netdev_queue *txq)
>  {
>  	spinlock_t *root_lock = qdisc_lock(q);
>  	struct sk_buff *to_free = NULL;
> -	bool contended;
>  	int rc;
>  
>  	qdisc_calculate_pkt_len(skb, q);
> -	/*
> -	 * Heuristic to force contended enqueues to serialize on a
> -	 * separate lock before trying to get qdisc main lock.
> -	 * This permits qdisc->running owner to get the lock more
> -	 * often and dequeue packets faster.
> -	 */
> -	contended = qdisc_is_running(q);
> -	if (unlikely(contended))
> -		spin_lock(&q->busylock);
>  
> +#ifdef CONFIG_SMP
> +	{
> +		struct pcpu_skb_context **prequeue = &q->prequeue0;
> +
> +#ifdef CONFIG_NUMA
> +		if (numa_node_id() & 1)
> +			prequeue = &q->prequeue1;
> +#endif
> +		if (unlikely(*prequeue || qdisc_is_running(q)))
> +			return dev_xmit_skb_slow(skb, q, prequeue);
> +	}
> +#endif
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>  		__qdisc_drop(skb, &to_free);
> @@ -3099,31 +3190,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  
>  		qdisc_bstats_update(q, skb);
>  
> -		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> -			if (unlikely(contended)) {
> -				spin_unlock(&q->busylock);
> -				contended = false;
> -			}
> +		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>  			__qdisc_run(q);
> -		} else
> +		else
>  			qdisc_run_end(q);
>  
>  		rc = NET_XMIT_SUCCESS;
>  	} else {
>  		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -		if (qdisc_run_begin(q)) {
> -			if (unlikely(contended)) {
> -				spin_unlock(&q->busylock);
> -				contended = false;
> -			}
> -			__qdisc_run(q);
> -		}
> +		qdisc_run(q);
>  	}
>  	spin_unlock(root_lock);
>  	if (unlikely(to_free))
>  		kfree_skb_list(to_free);
> -	if (unlikely(contended))
> -		spin_unlock(&q->busylock);
>  	return rc;
>  }
>  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 15:44 ` Jesper Dangaard Brouer
@ 2016-06-22 16:49 ` Eric Dumazet
2016-06-23 14:22 ` Jesper Dangaard Brouer
2016-06-23 16:21 ` Luigi Rizzo
0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2016-06-22 16:49 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend, Luigi Rizzo

On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 22 Jun 2016 07:55:43 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 21 Jun 2016 23:16:48 -0700
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > > so that drops can be done outside of critical section
> > > > (after locks are released).
> > > >
> > > > Then fq_codel can have a small optimization to reduce number of cache
> > > > lines misses during a drop event
> > > > (possibly accumulating hundreds of packets to be freed).
> > > >
> > > > A small htb change exports the backlog in class dumps.
> > > >
> > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > > >
> > > > This series brings a nice qdisc performance increase (more than 80 %
> > > > in some cases).
> > >
> > > Thanks for working on this Eric! this is great work! :-)
> >
> > Thanks Jesper
> >
> > I worked yesterday on bulk enqueues, but initial results are not that
> > great.
>
> Hi Eric,
>
> This is interesting work! But I think you should read Luigi Rizzo's
> (Cc'ed) paper titled "A Fast and Practical Software Packet Scheduling
> Architecture"[1]
>
> [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
>
> Luigi will be at Netfilter Workshop next week, and will actually
> present on this topic/paper.... you two should talk ;-)
>
> The article is not a 100% match for what we need, but there are some
> good ideas.  The article also has a sort of "prequeue" that
> enqueue'ing CPUs will place packets into.
>
> My understanding of the article:
>
> 1. transmitters submit packets to an intermediate queue
>    (replacing the q->enqueue call); lockless submit, as the queue is
>    per CPU (runs in parallel)
>
> 2. like we only have _one_ qdisc dequeue process, this process (called
>    arbiter) empties the intermediate queues, and then invokes q->enqueue()
>    and q->dequeue(). (in a locked session/region)
>
> 3. Packets returned from q->dequeue() are placed on an outgoing
>    intermediate queue.
>
> 4. the transmitter then looks to see if there are any packets to drain()
>    from the outgoing queue.  This can run in parallel.
>
> If the transmitter submitting a packet detects that no arbiter is
> running, it can become the arbiter itself.  Like we do with
> qdisc_run_begin() setting state __QDISC___STATE_RUNNING.
>
> The problem with this scheme is that push-back from qdisc->enqueue
> (NET_XMIT_CN) does not "reach" us.  The same goes for push-back in the
> form of processes blocking on the qdisc root lock, but that could be
> handled by either blocking in the article's submit() or returning some
> congestion return code from submit().

Okay, I see that you are preparing for the upcoming conference in
Amsterdam, but please keep this thread about existing kernel code, not
code that might eventually reach a new operating system in 5 years ;)

1) We _want_ the result of the sends, obviously.

2) We also want back pressure, without adding complex callbacks and
ref-counting.

3) We do not want to burn a cpu per TX queue (at least one per NUMA
node ???) only to send a few packets per second.
Our model is still interrupt based, plus NAPI for interrupt mitigation.

4) I do not want to lock an innocent cpu to send packets from other
threads/cpu without a tight control.

In the patch I sent, I basically replaced a locked operation
(spin_lock(&q->busylock)) with another one (xchg()), but I did not add
yet another queue before the qdisc ones, bufferbloat forbids.

The virtual queue here is one packet per cpu, which basically is the
same as before this patch, since each cpu spinning on busylock has one
skb to send anyway.

This is basically a simple extension of MCS locks, where the cpu at the
head of the queue can queue up to 16 packets, instead of queueing its
own packet only and giving queue ownership to the following cpu.

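
The MCS-style batching described in this message can be sketched in user space with C11 atomics. This is a rough illustration under stated assumptions, not the kernel code: `struct node`, `struct prequeue`, `struct toy_qdisc`, `pq_init()`, `toy_enqueue()` and `submit()` are hypothetical names, an `atomic_flag` spinlock stands in for the qdisc root lock, every caller goes through the queue (the patch only does so under contention), and there is no qdisc_run(). Item pointers are assumed to be valid addresses (never below `STATUS_LIMIT`) and distinct from the caller's own node. Only the uncontended path has been exercised; the contended paths follow the patch's ordering by inspection.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH		16	/* same budget as the loop in the patch */
#define XMIT_SUCCESS	0
#define XMIT_DROP	1
#define STATUS_LIMIT	16	/* statuses stay below this, pointers do not */

struct node {
	_Atomic(struct node *) next;
	_Atomic(uintptr_t) word;	/* item pointer, then status or own address */
};

struct prequeue {
	_Atomic(struct node *) tail;
	atomic_flag lock;		/* stands in for the qdisc root lock */
};

struct toy_qdisc { int len, limit; };	/* hypothetical tail-drop queue */

static void pq_init(struct prequeue *pq)
{
	atomic_init(&pq->tail, NULL);
	atomic_flag_clear(&pq->lock);
}

static int toy_enqueue(struct toy_qdisc *q, void *item)
{
	(void)item;
	if (q->len >= q->limit)
		return XMIT_DROP;
	q->len++;
	return XMIT_SUCCESS;
}

/* Submit one item; "me" must stay private to the caller until we return. */
static int submit(struct prequeue *pq, struct node *me, void *item,
		  struct toy_qdisc *q)
{
	struct node *cur = me, *prev, *next;
	uintptr_t w;
	int i, rc;

	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->word, (uintptr_t)item, memory_order_relaxed);

	/* Take a ticket: a non-NULL previous tail means somebody is serving. */
	prev = atomic_exchange(&pq->tail, me);
	if (prev) {
		atomic_store_explicit(&prev->next, me, memory_order_release);
		while ((w = atomic_load_explicit(&me->word, memory_order_acquire)) ==
		       (uintptr_t)item)
			;	/* spin until served or handed the next batch */
		if (w < STATUS_LIMIT)
			return (int)w;	/* another thread enqueued for us */
		/* otherwise w is our own address: we serve the next batch */
	}

	while (atomic_flag_test_and_set_explicit(&pq->lock, memory_order_acquire))
		;
	for (i = 0; i < BATCH; i++) {
		int may_release = 1;

		rc = toy_enqueue(q, item);
		/* Find the successor, or detach the queue if we are the tail. */
		while (!(next = atomic_load_explicit(&cur->next, memory_order_acquire))) {
			struct node *expected = cur;

			if (may_release &&
			    atomic_compare_exchange_strong(&pq->tail, &expected, NULL))
				break;
			may_release = 0;	/* a newcomer is still linking itself */
		}
		/* Publish the result; a remote owner may reuse its node now. */
		atomic_store_explicit(&cur->word, (uintptr_t)rc, memory_order_release);
		cur = next;
		if (!cur)
			break;
		item = (void *)atomic_load_explicit(&cur->word, memory_order_relaxed);
	}
	atomic_flag_clear_explicit(&pq->lock, memory_order_release);

	/* Budget exhausted: hand the next batch to the first unserved waiter. */
	if (cur)
		atomic_store_explicit(&cur->word, (uintptr_t)cur, memory_order_release);

	return (int)atomic_load_explicit(&me->word, memory_order_relaxed);
}
```

As in the patch, the server looks up the successor before publishing a status, because a waiter may reuse its node as soon as it sees the result.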
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 16:49 ` Eric Dumazet
@ 2016-06-23 14:22 ` Jesper Dangaard Brouer
2016-06-23 16:21 ` Luigi Rizzo
1 sibling, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2016-06-23 14:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S . Miller, netdev, John Fastabend,
Luigi Rizzo, brouer

On Wed, 22 Jun 2016 09:49:48 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 22 Jun 2016 07:55:43 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
> > > > On Tue, 21 Jun 2016 23:16:48 -0700
> > > > Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > > First patch adds an additional parameter to ->enqueue() qdisc method
> > > > > so that drops can be done outside of critical section
> > > > > (after locks are released).
> > > > >
> > > > > Then fq_codel can have a small optimization to reduce number of cache
> > > > > lines misses during a drop event
> > > > > (possibly accumulating hundreds of packets to be freed).
> > > > >
> > > > > A small htb change exports the backlog in class dumps.
> > > > >
> > > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
> > > > >
> > > > > This series brings a nice qdisc performance increase (more than 80 %
> > > > > in some cases).
> > > >
> > > > Thanks for working on this Eric! this is great work! :-)
> > >
> > > Thanks Jesper
> > >
> > > I worked yesterday on bulk enqueues, but initial results are not that
> > > great.
> >
> > Hi Eric,
> >
> > This is interesting work! But I think you should read Luigi Rizzo's
> > (Cc'ed) paper titled "A Fast and Practical Software Packet Scheduling
> > Architecture"[1]
> >
> > [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
> >
> > Luigi will be at Netfilter Workshop next week, and will actually
> > present on this topic/paper.... you two should talk ;-)
> >
> > The article is not a 100% match for what we need, but there are some
> > good ideas.  The article also has a sort of "prequeue" that
> > enqueue'ing CPUs will place packets into.
> >
> > My understanding of the article:
> >
> > 1. transmitters submit packets to an intermediate queue
> >    (replacing the q->enqueue call); lockless submit, as the queue is
> >    per CPU (runs in parallel)
> >
> > 2. like we only have _one_ qdisc dequeue process, this process (called
> >    arbiter) empties the intermediate queues, and then invokes q->enqueue()
> >    and q->dequeue(). (in a locked session/region)
> >
> > 3. Packets returned from q->dequeue() are placed on an outgoing
> >    intermediate queue.
> >
> > 4. the transmitter then looks to see if there are any packets to drain()
> >    from the outgoing queue.  This can run in parallel.
> >
> > If the transmitter submitting a packet detects that no arbiter is
> > running, it can become the arbiter itself.  Like we do with
> > qdisc_run_begin() setting state __QDISC___STATE_RUNNING.
> >
> > The problem with this scheme is that push-back from qdisc->enqueue
> > (NET_XMIT_CN) does not "reach" us.  The same goes for push-back in the
> > form of processes blocking on the qdisc root lock, but that could be
> > handled by either blocking in the article's submit() or returning some
> > congestion return code from submit().
>
> Okay, I see that you are preparing for the upcoming conference in
> Amsterdam, but please keep this thread about existing kernel code, not
> code that might eventually reach a new operating system in 5 years ;)
>
> 1) We _want_ the result of the sends, obviously.

How dependent are we on the return codes?

E.g. the NET_XMIT_CN return is not that accurate, it does not mean this
packet was dropped, it could be from an unrelated flow.

> 2) We also want back pressure, without adding complex callbacks and
> ref-counting.
>
> 3) We do not want to burn a cpu per TX queue (at least one per NUMA
> node ???) only to send a few packets per second.
> Our model is still interrupt based, plus NAPI for interrupt mitigation.
>
> 4) I do not want to lock an innocent cpu to send packets from other
> threads/cpu without a tight control.

The article presents two modes: 1) a dedicated CPU runs the "arbiter",
2) the submitting CPU becomes the arbiter (iff no other CPU is the
arbiter).

I imagine we use mode 2.  Which is almost what we already do now.
The qdisc layer only allows a single CPU to be dequeue'ing packets.  This
process can be seen as the "arbiter".  The only difference is that it
will pick up packets from an intermediate queue, and invoke q->enqueue().
(Still keeping the quota in __qdisc_run()).

> In the patch I sent, I basically replaced a locked operation
> (spin_lock(&q->busylock)) with another one (xchg()), but I did not add
> yet another queue before the qdisc ones, bufferbloat forbids.

Is it really bufferbloat to introduce an intermediate queue at this
point?  The enqueue/submit process can see that qdisc_is_running, thus
it knows these packets will be picked up very shortly (within 200
cycles) and the "arbiter" will invoke q->enqueue(), allowing the qdisc
to react to bufferbloat.

> The virtual queue here is one packet per cpu, which basically is the
> same as before this patch, since each cpu spinning on busylock has one
> skb to send anyway.
>
> This is basically a simple extension of MCS locks, where the cpu at the
> head of the queue can queue up to 16 packets, instead of queueing its
> own packet only and giving queue ownership to the following cpu.

I do like MCS locks.  You sort of open-coded it.  I am impressed by the
code, but it really takes some time to read and understand (not
necessarily a bad thing).  I am impressed by how you get the return code
back (from the remote sender).  It was a problem I've been struggling to
solve but I couldn't.

Thanks for working on this Eric!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

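
"Mode 2" above (the submitting CPU becomes the arbiter only if nobody else is) has the same shape as a try-acquire on a running flag plus a dequeue quota. The following is a small sketch of that election under assumptions: `struct toy_sched`, `toy_init()`, `run_begin()`, `run_end()` and `try_run()` are hypothetical user-space names, not the kernel's qdisc_run_begin()/__qdisc_run(), and the backlog counter is left unprotected here, whereas the kernel would cover it with the qdisc root lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical scheduler state: one "running" flag and a packet counter. */
struct toy_sched {
	atomic_flag running;	/* set while some CPU acts as the arbiter */
	int backlog;		/* packets waiting; locking omitted in this sketch */
	int sent;
};

static void toy_init(struct toy_sched *s, int backlog)
{
	atomic_flag_clear(&s->running);
	s->backlog = backlog;
	s->sent = 0;
}

/* Returns true if the caller just became the arbiter. */
static bool run_begin(struct toy_sched *s)
{
	return !atomic_flag_test_and_set_explicit(&s->running,
						  memory_order_acquire);
}

static void run_end(struct toy_sched *s)
{
	atomic_flag_clear_explicit(&s->running, memory_order_release);
}

/* Dequeue up to "quota" packets if nobody else is the arbiter.
 * Returns the number of packets sent, or -1 if another arbiter is active.
 */
static int try_run(struct toy_sched *s, int quota)
{
	int n = 0;

	if (!run_begin(s))
		return -1;
	while (s->backlog > 0 && n < quota) {
		s->backlog--;
		s->sent++;
		n++;
	}
	run_end(s);
	return n;
}
```

The quota keeps one caller from serving other senders' packets indefinitely, which is the "tight control" concern raised in point 4 of the quoted message.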
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 16:49 ` Eric Dumazet
2016-06-23 14:22 ` Jesper Dangaard Brouer
@ 2016-06-23 16:21 ` Luigi Rizzo
1 sibling, 0 replies; 15+ messages in thread
From: Luigi Rizzo @ 2016-06-23 16:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
John Fastabend

On Wed, Jun 22, 2016 at 6:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-06-22 at 17:44 +0200, Jesper Dangaard Brouer wrote:
>> On Wed, 22 Jun 2016 07:55:43 -0700
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Wed, 2016-06-22 at 16:47 +0200, Jesper Dangaard Brouer wrote:
>> > > On Tue, 21 Jun 2016 23:16:48 -0700
>> > > Eric Dumazet <edumazet@google.com> wrote:
>> > >
>> > > > First patch adds an additional parameter to ->enqueue() qdisc method
>> > > > so that drops can be done outside of critical section
>> > > > (after locks are released).
>> > > >
>> > > > Then fq_codel can have a small optimization to reduce number of cache
>> > > > lines misses during a drop event
>> > > > (possibly accumulating hundreds of packets to be freed).
>> > > >
>> > > > A small htb change exports the backlog in class dumps.
>> > > >
>> > > > Final patch adds bulk dequeue to qdiscs that were lacking this feature.
>> > > >
>> > > > This series brings a nice qdisc performance increase (more than 80 %
>> > > > in some cases).
>> > >
>> > > Thanks for working on this Eric! this is great work! :-)
>> >
>> > Thanks Jesper
>> >
>> > I worked yesterday on bulk enqueues, but initial results are not that
>> > great.
>>
>> Hi Eric,
>>
>> This is interesting work! But I think you should read Luigi Rizzo's
>> (Cc'ed) paper titled "A Fast and Practical Software Packet Scheduling
>> Architecture"[1]
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20160511-mysched-preprint.pdf
>>
>> Luigi will be at Netfilter Workshop next week, and will actually
>> present on this topic/paper.... you two should talk ;-)
>>
>> The article is not a 100% match for what we need, but there are some
>> good ideas.  The article also has a sort of "prequeue" that
>> enqueue'ing CPUs will place packets into.
>>
>> My understanding of the article:
>>
>> 1. transmitters submit packets to an intermediate queue
>>    (replacing the q->enqueue call); lockless submit, as the queue is
>>    per CPU (runs in parallel)
>>
>> 2. like we only have _one_ qdisc dequeue process, this process (called
>>    arbiter) empties the intermediate queues, and then invokes q->enqueue()
>>    and q->dequeue(). (in a locked session/region)
>>
>> 3. Packets returned from q->dequeue() are placed on an outgoing
>>    intermediate queue.
>>
>> 4. the transmitter then looks to see if there are any packets to drain()
>>    from the outgoing queue.  This can run in parallel.
>>
>> If the transmitter submitting a packet detects that no arbiter is
>> running, it can become the arbiter itself.  Like we do with
>> qdisc_run_begin() setting state __QDISC___STATE_RUNNING.
>>
>> The problem with this scheme is that push-back from qdisc->enqueue
>> (NET_XMIT_CN) does not "reach" us.  The same goes for push-back in the
>> form of processes blocking on the qdisc root lock, but that could be
>> handled by either blocking in the article's submit() or returning some
>> congestion return code from submit().
>
> Okay, I see that you are preparing for the upcoming conference in
> Amsterdam, but please keep this thread about existing kernel code, not
> code that might eventually reach a new operating system in 5 years ;)
>
> 1) We _want_ the result of the sends, obviously.
>
> 2) We also want back pressure, without adding complex callbacks and
> ref-counting.
>
> 3) We do not want to burn a cpu per TX queue (at least one per NUMA
> node ???) only to send a few packets per second.
> Our model is still interrupt based, plus NAPI for interrupt mitigation.
>
> 4) I do not want to lock an innocent cpu to send packets from other
> threads/cpu without a tight control.
>
> In the patch I sent, I basically replaced a locked operation
> (spin_lock(&q->busylock)) with another one (xchg()), but I did not add
> yet another queue before the qdisc ones, bufferbloat forbids.
>
> The virtual queue here is one packet per cpu, which basically is the
> same as before this patch, since each cpu spinning on busylock has one
> skb to send anyway.
>
> This is basically a simple extension of MCS locks, where the cpu at the
> head of the queue can queue up to 16 packets, instead of queueing its
> own packet only and giving queue ownership to the following cpu.

Hi Eric (and others),

don't worry, my proposal (PSPAT) is not specifically addressing/targeting
the linux qdisc now, but at the same time it does not have any of the
faults you are worried about.

My target, at a high level, is a VM hosting node where the guest VMs
may create large amounts of traffic, maybe most of it doomed to be dropped,
but still consuming their own and the system's resources by creating the
packets and pounding on the xmit calls.

The goal of PSPAT is to let those clients know very early (possibly even
before doing lookups or encapsulation) when the underlying path
to the NIC will be essentially free for transmission, at which
point the sender can complete building the packet and push it out.

To comment on your observations, PSPAT has the following features:

1) it does return the result of the send, which is run by the individual
   thread who submitted the packet when it is granted permission to
   transmit (this also covers your point #4).
   Note that by the time you can do a send, the underlying path
   is free so it should not fail unless the link goes down.
   Conversely, if the user-visible queue is full or almost full,
   you'll get a drop or congestion notification even before talking
   to the arbiter.

2) it does support backpressure, because said thread has direct access
   to the queue's state when it submits the request.
   The queue visible to the thread is the same per-flow queue used
   by the scheduler, so it does not create additional queueing,
   and the qdisc->enqueue() will never fail (see #A below for more
   details on queues)

3) it does not burn a core per tx-queue to run the arbiter.
   The thread that in PSPAT runs the arbiter is essentially the NAPI thread,
   and under normal conditions it is asleep.
   It wakes up only on tx-complete interrupts (or timers)
   and does even less than the NAPI thread, as it does not even
   have to (though it can) run the device_xmit code.
   The only case in which the arbiter runs "continuously" (with many
   options to throttle it) is when the load is so high that
   even the NAPI thread would have to run continuously.

A) additional remarks on queues:
   the model of operation of PSPAT is that each flow (however you define it)
   has its own queue, and the scheduler handles the collection of queues
   with some discipline (round robin, fair queueing, priorities...).
   When the arbiter marks a packet as "good to send", the underlying
   path to the NIC is virtually free for you to use,
   so you can expect zero extra delay (other than your own processing).
   Individual queue management policies (codel, RED, ECN, whatever looks
   at the individual queue) can be done locally.

   If you want a single global queue where everybody can write to
   without individual limitations (e.g. a bounded size FIFO with K slots),
   keep in mind that this model gives no service guarantees other than
   "if you manage to get in, you'll never wait more than K packets".
   But, you may never get in so your delay may be infinite
   (as it is the case for me now on the beach with flaky cell coverage).

   The obvious way to implement this FIFO service
   is no scheduler, one TX ring with K slots and let clients contend on it.
   If you expect to use M tx rings and still want the 'K packets delay'
   guarantee, then PSPAT may still help because you can create
   a queue with one or a few slots per client, and when the arbiter
   runs (which is more frequently than K packets' time) it may
   tell you "you were lucky" or "sorry no room for your packets, drop them".

Finally:

The paper has a ton more details, but it targets an academic audience
which is generally not very interested in implementation issues.
This is why (besides lack of space and some ignorance on my side)
I tried to avoid suggesting how PSPAT could be implemented in Linux or
FreeBSD or userspace software routers or other architectures.

I am happy to hear comments, criticism and suggestions and discuss details
with you or other interested people, however I would still suggest to
first have a look at the paper (perhaps skipping the boring section with
formal proofs of the service guarantees) because the description that we
can give on this mailing list is necessarily incomplete due to space
limitations.

Hope to see many of you next week in Amsterdam.

cheers
luigi

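
The "bounded size FIFO with K slots" remark can be made concrete with a tiny ring buffer: whoever gets in has at most K-1 packets ahead of it, and a full ring means an immediate drop. This is only a single-threaded sketch of that service model, with hypothetical names (`struct fifo`, `fifo_put()`, `fifo_get()`) and an arbitrary K of 4; it does not model PSPAT or a real TX ring.

```c
#include <stddef.h>

#define K 4	/* number of slots; arbitrary value for the sketch */

struct fifo {
	void *slot[K];
	size_t head;	/* index of the oldest packet */
	size_t count;	/* packets currently queued */
};

/* Returns how many packets are ahead of the new one (0..K-1),
 * or -1 when the ring is full and the packet must be dropped.
 */
static int fifo_put(struct fifo *f, void *pkt)
{
	if (f->count == K)
		return -1;
	f->slot[(f->head + f->count) % K] = pkt;
	return (int)f->count++;
}

/* Returns the oldest packet, or NULL when the ring is empty. */
static void *fifo_get(struct fifo *f)
{
	void *pkt;

	if (f->count == 0)
		return NULL;
	pkt = f->slot[f->head];
	f->head = (f->head + 1) % K;
	f->count--;
	return pkt;
}
```

The return value of `fifo_put()` is the bound from the message: an accepted packet never waits behind more than K-1 others, while a sender facing a persistently full ring may be refused forever.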
* Re: [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops
2016-06-22 6:16 [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Eric Dumazet
` (4 preceding siblings ...)
2016-06-22 14:47 ` [PATCH net-next 0/4] net_sched: bulk dequeue and deferred drops Jesper Dangaard Brouer
@ 2016-06-25 16:20 ` David Miller
5 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2016-06-25 16:20 UTC (permalink / raw)
To: edumazet; +Cc: netdev, john.r.fastabend, brouer, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 21 Jun 2016 23:16:48 -0700

> First patch adds an additional parameter to ->enqueue() qdisc method
> so that drops can be done outside of critical section
> (after locks are released).
>
> Then fq_codel can have a small optimization to reduce number of cache
> lines misses during a drop event
> (possibly accumulating hundreds of packets to be freed).
>
> A small htb change exports the backlog in class dumps.
>
> Final patch adds bulk dequeue to qdiscs that were lacking this feature.
>
> This series brings a nice qdisc performance increase (more than 80 %
> in some cases).

Series applied, thanks Eric.