* [PATCH v2 net-next 01/14] net_sched: make room for (struct qdisc_skb_cb)->pkt_segs
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 02/14] net: init shinfo->gso_segs from qdisc_pkt_len_init() Eric Dumazet
` (14 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Add a new u16 field next to pkt_len: pkt_segs.
It will cache shinfo->gso_segs to speed up qdisc dequeue().
Move slave_dev_queue_mapping to the end of qdisc_skb_cb,
and move three bits from tc_skb_cb:
- post_ct
- post_ct_snat
- post_ct_dnat
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 18 +++++++++---------
net/core/dev.c | 2 +-
net/sched/act_ct.c | 8 ++++----
net/sched/cls_api.c | 6 +++---
net/sched/cls_flower.c | 2 +-
5 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 94966692ccdf51db085c236319705aecba8c30cf..9cd8b5d4b23698fd8959ef40c303468e31c1d4af 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -429,13 +429,16 @@ struct tcf_proto {
};
struct qdisc_skb_cb {
- struct {
- unsigned int pkt_len;
- u16 slave_dev_queue_mapping;
- u16 tc_classid;
- };
+ unsigned int pkt_len;
+ u16 pkt_segs;
+ u16 tc_classid;
#define QDISC_CB_PRIV_LEN 20
unsigned char data[QDISC_CB_PRIV_LEN];
+
+ u16 slave_dev_queue_mapping;
+ u8 post_ct:1;
+ u8 post_ct_snat:1;
+ u8 post_ct_dnat:1;
};
typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv);
@@ -1064,11 +1067,8 @@ struct tc_skb_cb {
struct qdisc_skb_cb qdisc_cb;
u32 drop_reason;
- u16 zone; /* Only valid if post_ct = true */
+ u16 zone; /* Only valid if qdisc_skb_cb(skb)->post_ct = true */
u16 mru;
- u8 post_ct:1;
- u8 post_ct_snat:1;
- u8 post_ct_dnat:1;
};
static inline struct tc_skb_cb *tc_skb_cb(const struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 69515edd17bc6a157046f31b3dd343a59ae192ab..46ce6c6107805132b1322128e86634eca91e3340 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4355,7 +4355,7 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
return ret;
tc_skb_cb(skb)->mru = 0;
- tc_skb_cb(skb)->post_ct = false;
+ qdisc_skb_cb(skb)->post_ct = false;
tcf_set_drop_reason(skb, *drop_reason);
mini_qdisc_bstats_cpu_update(miniq, skb);
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 6749a4a9a9cd0a43897fcd20d228721ce057cb88..2b6ac7069dc168da2c534bddc5d4398e5e7a18c4 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -948,9 +948,9 @@ static int tcf_ct_act_nat(struct sk_buff *skb,
return err & NF_VERDICT_MASK;
if (action & BIT(NF_NAT_MANIP_SRC))
- tc_skb_cb(skb)->post_ct_snat = 1;
+ qdisc_skb_cb(skb)->post_ct_snat = 1;
if (action & BIT(NF_NAT_MANIP_DST))
- tc_skb_cb(skb)->post_ct_dnat = 1;
+ qdisc_skb_cb(skb)->post_ct_dnat = 1;
return err;
#else
@@ -986,7 +986,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
tcf_action_update_bstats(&c->common, skb);
if (clear) {
- tc_skb_cb(skb)->post_ct = false;
+ qdisc_skb_cb(skb)->post_ct = false;
ct = nf_ct_get(skb, &ctinfo);
if (ct) {
nf_ct_put(ct);
@@ -1097,7 +1097,7 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
out_push:
skb_push_rcsum(skb, nh_ofs);
- tc_skb_cb(skb)->post_ct = true;
+ qdisc_skb_cb(skb)->post_ct = true;
tc_skb_cb(skb)->zone = p->zone;
out_clear:
if (defrag)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index f751cd5eeac8d72b4c4d138f45d25a8ba62fb1bd..ebca4b926dcf76daa3abb8ffe221503e33de30e3 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1872,9 +1872,9 @@ int tcf_classify(struct sk_buff *skb,
}
ext->chain = last_executed_chain;
ext->mru = cb->mru;
- ext->post_ct = cb->post_ct;
- ext->post_ct_snat = cb->post_ct_snat;
- ext->post_ct_dnat = cb->post_ct_dnat;
+ ext->post_ct = qdisc_skb_cb(skb)->post_ct;
+ ext->post_ct_snat = qdisc_skb_cb(skb)->post_ct_snat;
+ ext->post_ct_dnat = qdisc_skb_cb(skb)->post_ct_dnat;
ext->zone = cb->zone;
}
}
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 099ff6a3e1f516a50cfac578666f6d5f4fbe8f29..7669371c1354c27ede83c2c83aaea5c0402e6552 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -326,7 +326,7 @@ TC_INDIRECT_SCOPE int fl_classify(struct sk_buff *skb,
struct tcf_result *res)
{
struct cls_fl_head *head = rcu_dereference_bh(tp->root);
- bool post_ct = tc_skb_cb(skb)->post_ct;
+ bool post_ct = qdisc_skb_cb(skb)->post_ct;
u16 zone = tc_skb_cb(skb)->zone;
struct fl_flow_key skb_key;
struct fl_flow_mask *mask;
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 02/14] net: init shinfo->gso_segs from qdisc_pkt_len_init()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 01/14] net_sched: make room for (struct qdisc_skb_cb)->pkt_segs Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 03/14] net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init() Eric Dumazet
` (13 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Qdiscs use shinfo->gso_segs for their packet stats in bstats_update(),
but this field needs to be initialized for SKB_GSO_DODGY users.
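For context, here is the consumer that motivates this init: bstats_update()
(from include/net/sch_generic.h, before patch 05/14 converts it to use
qdisc_pkt_segs()) reads gso_segs directly, so a SKB_GSO_DODGY packet with
an uninitialized gso_segs would feed garbage into the qdisc stats:

    static inline void bstats_update(struct gnet_stats_basic_sync *bstats,
                                     const struct sk_buff *skb)
    {
        _bstats_update(bstats,
                       qdisc_pkt_len(skb),
                       skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1);
    }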
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 46ce6c6107805132b1322128e86634eca91e3340..dba9eef8bd83dda89b5edd870b47373722264f48 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4071,7 +4071,7 @@ EXPORT_SYMBOL_GPL(validate_xmit_skb_list);
static void qdisc_pkt_len_init(struct sk_buff *skb)
{
- const struct skb_shared_info *shinfo = skb_shinfo(skb);
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
qdisc_skb_cb(skb)->pkt_len = skb->len;
@@ -4112,6 +4112,7 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
if (payload <= 0)
return;
gso_segs = DIV_ROUND_UP(payload, shinfo->gso_size);
+ shinfo->gso_segs = gso_segs;
}
qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len;
}
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 03/14] net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 01/14] net_sched: make room for (struct qdisc_skb_cb)->pkt_segs Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 02/14] net: init shinfo->gso_segs from qdisc_pkt_len_init() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 04/14] net: use qdisc_pkt_len_segs_init() in sch_handle_ingress() Eric Dumazet
` (12 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
qdisc_pkt_len_init() currently initializes qdisc_skb_cb(skb)->pkt_len.
Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function
to qdisc_pkt_len_segs_init().
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 15 +++++++++++----
net/sched/sch_cake.c | 2 +-
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index dba9eef8bd83dda89b5edd870b47373722264f48..895c3e37e686f0f625bd5eec7079a43cbd33a7eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4069,17 +4069,23 @@ struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *d
}
EXPORT_SYMBOL_GPL(validate_xmit_skb_list);
-static void qdisc_pkt_len_init(struct sk_buff *skb)
+static void qdisc_pkt_len_segs_init(struct sk_buff *skb)
{
struct skb_shared_info *shinfo = skb_shinfo(skb);
+ u16 gso_segs;
qdisc_skb_cb(skb)->pkt_len = skb->len;
+ if (!shinfo->gso_size) {
+ qdisc_skb_cb(skb)->pkt_segs = 1;
+ return;
+ }
+
+ qdisc_skb_cb(skb)->pkt_segs = gso_segs = shinfo->gso_segs;
/* To get more precise estimation of bytes sent on wire,
* we add to pkt_len the headers size of all segments
*/
- if (shinfo->gso_size && skb_transport_header_was_set(skb)) {
- u16 gso_segs = shinfo->gso_segs;
+ if (skb_transport_header_was_set(skb)) {
unsigned int hdr_len;
/* mac layer + network layer */
@@ -4113,6 +4119,7 @@ static void qdisc_pkt_len_init(struct sk_buff *skb)
return;
gso_segs = DIV_ROUND_UP(payload, shinfo->gso_size);
shinfo->gso_segs = gso_segs;
+ qdisc_skb_cb(skb)->pkt_segs = gso_segs;
}
qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len;
}
@@ -4738,7 +4745,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
skb_update_prio(skb);
- qdisc_pkt_len_init(skb);
+ qdisc_pkt_len_segs_init(skb);
tcx_set_ingress(skb, false);
#ifdef CONFIG_NET_EGRESS
if (static_branch_unlikely(&egress_needed_key)) {
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 32bacfc314c260dccf94178d309ccb2be22d69e4..9213129f0de10bc67ce418f77c36fed2581f3781 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1406,7 +1406,7 @@ static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
if (!shinfo->gso_size)
return cake_calc_overhead(q, len, off);
- /* borrowed from qdisc_pkt_len_init() */
+ /* borrowed from qdisc_pkt_len_segs_init() */
if (!skb->encapsulation)
hdr_len = skb_transport_offset(skb);
else
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 04/14] net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (2 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 03/14] net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 05/14] net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update() Eric Dumazet
` (11 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
sch_handle_ingress() sets qdisc_skb_cb(skb)->pkt_len.
We also need to initialize qdisc_skb_cb(skb)->pkt_segs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 895c3e37e686f0f625bd5eec7079a43cbd33a7eb..e19eb4e9d77c27535ab2a0ce14299281e3ef9397 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4434,7 +4434,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
*pt_prev = NULL;
}
- qdisc_skb_cb(skb)->pkt_len = skb->len;
+ qdisc_pkt_len_segs_init(skb);
tcx_set_ingress(skb, true);
if (static_branch_unlikely(&tcx_needed_key)) {
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 05/14] net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (3 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 04/14] net: use qdisc_pkt_len_segs_init() in sch_handle_ingress() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 06/14] net_sched: cake: use qdisc_pkt_segs() Eric Dumazet
` (10 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Avoid up to two cache line misses in qdisc dequeue() to fetch
skb_shinfo(skb)->gso_segs/gso_size while the qdisc spinlock is held.
This gives a 5% improvement in a TX-intensive workload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 13 ++++++++++---
net/sched/sch_cake.c | 1 +
net/sched/sch_dualpi2.c | 1 +
net/sched/sch_netem.c | 1 +
net/sched/sch_qfq.c | 2 +-
net/sched/sch_taprio.c | 1 +
net/sched/sch_tbf.c | 1 +
7 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9cd8b5d4b23698fd8959ef40c303468e31c1d4af..cdf7a58ebcf5ef2b5f8b76eb6fbe92d5f0e07899 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -829,6 +829,15 @@ static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb)
return qdisc_skb_cb(skb)->pkt_len;
}
+static inline unsigned int qdisc_pkt_segs(const struct sk_buff *skb)
+{
+ u32 pkt_segs = qdisc_skb_cb(skb)->pkt_segs;
+
+ DEBUG_NET_WARN_ON_ONCE(pkt_segs !=
+ (skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1));
+ return pkt_segs;
+}
+
/* additional qdisc xmit flags (NET_XMIT_MASK in linux/netdevice.h) */
enum net_xmit_qdisc_t {
__NET_XMIT_STOLEN = 0x00010000,
@@ -870,9 +879,7 @@ static inline void _bstats_update(struct gnet_stats_basic_sync *bstats,
static inline void bstats_update(struct gnet_stats_basic_sync *bstats,
const struct sk_buff *skb)
{
- _bstats_update(bstats,
- qdisc_pkt_len(skb),
- skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1);
+ _bstats_update(bstats, qdisc_pkt_len(skb), qdisc_pkt_segs(skb));
}
static inline void qdisc_bstats_cpu_update(struct Qdisc *sch,
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 9213129f0de10bc67ce418f77c36fed2581f3781..a20880034aa5eacec0c25977406104448b336397 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1800,6 +1800,7 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
skb_list_walk_safe(segs, segs, nskb) {
skb_mark_not_on_list(segs);
qdisc_skb_cb(segs)->pkt_len = segs->len;
+ qdisc_skb_cb(segs)->pkt_segs = 1;
cobalt_set_enqueue_time(segs, now);
get_cobalt_cb(segs)->adjusted_len = cake_overhead(q,
segs);
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index 4b975feb52b1f3d3b37b31713d1477de5f5806d9..6d7e6389758dc8e645b1116efe4e11fb7290ac86 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -475,6 +475,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
* (3) Enqueue fragment & set ts in dualpi2_enqueue_skb
*/
qdisc_skb_cb(nskb)->pkt_len = nskb->len;
+ qdisc_skb_cb(nskb)->pkt_segs = 1;
dualpi2_skb_cb(nskb)->classified =
dualpi2_skb_cb(skb)->classified;
dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index eafc316ae319e3f8c23b0cb0c58fdf54be102213..32a5f33040461f3be952055c097b5f2fe760a858 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -429,6 +429,7 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
struct sk_buff *segs;
netdev_features_t features = netif_skb_features(skb);
+ qdisc_skb_cb(skb)->pkt_segs = 1;
segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
if (IS_ERR_OR_NULL(segs)) {
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 2255355e51d350eded4549c1584b60d4d9b00fff..d920f57dc6d7659c510a98956c6dd2ed9e5ee5b8 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1250,7 +1250,7 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
}
}
- gso_segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
+ gso_segs = qdisc_pkt_segs(skb);
err = qdisc_enqueue(skb, cl->qdisc, to_free);
if (unlikely(err != NET_XMIT_SUCCESS)) {
pr_debug("qfq_enqueue: enqueue failed %d\n", err);
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 39b735386996eb59712a1fc28f7bb903ec1b2220..300d577b328699eb42d2b829ecfc76464fd7b186 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -595,6 +595,7 @@ static int taprio_enqueue_segmented(struct sk_buff *skb, struct Qdisc *sch,
skb_list_walk_safe(segs, segs, nskb) {
skb_mark_not_on_list(segs);
qdisc_skb_cb(segs)->pkt_len = segs->len;
+ qdisc_skb_cb(segs)->pkt_segs = 1;
slen += segs->len;
/* FIXME: we should be segmenting to a smaller size
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 4c977f049670a600eafd219c898e5f29597be2c1..f2340164f579a25431979e12ec3d23ab828edd16 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -221,6 +221,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
skb_mark_not_on_list(segs);
seg_len = segs->len;
qdisc_skb_cb(segs)->pkt_len = seg_len;
+ qdisc_skb_cb(segs)->pkt_segs = 1;
ret = qdisc_enqueue(segs, q->qdisc, to_free);
if (ret != NET_XMIT_SUCCESS) {
if (net_xmit_drop_count(ret))
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 06/14] net_sched: cake: use qdisc_pkt_segs()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (4 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 05/14] net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 07/14] net_sched: add Qdisc_read_mostly and Qdisc_write groups Eric Dumazet
` (9 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Use the new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue()
for non-GSO packets.
cake_overhead() no longer has to recompute the segment count.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/sched/sch_cake.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index a20880034aa5eacec0c25977406104448b336397..5948a149129c6de041ba949e2e2b5b6b4eb54166 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1398,12 +1398,12 @@ static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
const struct skb_shared_info *shinfo = skb_shinfo(skb);
unsigned int hdr_len, last_len = 0;
u32 off = skb_network_offset(skb);
+ u16 segs = qdisc_pkt_segs(skb);
u32 len = qdisc_pkt_len(skb);
- u16 segs = 1;
q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
- if (!shinfo->gso_size)
+ if (segs == 1)
return cake_calc_overhead(q, len, off);
/* borrowed from qdisc_pkt_len_segs_init() */
@@ -1430,12 +1430,6 @@ static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
hdr_len += sizeof(struct udphdr);
}
- if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
- segs = DIV_ROUND_UP(skb->len - hdr_len,
- shinfo->gso_size);
- else
- segs = shinfo->gso_segs;
-
len = shinfo->gso_size + hdr_len;
last_len = skb->len - shinfo->gso_size * (segs - 1);
@@ -1788,7 +1782,7 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (unlikely(len > b->max_skblen))
b->max_skblen = len;
- if (skb_is_gso(skb) && q->rate_flags & CAKE_FLAG_SPLIT_GSO) {
+ if (qdisc_pkt_segs(skb) > 1 && q->rate_flags & CAKE_FLAG_SPLIT_GSO) {
struct sk_buff *segs, *nskb;
netdev_features_t features = netif_skb_features(skb);
unsigned int slen = 0, numsegs = 0;
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 07/14] net_sched: add Qdisc_read_mostly and Qdisc_write groups
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (5 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 06/14] net_sched: cake: use qdisc_pkt_segs() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 08/14] net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb() Eric Dumazet
` (8 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
It is possible to reorganize struct Qdisc to avoid always dirtying two
cache lines in the fast path, reducing this to a single dirtied cache line.
In the current layout, we change only four (sometimes six) fields in the first cache line:
- q.spinlock
- q.qlen
- bstats.bytes
- bstats.packets
- some qdiscs also change q.next/q.prev
In the second cache line, we change the following in the fast path:
- running
- state
- qstats.backlog
/* --- cacheline 2 boundary (128 bytes) --- */
struct sk_buff_head gso_skb __attribute__((__aligned__(64))); /* 0x80 0x18 */
struct qdisc_skb_head q; /* 0x98 0x18 */
struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xb0 0x10 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct gnet_stats_queue qstats; /* 0xc0 0x14 */
bool running; /* 0xd4 0x1 */
/* XXX 3 bytes hole, try to pack */
unsigned long state; /* 0xd8 0x8 */
struct Qdisc * next_sched; /* 0xe0 0x8 */
struct sk_buff_head skb_bad_txq; /* 0xe8 0x18 */
/* --- cacheline 4 boundary (256 bytes) --- */
Reorganize things to have a mostly-read first cache line,
followed by a mostly-written one.
This gives a ~3% performance increase under TX stress.
Note that there is an additional hole because @qstats now spans into a third cache line.
/* --- cacheline 2 boundary (128 bytes) --- */
__u8 __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /* 0x80 0 */
struct sk_buff_head gso_skb; /* 0x80 0x18 */
struct Qdisc * next_sched; /* 0x98 0x8 */
struct sk_buff_head skb_bad_txq; /* 0xa0 0x18 */
__u8 __cacheline_group_end__Qdisc_read_mostly[0]; /* 0xb8 0 */
/* XXX 8 bytes hole, try to pack */
/* --- cacheline 3 boundary (192 bytes) --- */
__u8 __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /* 0xc0 0 */
struct qdisc_skb_head q; /* 0xc0 0x18 */
unsigned long state; /* 0xd8 0x8 */
struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xe0 0x10 */
bool running; /* 0xf0 0x1 */
/* XXX 3 bytes hole, try to pack */
struct gnet_stats_queue qstats; /* 0xf4 0x14 */
/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
__u8 __cacheline_group_end__Qdisc_write[0]; /* 0x108 0 */
/* XXX 56 bytes hole, try to pack */
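For reference, a minimal sketch of the cacheline group pattern used below;
__cacheline_group_begin()/__cacheline_group_end() come from <linux/cache.h>
and expand to zero-size markers, while the struct and field names here are
purely illustrative:

    #include <linux/cache.h>

    struct example {
        /* Fields read in the fast path but rarely dirtied. */
        __cacheline_group_begin(example_read_mostly) ____cacheline_aligned;
        unsigned long mostly_read;
        __cacheline_group_end(example_read_mostly);

        /* Fields dirtied in the fast path. */
        __cacheline_group_begin(example_write) ____cacheline_aligned;
        unsigned long often_written;
        __cacheline_group_end(example_write);
    };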
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index cdf7a58ebcf5ef2b5f8b76eb6fbe92d5f0e07899..79501499dafba56271b9ebd97a8f379ffdc83cac 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -103,17 +103,24 @@ struct Qdisc {
int pad;
refcount_t refcnt;
- /*
- * For performance sake on SMP, we put highly modified fields at the end
- */
- struct sk_buff_head gso_skb ____cacheline_aligned_in_smp;
- struct qdisc_skb_head q;
- struct gnet_stats_basic_sync bstats;
- struct gnet_stats_queue qstats;
- bool running; /* must be written under qdisc spinlock */
- unsigned long state;
- struct Qdisc *next_sched;
- struct sk_buff_head skb_bad_txq;
+ /* Cache line potentially dirtied in dequeue() or __netif_reschedule(). */
+ __cacheline_group_begin(Qdisc_read_mostly) ____cacheline_aligned;
+ struct sk_buff_head gso_skb;
+ struct Qdisc *next_sched;
+ struct sk_buff_head skb_bad_txq;
+ __cacheline_group_end(Qdisc_read_mostly);
+
+ /* Fields dirtied in dequeue() fast path. */
+ __cacheline_group_begin(Qdisc_write) ____cacheline_aligned;
+ struct qdisc_skb_head q;
+ unsigned long state;
+ struct gnet_stats_basic_sync bstats;
+ bool running; /* must be written under qdisc spinlock */
+
+ /* Note : we only change qstats.backlog in fast path. */
+ struct gnet_stats_queue qstats;
+ __cacheline_group_end(Qdisc_write);
+
atomic_long_t defer_count ____cacheline_aligned_in_smp;
struct llist_head defer_list;
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 08/14] net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (6 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 07/14] net_sched: add Qdisc_read_mostly and Qdisc_write groups Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 09/14] net_sched: sch_fq: prefetch one skb ahead in dequeue() Eric Dumazet
` (7 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Group together the changes to qdisc fields to reduce the chances of false
sharing when another CPU attempts to acquire the qdisc spinlock:
qdisc_qstats_backlog_dec(sch, skb);
sch->q.qlen--;
qdisc_bstats_update(sch, skb);
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/sched/sch_fq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index fee922da2f99c0c7ac6d86569cf3bbce47898951..0b0ca1aa9251f959e87dd5dc504fbe0f4cbc75eb 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -497,6 +497,7 @@ static void fq_dequeue_skb(struct Qdisc *sch, struct fq_flow *flow,
skb_mark_not_on_list(skb);
qdisc_qstats_backlog_dec(sch, skb);
sch->q.qlen--;
+ qdisc_bstats_update(sch, skb);
}
static void flow_queue_add(struct fq_flow *flow, struct sk_buff *skb)
@@ -776,7 +777,6 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
f->time_next_packet = now + len;
}
out:
- qdisc_bstats_update(sch, skb);
return skb;
}
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 09/14] net_sched: sch_fq: prefetch one skb ahead in dequeue()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (7 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 08/14] net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:31 ` [PATCH v2 net-next 10/14] net: prefetch skb->priority in __dev_xmit_skb() Eric Dumazet
` (6 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Prefetch the skb that we are likely to dequeue at the next dequeue() call.
Also call fq_dequeue_skb() a bit sooner in fq_dequeue().
This reduces the window between the read of q.qlen and the changes to
fields in the cache line that could be dirtied by another CPU
trying to queue a packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/sched/sch_fq.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 0b0ca1aa9251f959e87dd5dc504fbe0f4cbc75eb..6e5f2f4f241546605f8ba37f96275446c8836eee 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -480,7 +480,10 @@ static void fq_erase_head(struct Qdisc *sch, struct fq_flow *flow,
struct sk_buff *skb)
{
if (skb == flow->head) {
- flow->head = skb->next;
+ struct sk_buff *next = skb->next;
+
+ prefetch(next);
+ flow->head = next;
} else {
rb_erase(&skb->rbnode, &flow->t_root);
skb->dev = qdisc_dev(sch);
@@ -712,6 +715,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
goto begin;
}
prefetch(&skb->end);
+ fq_dequeue_skb(sch, f, skb);
if ((s64)(now - time_next_packet - q->ce_threshold) > 0) {
INET_ECN_set_ce(skb);
q->stat_ce_mark++;
@@ -719,7 +723,6 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
if (--f->qlen == 0)
q->inactive_flows++;
q->band_pkt_count[fq_skb_cb(skb)->band]--;
- fq_dequeue_skb(sch, f, skb);
} else {
head->first = f->next;
/* force a pass through old_flows to prevent starvation */
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 10/14] net: prefetch skb->priority in __dev_xmit_skb()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (8 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 09/14] net_sched: sch_fq: prefetch one skb ahead in dequeue() Eric Dumazet
@ 2025-11-11 9:31 ` Eric Dumazet
2025-11-11 9:32 ` [PATCH v2 net-next 11/14] net: annotate a data-race " Eric Dumazet
` (5 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Most qdiscs need to read skb->priority at enqueue() time.
In commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
I added a prefetch(next); let's add another one for the second
half of the skb.
Note that skb->priority and skb->hash share a common cache line,
so this patch helps qdiscs needing both fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index e19eb4e9d77c27535ab2a0ce14299281e3ef9397..53e2496dc4292284072946fd9131d3f9a0c0af44 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4246,6 +4246,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
prefetch(next);
+ prefetch(&next->priority);
skb_mark_not_on_list(skb);
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
count++;
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 11/14] net: annotate a data-race in __dev_xmit_skb()
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (9 preceding siblings ...)
2025-11-11 9:31 ` [PATCH v2 net-next 10/14] net: prefetch skb->priority in __dev_xmit_skb() Eric Dumazet
@ 2025-11-11 9:32 ` Eric Dumazet
2025-11-11 9:32 ` [PATCH v2 net-next 12/14] net_sched: add tcf_kfree_skb_list() helper Eric Dumazet
` (4 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:32 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
q->limit is read locklessly; add a READ_ONCE().
Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 53e2496dc4292284072946fd9131d3f9a0c0af44..10042139dbb054b9a93dfb019477a80263feb029 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4194,7 +4194,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
do {
if (first_n && !defer_count) {
defer_count = atomic_long_inc_return(&q->defer_count);
- if (unlikely(defer_count > q->limit)) {
+ if (unlikely(defer_count > READ_ONCE(q->limit))) {
kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
return NET_XMIT_DROP;
}
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 12/14] net_sched: add tcf_kfree_skb_list() helper
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (10 preceding siblings ...)
2025-11-11 9:32 ` [PATCH v2 net-next 11/14] net: annotate a data-race " Eric Dumazet
@ 2025-11-11 9:32 ` Eric Dumazet
2025-11-11 9:32 ` [PATCH v2 net-next 13/14] net_sched: add qdisc_dequeue_drop() helper Eric Dumazet
` (3 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:32 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Using kfree_skb_list_reason() to free a list of skbs from qdisc
operations seems wrong, as each skb might have a different drop reason.
Clean up __dev_xmit_skb() to call tcf_kfree_skb_list() once,
in preparation for the following patch.
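As a minimal illustration (the helper and skb names are hypothetical), two
skbs chained on a to_free list can carry different drop reasons, which
tcf_kfree_skb_list() honors on a per-skb basis:

    static void toy_free_two(struct sk_buff *skb1, struct sk_buff *skb2)
    {
        tcf_set_drop_reason(skb1, SKB_DROP_REASON_QDISC_DROP);
        tcf_set_drop_reason(skb2, SKB_DROP_REASON_QDISC_CONGESTED);
        skb1->next = skb2;
        skb2->next = NULL;
        /* Each skb is freed with its own reason, unlike
         * kfree_skb_list_reason() which applies a single reason.
         */
        tcf_kfree_skb_list(skb1);
    }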
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 11 +++++++++++
net/core/dev.c | 15 +++++----------
2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 79501499dafba56271b9ebd97a8f379ffdc83cac..b8092d0378a0cafa290123d17c1b0ba787cd0680 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1105,6 +1105,17 @@ static inline void tcf_set_drop_reason(const struct sk_buff *skb,
tc_skb_cb(skb)->drop_reason = reason;
}
+static inline void tcf_kfree_skb_list(struct sk_buff *skb)
+{
+ while (unlikely(skb)) {
+ struct sk_buff *next = skb->next;
+
+ prefetch(next);
+ kfree_skb_reason(skb, tcf_get_drop_reason(skb));
+ skb = next;
+ }
+}
+
/* Instead of calling kfree_skb() while root qdisc lock is held,
* queue the skb for future freeing at end of __dev_xmit_skb()
*/
diff --git a/net/core/dev.c b/net/core/dev.c
index 10042139dbb054b9a93dfb019477a80263feb029..e865cdb9b6966225072dc44a86610b9c7828bd8c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4162,7 +4162,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
__qdisc_run(q);
qdisc_run_end(q);
- goto no_lock_out;
+ goto free_skbs;
}
qdisc_bstats_cpu_update(q, skb);
@@ -4176,12 +4176,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
qdisc_run(q);
-
-no_lock_out:
- if (unlikely(to_free))
- kfree_skb_list_reason(to_free,
- tcf_get_drop_reason(to_free));
- return rc;
+ goto free_skbs;
}
/* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
@@ -4257,9 +4252,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
}
unlock:
spin_unlock(root_lock);
- if (unlikely(to_free))
- kfree_skb_list_reason(to_free,
- tcf_get_drop_reason(to_free));
+
+free_skbs:
+ tcf_kfree_skb_list(to_free);
return rc;
}
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 13/14] net_sched: add qdisc_dequeue_drop() helper
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (11 preceding siblings ...)
2025-11-11 9:32 ` [PATCH v2 net-next 12/14] net_sched: add tcf_kfree_skb_list() helper Eric Dumazet
@ 2025-11-11 9:32 ` Eric Dumazet
2025-11-11 9:32 ` [PATCH v2 net-next 14/14] net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel Eric Dumazet
` (2 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:32 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
Some qdiscs, like cake, codel and fq_codel, might drop packets
in their dequeue() method.
This is currently problematic because dequeue() runs with
the qdisc spinlock held, and freeing skbs can be extremely expensive.
Add a qdisc_dequeue_drop() helper and a new TCQ_F_DEQUEUE_DROPS flag
so that these qdiscs can opt in to deferring the skb frees
until after the qdisc spinlock is released.
TCQ_F_DEQUEUE_DROPS is an attempt to avoid penalizing other qdiscs
with an extra cache line miss.
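A minimal sketch of the opt-in pattern, modeled on the codel conversion in
patch 14/14 (the toy_ names are hypothetical):

    /* At init time, opt in to deferred frees: */
    sch->flags |= TCQ_F_DEQUEUE_DROPS;

    /* In the dequeue() path, chain drops on q->to_free via the new
     * helper instead of calling kfree_skb_reason() while the qdisc
     * spinlock is held:
     */
    static void toy_drop_func(struct sk_buff *skb, void *ctx)
    {
        struct Qdisc *sch = ctx;

        qdisc_dequeue_drop(sch, skb, SKB_DROP_REASON_QDISC_CONGESTED);
        qdisc_qstats_drop(sch);
    }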
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/pkt_sched.h | 5 +++--
include/net/sch_generic.h | 30 +++++++++++++++++++++++++++---
net/core/dev.c | 22 +++++++++++++---------
3 files changed, 43 insertions(+), 14 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 4678db45832a1e3bf7b8a07756fb89ab868bd5d2..e703c507d0daa97ae7c3bf131e322b1eafcc5664 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -114,12 +114,13 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
void __qdisc_run(struct Qdisc *q);
-static inline void qdisc_run(struct Qdisc *q)
+static inline struct sk_buff *qdisc_run(struct Qdisc *q)
{
if (qdisc_run_begin(q)) {
__qdisc_run(q);
- qdisc_run_end(q);
+ return qdisc_run_end(q);
}
+ return NULL;
}
extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index b8092d0378a0cafa290123d17c1b0ba787cd0680..c3a7268b567e0abf3f38290cd4e3fa7cd0601e36 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -88,6 +88,8 @@ struct Qdisc {
#define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
#define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
#define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_DEQUEUE_DROPS 0x400 /* ->dequeue() can drop packets in q->to_free */
+
u32 limit;
const struct Qdisc_ops *ops;
struct qdisc_size_table __rcu *stab;
@@ -119,6 +121,8 @@ struct Qdisc {
/* Note : we only change qstats.backlog in fast path. */
struct gnet_stats_queue qstats;
+
+ struct sk_buff *to_free;
__cacheline_group_end(Qdisc_write);
@@ -218,8 +222,10 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
return true;
}
-static inline void qdisc_run_end(struct Qdisc *qdisc)
+static inline struct sk_buff *qdisc_run_end(struct Qdisc *qdisc)
{
+ struct sk_buff *to_free = NULL;
+
if (qdisc->flags & TCQ_F_NOLOCK) {
spin_unlock(&qdisc->seqlock);
@@ -232,9 +238,16 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
if (unlikely(test_bit(__QDISC_STATE_MISSED,
&qdisc->state)))
__netif_schedule(qdisc);
- } else {
- WRITE_ONCE(qdisc->running, false);
+ return NULL;
+ }
+
+ if (qdisc->flags & TCQ_F_DEQUEUE_DROPS) {
+ to_free = qdisc->to_free;
+ if (to_free)
+ qdisc->to_free = NULL;
}
+ WRITE_ONCE(qdisc->running, false);
+ return to_free;
}
static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
@@ -1116,6 +1129,17 @@ static inline void tcf_kfree_skb_list(struct sk_buff *skb)
}
}
+static inline void qdisc_dequeue_drop(struct Qdisc *q, struct sk_buff *skb,
+ enum skb_drop_reason reason)
+{
+ DEBUG_NET_WARN_ON_ONCE(!(q->flags & TCQ_F_DEQUEUE_DROPS));
+ DEBUG_NET_WARN_ON_ONCE(q->flags & TCQ_F_NOLOCK);
+
+ tcf_set_drop_reason(skb, reason);
+ skb->next = q->to_free;
+ q->to_free = skb;
+}
+
/* Instead of calling kfree_skb() while root qdisc lock is held,
* queue the skb for future freeing at end of __dev_xmit_skb()
*/
diff --git a/net/core/dev.c b/net/core/dev.c
index e865cdb9b6966225072dc44a86610b9c7828bd8c..9094c0fb8c689cd5252274d839638839bfb7642e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4141,7 +4141,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
- struct sk_buff *next, *to_free = NULL;
+ struct sk_buff *next, *to_free = NULL, *to_free2 = NULL;
spinlock_t *root_lock = qdisc_lock(q);
struct llist_node *ll_list, *first_n;
unsigned long defer_count = 0;
@@ -4160,7 +4160,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
if (unlikely(!nolock_qdisc_is_empty(q))) {
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
__qdisc_run(q);
- qdisc_run_end(q);
+ to_free2 = qdisc_run_end(q);
goto free_skbs;
}
@@ -4170,12 +4170,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
!nolock_qdisc_is_empty(q))
__qdisc_run(q);
- qdisc_run_end(q);
- return NET_XMIT_SUCCESS;
+ to_free2 = qdisc_run_end(q);
+ rc = NET_XMIT_SUCCESS;
+ goto free_skbs;
}
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- qdisc_run(q);
+ to_free2 = qdisc_run(q);
goto free_skbs;
}
@@ -4234,7 +4235,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_bstats_update(q, skb);
if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
__qdisc_run(q);
- qdisc_run_end(q);
+ to_free2 = qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
int count = 0;
@@ -4246,7 +4247,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
count++;
}
- qdisc_run(q);
+ to_free2 = qdisc_run(q);
if (count != 1)
rc = NET_XMIT_SUCCESS;
}
@@ -4255,6 +4256,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
free_skbs:
tcf_kfree_skb_list(to_free);
+ tcf_kfree_skb_list(to_free2);
return rc;
}
@@ -5747,8 +5749,9 @@ static __latent_entropy void net_tx_action(void)
rcu_read_lock();
while (head) {
- struct Qdisc *q = head;
spinlock_t *root_lock = NULL;
+ struct sk_buff *to_free;
+ struct Qdisc *q = head;
head = head->next_sched;
@@ -5775,9 +5778,10 @@ static __latent_entropy void net_tx_action(void)
}
clear_bit(__QDISC_STATE_SCHED, &q->state);
- qdisc_run(q);
+ to_free = qdisc_run(q);
if (root_lock)
spin_unlock(root_lock);
+ tcf_kfree_skb_list(to_free);
}
rcu_read_unlock();
--
2.52.0.rc1.455.g30608eb744-goog
* [PATCH v2 net-next 14/14] net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (12 preceding siblings ...)
2025-11-11 9:32 ` [PATCH v2 net-next 13/14] net_sched: add qdisc_dequeue_drop() helper Eric Dumazet
@ 2025-11-11 9:32 ` Eric Dumazet
2025-11-11 14:09 ` [syzbot ci] Re: net_sched: speedup qdisc dequeue syzbot ci
2025-11-11 16:43 ` [PATCH v2 net-next 00/14] " Toke Høiland-Jørgensen
15 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 9:32 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Toke Høiland-Jørgensen, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet
cake, codel and fq_codel can drop many packets from dequeue().
Use qdisc_dequeue_drop() so that the freeing can happen
outside the scope of the qdisc spinlock.
Add TCQ_F_DEQUEUE_DROPS to sch->flags.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/sched/sch_cake.c | 4 +++-
net/sched/sch_codel.c | 4 +++-
net/sched/sch_fq_codel.c | 5 ++++-
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 5948a149129c6de041ba949e2e2b5b6b4eb54166..0ea9440f68c60ab69e9dd889b225c1a171199787 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -2183,7 +2183,7 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
b->tin_dropped++;
qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
qdisc_qstats_drop(sch);
- kfree_skb_reason(skb, reason);
+ qdisc_dequeue_drop(sch, skb, reason);
if (q->rate_flags & CAKE_FLAG_INGRESS)
goto retry;
}
@@ -2724,6 +2724,8 @@ static int cake_init(struct Qdisc *sch, struct nlattr *opt,
int i, j, err;
sch->limit = 10240;
+ sch->flags |= TCQ_F_DEQUEUE_DROPS;
+
q->tin_mode = CAKE_DIFFSERV_DIFFSERV3;
q->flow_mode = CAKE_FLOW_TRIPLE;
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index fa0314679e434a4f84a128e8330bb92743c3d66a..c6551578f1cf8d332ca20ea062e858ffb437966a 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -52,7 +52,7 @@ static void drop_func(struct sk_buff *skb, void *ctx)
{
struct Qdisc *sch = ctx;
- kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_CONGESTED);
+ qdisc_dequeue_drop(sch, skb, SKB_DROP_REASON_QDISC_CONGESTED);
qdisc_qstats_drop(sch);
}
@@ -182,6 +182,8 @@ static int codel_init(struct Qdisc *sch, struct nlattr *opt,
else
sch->flags &= ~TCQ_F_CAN_BYPASS;
+ sch->flags |= TCQ_F_DEQUEUE_DROPS;
+
return 0;
}
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index a141423929394d7ebe127aa328dcf13ae67b3d56..dc187c7f06b10d8fd4191ead82e6a60133fff09d 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -275,7 +275,7 @@ static void drop_func(struct sk_buff *skb, void *ctx)
{
struct Qdisc *sch = ctx;
- kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_CONGESTED);
+ qdisc_dequeue_drop(sch, skb, SKB_DROP_REASON_QDISC_CONGESTED);
qdisc_qstats_drop(sch);
}
@@ -519,6 +519,9 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt,
sch->flags |= TCQ_F_CAN_BYPASS;
else
sch->flags &= ~TCQ_F_CAN_BYPASS;
+
+ sch->flags |= TCQ_F_DEQUEUE_DROPS;
+
return 0;
alloc_failure:
--
2.52.0.rc1.455.g30608eb744-goog
* [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (13 preceding siblings ...)
2025-11-11 9:32 ` [PATCH v2 net-next 14/14] net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel Eric Dumazet
@ 2025-11-11 14:09 ` syzbot ci
2025-11-11 16:28 ` Eric Dumazet
2025-11-11 16:43 ` [PATCH v2 net-next 00/14] " Toke Høiland-Jørgensen
15 siblings, 1 reply; 23+ messages in thread
From: syzbot ci @ 2025-11-11 14:09 UTC (permalink / raw)
To: davem, edumazet, eric.dumazet, horms, jhs, jiri, kuba, kuniyu,
netdev, pabeni, toke, willemb, xiyou.wangcong
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v2] net_sched: speedup qdisc dequeue
https://lore.kernel.org/all/20251111093204.1432437-1-edumazet@google.com
* [PATCH v2 net-next 01/14] net_sched: make room for (struct qdisc_skb_cb)->pkt_segs
* [PATCH v2 net-next 02/14] net: init shinfo->gso_segs from qdisc_pkt_len_init()
* [PATCH v2 net-next 03/14] net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init()
* [PATCH v2 net-next 04/14] net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()
* [PATCH v2 net-next 05/14] net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()
* [PATCH v2 net-next 06/14] net_sched: cake: use qdisc_pkt_segs()
* [PATCH v2 net-next 07/14] net_sched: add Qdisc_read_mostly and Qdisc_write groups
* [PATCH v2 net-next 08/14] net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()
* [PATCH v2 net-next 09/14] net_sched: sch_fq: prefetch one skb ahead in dequeue()
* [PATCH v2 net-next 10/14] net: prefetch skb->priority in __dev_xmit_skb()
* [PATCH v2 net-next 11/14] net: annotate a data-race in __dev_xmit_skb()
* [PATCH v2 net-next 12/14] net_sched: add tcf_kfree_skb_list() helper
* [PATCH v2 net-next 13/14] net_sched: add qdisc_dequeue_drop() helper
* [PATCH v2 net-next 14/14] net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel
and found the following issue:
WARNING in sk_skb_reason_drop
Full report is available here:
https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
***
WARNING in sk_skb_reason_drop
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: a0c3aefb08cd81864b17c23c25b388dba90b9dad
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/a5059d85-d1f8-4036-a0fd-b677b5945ea9/config
C repro: https://ci.syzbot.org/findings/e529fc3a-766e-4d6c-899a-c35a8fdaa940/c_repro
syz repro: https://ci.syzbot.org/findings/e529fc3a-766e-4d6c-899a-c35a8fdaa940/syz_repro
syzkaller0: entered promiscuous mode
syzkaller0: entered allmulticast mode
------------[ cut here ]------------
WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 __sk_skb_reason_drop net/core/skbuff.c:1189 [inline]
WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 sk_skb_reason_drop+0x76/0x170 net/core/skbuff.c:1214
Modules linked in:
CPU: 0 UID: 0 PID: 5965 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__sk_skb_reason_drop net/core/skbuff.c:1189 [inline]
RIP: 0010:sk_skb_reason_drop+0x76/0x170 net/core/skbuff.c:1214
Code: 20 2e a0 f8 83 fd 01 75 26 41 8d ae 00 00 fd ff bf 01 00 fd ff 89 ee e8 08 2e a0 f8 81 fd 00 00 fd ff 77 32 e8 bb 29 a0 f8 90 <0f> 0b 90 eb 53 bf 01 00 00 00 89 ee e8 e9 2d a0 f8 85 ed 0f 8e b2
RSP: 0018:ffffc9000284f3b0 EFLAGS: 00010293
RAX: ffffffff891fdcd5 RBX: ffff888113587680 RCX: ffff88816e6f3a00
RDX: 0000000000000000 RSI: 000000006e1a2a10 RDI: 00000000fffd0001
RBP: 000000006e1a2a10 R08: ffff888113587767 R09: 1ffff110226b0eec
R10: dffffc0000000000 R11: ffffed10226b0eed R12: ffff888113587764
R13: dffffc0000000000 R14: 000000006e1d2a10 R15: 0000000000000000
FS: 000055558e11c500(0000) GS:ffff88818eb38000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 000000011053c000 CR4: 00000000000006f0
Call Trace:
<TASK>
kfree_skb_reason include/linux/skbuff.h:1322 [inline]
tcf_kfree_skb_list include/net/sch_generic.h:1127 [inline]
__dev_xmit_skb net/core/dev.c:4258 [inline]
__dev_queue_xmit+0x2669/0x3180 net/core/dev.c:4783
packet_snd net/packet/af_packet.c:3076 [inline]
packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
____sys_sendmsg+0x505/0x830 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fc1a7b8efc9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff4ba6d968 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fc1a7de5fa0 RCX: 00007fc1a7b8efc9
RDX: 0000000000000004 RSI: 00002000000000c0 RDI: 0000000000000007
RBP: 00007fc1a7c11f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fc1a7de5fa0 R14: 00007fc1a7de5fa0 R15: 0000000000000003
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 14:09 ` [syzbot ci] Re: net_sched: speedup qdisc dequeue syzbot ci
@ 2025-11-11 16:28 ` Eric Dumazet
2025-11-11 19:23 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 16:28 UTC (permalink / raw)
To: syzbot ci
Cc: davem, eric.dumazet, horms, jhs, jiri, kuba, kuniyu, netdev,
pabeni, toke, willemb, xiyou.wangcong, syzbot, syzkaller-bugs
On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
<syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
>
> syzbot ci has tested the following series
>
> [v2] net_sched: speedup qdisc dequeue
> https://lore.kernel.org/all/20251111093204.1432437-1-edumazet@google.com
> [...]
>
> and found the following issue:
> WARNING in sk_skb_reason_drop
>
> Full report is available here:
> https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
>
> [...]
Seems that cls_bpf_classify() is able to change tc_skb_cb(skb)->drop_reason,
and this predates my code.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 16:28 ` Eric Dumazet
@ 2025-11-11 19:23 ` Eric Dumazet
2025-11-11 19:44 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 19:23 UTC (permalink / raw)
To: syzbot ci, Victor Nogueira, Daniel Borkmann
Cc: davem, eric.dumazet, horms, jhs, jiri, kuba, kuniyu, netdev,
pabeni, toke, willemb, xiyou.wangcong, syzbot, syzkaller-bugs
On Tue, Nov 11, 2025 at 8:28 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
> <syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
> >
> > syzbot ci has tested the following series
> >
> > [v2] net_sched: speedup qdisc dequeue
> > [...]
> > and found the following issue:
> > WARNING in sk_skb_reason_drop
> >
> > Full report is available here:
> > https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
> >
> > ***
> >
> > WARNING in sk_skb_reason_drop
> > [...]
> >
>
> Seems that cls_bpf_classify() is able to change tc_skb_cb(skb)->drop_reason,
> and this predates my code.
struct bpf_skb_data_end {
	struct qdisc_skb_cb qdisc_cb;
	void *data_meta;
	void *data_end;
};

So anytime BPF calls bpf_compute_data_pointers(), it overwrites
tc_skb_cb(skb)->drop_reason,
because offsetof(..., data_meta) == offsetof(..., drop_reason).
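In other words, a minimal userspace sketch of the aliasing (simplified
field sizes and hypothetical struct names, not the exact kernel layout):

/* Both views embed struct qdisc_skb_cb first, so the field that
 * follows it in one view shares storage with the field that follows
 * it in the other.  Writing data_meta clobbers drop_reason.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct qdisc_skb_cb_model {
	uint32_t pkt_len;
	unsigned char data[20];
};

struct tc_view {				/* tc_skb_cb-like */
	struct qdisc_skb_cb_model qdisc_cb;
	uint32_t drop_reason;
};

struct bpf_view {				/* bpf_skb_data_end-like */
	struct qdisc_skb_cb_model qdisc_cb;
	void *data_meta;
	void *data_end;
};

int main(void)
{
	union {					/* stand-in for skb->cb[] */
		struct tc_view tc;
		struct bpf_view bpf;
	} cb = { 0 };

	assert(offsetof(struct bpf_view, data_meta) ==
	       offsetof(struct tc_view, drop_reason));

	cb.tc.drop_reason = 2;		/* a valid drop reason */
	cb.bpf.data_meta = &cb;		/* what bpf_compute_data_pointers() amounts to */

	/* drop_reason now holds the low bits of a pointer: garbage. */
	return cb.tc.drop_reason == 2 ? 1 : 0;
}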
CC Victor and Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 19:23 ` Eric Dumazet
@ 2025-11-11 19:44 ` Eric Dumazet
2025-11-11 21:04 ` Victor Nogueira
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-11-11 19:44 UTC (permalink / raw)
To: syzbot ci, Victor Nogueira, Daniel Borkmann
Cc: davem, eric.dumazet, horms, jhs, jiri, kuba, kuniyu, netdev,
pabeni, toke, willemb, xiyou.wangcong, syzbot, syzkaller-bugs
On Tue, Nov 11, 2025 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 11, 2025 at 8:28 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
> > <syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
> > >
> > > syzbot ci has tested the following series
> > >
> > > [v2] net_sched: speedup qdisc dequeue
> > > [...]
> > > and found the following issue:
> > > WARNING in sk_skb_reason_drop
> > >
> > > Full report is available here:
> > > https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
> > >
> > > ***
> > >
> > > WARNING in sk_skb_reason_drop
> > > [...]
> > >
> >
> > Seems that cls_bpf_classify() is able to change tc_skb_cb(skb)->drop_reason,
> > and this predates my code.
>
> struct bpf_skb_data_end {
> 	struct qdisc_skb_cb qdisc_cb;
> 	void *data_meta;
> 	void *data_end;
> };
>
> So anytime BPF calls bpf_compute_data_pointers(), it overwrites
> tc_skb_cb(skb)->drop_reason,
> because offsetof(..., data_meta) == offsetof(..., drop_reason).
>
> CC Victor and Daniel
Quick and dirty patch to save/restore the space.
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 7fbe42f0e5c2b7aca0a28c34cd801c3a767c804e..004d8fe2f29d89bd7df82d90b7a1e2881f7a463b 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -82,11 +82,16 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb,
 				       const struct tcf_proto *tp,
 				       struct tcf_result *res)
 {
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
 	struct cls_bpf_head *head = rcu_dereference_bh(tp->root);
 	bool at_ingress = skb_at_tc_ingress(skb);
 	struct cls_bpf_prog *prog;
+	void *save[2];
 	int ret = -1;
 
+	save[0] = cb->data_meta;
+	save[1] = cb->data_end;
+
 	list_for_each_entry_rcu(prog, &head->plist, link) {
 		int filter_res;
 
@@ -133,7 +138,8 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb,
 		break;
 	}
-
+	cb->data_meta = save[0];
+	cb->data_end = save[1];
 	return ret;
 }
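As I read this patch, restoring once after the list walk (rather than per
program) is enough: each program still computes fresh pointers for itself
via bpf_compute_data_pointers() inside the loop, and the tc fields only
have to be intact again once cls_bpf_classify() returns to the caller
that may read drop_reason.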
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 19:44 ` Eric Dumazet
@ 2025-11-11 21:04 ` Victor Nogueira
2025-11-11 21:34 ` Jamal Hadi Salim
0 siblings, 1 reply; 23+ messages in thread
From: Victor Nogueira @ 2025-11-11 21:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: syzbot ci, Daniel Borkmann, davem, eric.dumazet, horms, jhs, jiri,
kuba, kuniyu, netdev, pabeni, toke, willemb, xiyou.wangcong,
syzbot, syzkaller-bugs
Hi Eric,
On Tue, Nov 11, 2025 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 11, 2025 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Nov 11, 2025 at 8:28 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
> > > <syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
> > > >
> > > > syzbot ci has tested the following series
> > > >
> > > > [v2] net_sched: speedup qdisc dequeue
> > > > [...]
> > > > and found the following issue:
> > > > WARNING in sk_skb_reason_drop
> > > >
> > > > Full report is available here:
> > > > https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
> > > >
> > > > ***
> > > >
> > > > WARNING in sk_skb_reason_drop
> > > > [...]
> > struct bpf_skb_data_end {
> > 	struct qdisc_skb_cb qdisc_cb;
> > 	void *data_meta;
> > 	void *data_end;
> > };
> >
> > So anytime BPF calls bpf_compute_data_pointers(), it overwrites
> > tc_skb_cb(skb)->drop_reason,
> > because offsetof(..., data_meta) == offsetof(..., drop_reason).
> >
> > CC Victor and Daniel
>
> Quick and dirty patch to save/restore the space.
>
> [...]
I think you are on the right track.
Maybe we can create helper functions for this.
Something like bpf_compute_and_save_data_end [1] and
bpf_restore_data_end [2], but for data_meta as well.
Also, I think we might have the same issue in tcf_bpf_act [3].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n907
[2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n917
[3] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/act_bpf.c#n50
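For illustration, such helpers could look roughly like this (hypothetical
names and bodies, modelled on the two helpers linked above; an untested
sketch, not a proposal):

/* Hypothetical shape only: save both aliased pointers around the
 * pointer computation, mirroring bpf_compute_and_save_data_end() /
 * bpf_restore_data_end() from include/linux/filter.h.
 */
static inline void bpf_compute_and_save_data_pointers(struct sk_buff *skb,
						      void *saved[2])
{
	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;

	saved[0] = cb->data_meta;
	saved[1] = cb->data_end;
	bpf_compute_data_pointers(skb);
}

static inline void bpf_restore_data_pointers(struct sk_buff *skb,
					     void *saved[2])
{
	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;

	cb->data_meta = saved[0];
	cb->data_end = saved[1];
}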
cheers,
Victor
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 21:04 ` Victor Nogueira
@ 2025-11-11 21:34 ` Jamal Hadi Salim
2025-11-12 12:21 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Jamal Hadi Salim @ 2025-11-11 21:34 UTC (permalink / raw)
To: Victor Nogueira
Cc: Eric Dumazet, syzbot ci, Daniel Borkmann, davem, eric.dumazet,
horms, jiri, kuba, kuniyu, netdev, pabeni, toke, willemb,
xiyou.wangcong, syzbot, syzkaller-bugs, Alexei Starovoitov
On Tue, Nov 11, 2025 at 4:04 PM Victor Nogueira <victor@mojatatu.com> wrote:
>
> Hi Eric,
>
> On Tue, Nov 11, 2025 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Nov 11, 2025 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Nov 11, 2025 at 8:28 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
> > > > <syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
> > > > >
> > > > > syzbot ci has tested the following series
> > > > >
> > > > > [v2] net_sched: speedup qdisc dequeue
> > > > > [...]
> > > > > and found the following issue:
> > > > > WARNING in sk_skb_reason_drop
> > > > >
> > > > > Full report is available here:
> > > > > https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
> > > > >
> > > > > ***
> > > > >
> > > > > WARNING in sk_skb_reason_drop
> > > > > [...]
> > > struct bpf_skb_data_end {
> > > 	struct qdisc_skb_cb qdisc_cb;
> > > 	void *data_meta;
> > > 	void *data_end;
> > > };
> > >
> > > So anytime BPF calls bpf_compute_data_pointers(), it overwrites
> > > tc_skb_cb(skb)->drop_reason,
> > > because offsetof(..., data_meta) == offsetof(..., drop_reason).
> > >
> > > CC Victor and Daniel
> >
> > Quick and dirty patch to save/restore the space.
> >
> > [...]
>
> I think you are on the right track.
> Maybe we can create helper functions for this.
> Something like bpf_compute_and_save_data_end [1] and
> bpf_restore_data_end [2], but for data_meta as well.
> Also, I think we might have the same issue in tcf_bpf_act [3].
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n907
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n917
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/act_bpf.c#n50
>
Digging a bit - for when you send the fixes: this overwriting, I believe,
was introduced in:
commit db58ba45920255e967cc1d62a430cebd634b5046
+Cc Alexei
> cheers,
> Victor
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net_sched: speedup qdisc dequeue
2025-11-11 21:34 ` Jamal Hadi Salim
@ 2025-11-12 12:21 ` Eric Dumazet
0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-11-12 12:21 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Victor Nogueira, syzbot ci, Daniel Borkmann, davem, eric.dumazet,
horms, jiri, kuba, kuniyu, netdev, pabeni, toke, willemb,
xiyou.wangcong, syzbot, syzkaller-bugs, Alexei Starovoitov
On Tue, Nov 11, 2025 at 1:34 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Nov 11, 2025 at 4:04 PM Victor Nogueira <victor@mojatatu.com> wrote:
> >
> > Hi Eric,
> >
> > On Tue, Nov 11, 2025 at 4:44 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Nov 11, 2025 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Nov 11, 2025 at 8:28 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Tue, Nov 11, 2025 at 6:09 AM syzbot ci
> > > > > <syzbot+ci51c71986dfbbfee2@syzkaller.appspotmail.com> wrote:
> > > > > >
> > > > > > syzbot ci has tested the following series
> > > > > >
> > > > > > [v2] net_sched: speedup qdisc dequeue
> > > > > > [...]
> > > > > > and found the following issue:
> > > > > > WARNING in sk_skb_reason_drop
> > > > > >
> > > > > > Full report is available here:
> > > > > > https://ci.syzbot.org/series/a9dbee91-6b1f-4ab9-b55d-43f7f50de064
> > > > > >
> > > > > > ***
> > > > > >
> > > > > > WARNING in sk_skb_reason_drop
> > > > > > [...]
> > > > struct bpf_skb_data_end {
> > > > 	struct qdisc_skb_cb qdisc_cb;
> > > > 	void *data_meta;
> > > > 	void *data_end;
> > > > };
> > > >
> > > > So anytime BPF calls bpf_compute_data_pointers(), it overwrites
> > > > tc_skb_cb(skb)->drop_reason,
> > > > because offsetof(..., data_meta) == offsetof(..., drop_reason).
> > > >
> > > > CC Victor and Daniel
> > >
> > > Quick and dirty patch to save/restore the space.
> > >
> > > [...]
> >
> > I think you are on the right track.
> > Maybe we can create helper functions for this.
> > Something like bpf_compute_and_save_data_end [1] and
> > bpf_restore_data_end [2], but for data_meta as well.
> > Also, I think we might have the same issue in tcf_bpf_act [3].
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n907
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/include/linux/filter.h#n917
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/act_bpf.c#n50
> >
>
> Digging a bit - for when you send the fixes: this overwriting, I believe,
> was introduced in:
> commit db58ba45920255e967cc1d62a430cebd634b5046
I will test the following:

 include/linux/filter.h | 20 ++++++++++++++++++++
 net/sched/act_bpf.c    |  7 +++----
 net/sched/cls_bpf.c    |  6 ++----
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a104b39942305af245b9f2938a0acf7d7ab33c23..03e7516c61872c1aa98e0be743abb96d496e49c3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -901,6 +901,26 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	cb->data_end = skb->data + skb_headlen(skb);
 }
 
+static inline int bpf_prog_run_data_pointers(
+	const struct bpf_prog *prog,
+	struct sk_buff *skb)
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+	void *save_data_meta, *save_data_end;
+	int res;
+
+	save_data_meta = cb->data_meta;
+	save_data_end = cb->data_end;
+
+	bpf_compute_data_pointers(skb);
+	res = bpf_prog_run(prog, skb);
+
+	cb->data_meta = save_data_meta;
+	cb->data_end = save_data_end;
+
+	return res;
+}
+
 /* Similar to bpf_compute_data_pointers(), except that save orginal
  * data in cb->data and cb->meta_data for restore.
  */
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 396b576390d00aad56bca6a18b7796e5324c0aef..3f5a5dc55c29433525b319f1307725d7feb015c6 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -47,13 +47,12 @@ TC_INDIRECT_SCOPE int tcf_bpf_act(struct sk_buff *skb,
 	filter = rcu_dereference(prog->filter);
 	if (at_ingress) {
 		__skb_push(skb, skb->mac_len);
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(filter, skb);
+		filter_res = bpf_prog_run_data_pointers(filter, skb);
 		__skb_pull(skb, skb->mac_len);
 	} else {
-		bpf_compute_data_pointers(skb);
-		filter_res = bpf_prog_run(filter, skb);
+		filter_res = bpf_prog_run_data_pointers(filter, skb);
 	}
+
 	if (unlikely(!skb->tstamp && skb->tstamp_type))
 		skb->tstamp_type = SKB_CLOCK_REALTIME;
 	if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 7fbe42f0e5c2b7aca0a28c34cd801c3a767c804e..a32754a2658bb7d21e8ceb62c67d6684ed4f9fcc 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -97,12 +97,10 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb,
 		} else if (at_ingress) {
 			/* It is safe to push/pull even if skb_shared() */
 			__skb_push(skb, skb->mac_len);
-			bpf_compute_data_pointers(skb);
-			filter_res = bpf_prog_run(prog->filter, skb);
+			filter_res = bpf_prog_run_data_pointers(prog->filter, skb);
 			__skb_pull(skb, skb->mac_len);
 		} else {
-			bpf_compute_data_pointers(skb);
-			filter_res = bpf_prog_run(prog->filter, skb);
+			filter_res = bpf_prog_run_data_pointers(prog->filter, skb);
 		}
 		if (unlikely(!skb->tstamp && skb->tstamp_type))
 			skb->tstamp_type = SKB_CLOCK_REALTIME;
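A note on the design: folding compute + run + restore into a single
bpf_prog_run_data_pointers() means no caller can forget the restore, and
it covers tcf_bpf_act() as well, addressing Victor's [3]. Unlike the
quick-and-dirty version above, the restore now happens per program run
rather than once per classify walk - slightly more work in the loop, but
the helper stays self-contained.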
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue
2025-11-11 9:31 [PATCH v2 net-next 00/14] net_sched: speedup qdisc dequeue Eric Dumazet
` (14 preceding siblings ...)
2025-11-11 14:09 ` [syzbot ci] Re: net_sched: speedup qdisc dequeue syzbot ci
@ 2025-11-11 16:43 ` Toke Høiland-Jørgensen
15 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-11 16:43 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Avoid up to two cache line misses in qdisc dequeue() to fetch
> skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held.
>
> Idea is to cache gso_segs at enqueue time before spinlock is
> acquired, in the first skb cache line, where we already
> have qdisc_skb_cb(skb)->pkt_len.
>
> This series gives an 8% improvement in a TX-intensive workload.
>
> (120 Mpps -> 130 Mpps on a Turin host, IDPF with 32 TX queues)
>
> v2: - Fixed issues reported by Jakub (thanks !)
> - Added three patches adding/using qdisc_dequeue_drop() after
> recent regressions with CAKE qdisc reported by Toke.
> More fixes to come later.
>
> v1: https://lore.kernel.org/netdev/20251110094505.3335073-1-edumazet@google.com/T/#m8f562ed148f807c02fd02c6cd243604d449615b9
>
> Eric Dumazet (14):
> net_sched: make room for (struct qdisc_skb_cb)->pkt_segs
> net: init shinfo->gso_segs from qdisc_pkt_len_init()
> net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in
> qdisc_pkt_len_init()
> net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()
> net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()
> net_sched: cake: use qdisc_pkt_segs()
> net_sched: add Qdisc_read_mostly and Qdisc_write groups
> net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()
> net_sched: sch_fq: prefetch one skb ahead in dequeue()
> net: prefech skb->priority in __dev_xmit_skb()
> net: annotate a data-race in __dev_xmit_skb()
> net_sched: add tcf_kfree_skb_list() helper
> net_sched: add qdisc_dequeue_drop() helper
> net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel
As mentioned in the other thread[0], I tested this series and it
definitely seems to improve things, so feel free to add my:
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 23+ messages in thread