* [PATCH net 2/3] net/sched: Handle TC_ACT_REDIRECT from qdisc filter chains
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
To: netdev, bpf
Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
Jamal Hadi Salim
In-Reply-To: <20260626165156.169012-1-jhs@mojatatu.com>
When a TC filter attached to a qdisc filter chain returns
TC_ACT_REDIRECT (ex: via an eBPF program calling bpf_redirect() or an
act_bpf action), the redirect was silently lost i.e no qdisc classify
function handled TC_ACT_REDIRECT, so the packet fell through the
switch and was enqueued normally instead of being redirected.
This has been broken since bpf_redirect() was introduced for TC in
commit 27b29f63058d ("bpf: add bpf_redirect() helper"). We got lucky
for a long time because bpf_net_context was a per-CPU variable that
was always available.
commit 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
turned bpf_net_context into a task_struct member that is only set up by
explicit callers. Without a caller setting it up, bpf_redirect() itself
crashes with a NULL pointer dereference in bpf_net_ctx_get_ri().
The NULL deref is fixed separately by extending the bpf_net_context
lifetime to cover qdisc enqueue. However, even with bpf_net_context
available, TC_ACT_REDIRECT from qdisc filter chains cannot be honored
without adding skb_do_redirect() calls to every qdisc classify
function, which would require changes across net/sched/.
Instead, add a tcf_classify_qdisc() inline helper in pkt_cls.h, as a
wrapper around tcf_classify() for use by qdisc classify functions and
tcf_qevent_handle(). When the classify verdict is TC_ACT_REDIRECT,
the wrapper converts it to TC_ACT_SHOT, dropping the packet rather
than letting it continue silently. Dropping is preferred over
letting the packet through because the user immediately sees packet
loss and, with the help of the rate-limited warning in the log
("use eBPF with clsact or mirred redirect instead"), can quickly
identify and fix the misconfiguration. Silently passing the packet
through would hide the problem and leave the user wondering why their
redirect is not working.
The clsact fast path, tc_run() continues to call tcf_classify() directly
and is unaffected: TC_ACT_REDIRECT is returned as-is and handled by
sch_handle_egress/ingress() calling skb_do_redirect() as before.
Remove the TC_ACT_REDIRECT case from tcf_qevent_handle() since
tcf_classify_qdisc() now converts it to TC_ACT_SHOT, and the existing
SHOT case handles it.
Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
Fixes: 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/pkt_cls.h | 13 +++++++++++++
net/sched/cls_api.c | 6 +-----
net/sched/sch_cake.c | 2 +-
net/sched/sch_drr.c | 2 +-
net/sched/sch_dualpi2.c | 2 +-
net/sched/sch_ets.c | 2 +-
net/sched/sch_fq_codel.c | 2 +-
net/sched/sch_fq_pie.c | 2 +-
net/sched/sch_hfsc.c | 2 +-
net/sched/sch_htb.c | 2 +-
net/sched/sch_multiq.c | 2 +-
net/sched/sch_prio.c | 2 +-
net/sched/sch_qfq.c | 2 +-
net/sched/sch_sfb.c | 2 +-
net/sched/sch_sfq.c | 2 +-
15 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 3bd08d7f39c1..3a542a72e9a5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -159,6 +159,19 @@ static inline int tcf_classify(struct sk_buff *skb,
#endif
+static inline int tcf_classify_qdisc(struct sk_buff *skb,
+ const struct tcf_proto *tp,
+ struct tcf_result *res, bool compat_mode)
+{
+ int ret = tcf_classify(skb, NULL, tp, res, compat_mode);
+
+ if (unlikely(ret == TC_ACT_REDIRECT)) {
+ pr_warn_once("TC_ACT_REDIRECT from qdisc filter chains is not supported; use eBPF with clsact or mirred redirect instead\n");
+ ret = TC_ACT_SHOT;
+ }
+ return ret;
+}
+
static inline unsigned long
__cls_set_class(unsigned long *clp, unsigned long cl)
{
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3e67600a4a1a..3ca56d060e28 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -4033,7 +4033,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
fl = rcu_dereference_bh(qe->filter_chain);
- switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
+ switch (tcf_classify_qdisc(skb, fl, &cl_res, false)) {
case TC_ACT_SHOT:
qdisc_qstats_drop(sch);
__qdisc_drop(skb, to_free);
@@ -4045,10 +4045,6 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
__qdisc_drop(skb, to_free);
*ret = __NET_XMIT_STOLEN;
return NULL;
- case TC_ACT_REDIRECT:
- skb_do_redirect(skb);
- *ret = __NET_XMIT_STOLEN;
- return NULL;
case TC_ACT_CONSUMED:
*ret = __NET_XMIT_STOLEN;
return NULL;
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index a3c185505afc..94eb47ac54ee 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1730,7 +1730,7 @@ static u32 cake_classify(struct Qdisc *sch, struct cake_tin_data **t,
goto hash;
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- result = tcf_classify(skb, NULL, filter, &res, false);
+ result = tcf_classify_qdisc(skb, filter, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 020657f959b5..91b1ef824afa 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -312,7 +312,7 @@ static struct drr_class *drr_classify(struct sk_buff *skb, struct Qdisc *sch,
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
fl = rcu_dereference_bh(q->filter_list);
- result = tcf_classify(skb, NULL, fl, &res, false);
+ result = tcf_classify_qdisc(skb, fl, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index 5434df6ca8ef..98364f74211e 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -364,7 +364,7 @@ static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
return NET_XMIT_SUCCESS;
}
- result = tcf_classify(skb, NULL, fl, &res, false);
+ result = tcf_classify_qdisc(skb, fl, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index cb8cf437ce87..25fcf4079fec 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -391,7 +391,7 @@ static struct ets_class *ets_classify(struct sk_buff *skb, struct Qdisc *sch,
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
if (TC_H_MAJ(skb->priority) != sch->handle) {
fl = rcu_dereference_bh(q->filter_list);
- err = tcf_classify(skb, NULL, fl, &res, false);
+ err = tcf_classify_qdisc(skb, fl, &res, false);
#ifdef CONFIG_NET_CLS_ACT
switch (err) {
case TC_ACT_STOLEN:
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index cafd1f943d99..6cce86ba383c 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -91,7 +91,7 @@ static unsigned int fq_codel_classify(struct sk_buff *skb, struct Qdisc *sch,
return fq_codel_hash(q, skb) + 1;
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- result = tcf_classify(skb, NULL, filter, &res, false);
+ result = tcf_classify_qdisc(skb, filter, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c
index 72f48fa4010b..069e1facd413 100644
--- a/net/sched/sch_fq_pie.c
+++ b/net/sched/sch_fq_pie.c
@@ -96,7 +96,7 @@ static unsigned int fq_pie_classify(struct sk_buff *skb, struct Qdisc *sch,
return fq_pie_hash(q, skb) + 1;
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- result = tcf_classify(skb, NULL, filter, &res, false);
+ result = tcf_classify_qdisc(skb, filter, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 7e537295b8b6..e87f5021a199 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1143,7 +1143,7 @@ hfsc_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
head = &q->root;
tcf = rcu_dereference_bh(q->root.filter_list);
- while (tcf && (result = tcf_classify(skb, NULL, tcf, &res, false)) >= 0) {
+ while (tcf && (result = tcf_classify_qdisc(skb, tcf, &res, false)) >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
case TC_ACT_QUEUED:
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 908b9ba9ba2e..fdac0dc8f35a 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -243,7 +243,7 @@ static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
}
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- while (tcf && (result = tcf_classify(skb, NULL, tcf, &res, false)) >= 0) {
+ while (tcf && (result = tcf_classify_qdisc(skb, tcf, &res, false)) >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
case TC_ACT_QUEUED:
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 4e465d11e3d7..004f0d275caf 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -36,7 +36,7 @@ multiq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
int err;
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- err = tcf_classify(skb, NULL, fl, &res, false);
+ err = tcf_classify_qdisc(skb, fl, &res, false);
#ifdef CONFIG_NET_CLS_ACT
switch (err) {
case TC_ACT_STOLEN:
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index e4dd56a89072..79437c587e7e 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -39,7 +39,7 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
if (TC_H_MAJ(skb->priority) != sch->handle) {
fl = rcu_dereference_bh(q->filter_list);
- err = tcf_classify(skb, NULL, fl, &res, false);
+ err = tcf_classify_qdisc(skb, fl, &res, false);
#ifdef CONFIG_NET_CLS_ACT
switch (err) {
case TC_ACT_STOLEN:
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index cb56787e1d25..6f3b7273cb16 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -709,7 +709,7 @@ static struct qfq_class *qfq_classify(struct sk_buff *skb, struct Qdisc *sch,
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
fl = rcu_dereference_bh(q->filter_list);
- result = tcf_classify(skb, NULL, fl, &res, false);
+ result = tcf_classify_qdisc(skb, fl, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index b1d465094276..ed39869199c0 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -260,7 +260,7 @@ static bool sfb_classify(struct sk_buff *skb, struct tcf_proto *fl,
struct tcf_result res;
int result;
- result = tcf_classify(skb, NULL, fl, &res, false);
+ result = tcf_classify_qdisc(skb, fl, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 758b88f21865..77675f9a4c46 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -171,7 +171,7 @@ static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
return sfq_hash(q, skb) + 1;
*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
- result = tcf_classify(skb, NULL, fl, &res, false);
+ result = tcf_classify_qdisc(skb, fl, &res, false);
if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
switch (result) {
--
2.54.0
^ permalink raw reply related
* [PATCH net 1/3] net: Extend bpf_net_context lifetime to cover qdisc enqueue
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
To: netdev, bpf
Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
Jamal Hadi Salim, Sashiko
In-Reply-To: <20260626165156.169012-1-jhs@mojatatu.com>
The bpf_net_context used by sch_handle_egress() is stack-allocated and torn
down in that function returned. By the time tcf_qevent_handle() runs
current->bpf_net_context is NULL.
When a filter attached to a qevent block (e.g. RED's early_drop or mark
qevents, which always use shared blocks) returns TC_ACT_REDIRECT,
tcf_qevent_handle() calls skb_do_redirect(), which in turn calls bpf helper
bpf_net_ctx_get_ri(). That helper unconditionally dereferences
current->bpf_net_context resulting in a NULL pointer dereference.
Note: The same holds for actions that invoke BPF redirect helpers
(e.g. act_bpf running a program that calls bpf_redirect()) during qevent
classification itself. And as a matter of fact the same assumption is
made in the code outside of tc.
Fix:
Move the bpf_net_context lifecycle out of sch_handle_egress() into
__dev_queue_xmit(), so that it spans both the egress TC fast path and the
qdisc enqueue. The setup is placed outside the egress_needed_key static
branch because qevents are independent of clsact/NF egress hooks and
that key may stay disabled when only a qevent-bearing qdisc is
configured. Unfortunately this adds a small unconditional penalty to the
code path _per packet_ only guarded by CONFIG_NET_XGRESS (two writes and
one read for bpf_net_ctx_set, plus one write for bpf_net_ctx_clear).
This keeps all bpf_net_context management in net/core/dev.c i.e the
existing boundary between tc core and BPF without requiring any net/sched/
code to know about BPF plumbing.
Reproducer (see the accompanying tdc test):
tc qdisc add dev eth0 root handle 1: red limit 1MB min 10KB max 20KB \
avpkt 1000 burst 100 qevent early_drop block 10
tc qdisc add dev eth0 clsact
tc filter add block 10 pref 1 bpf obj redirect.o
tc filter add dev eth0 egress protocol ip prio 1 matchall \
action gact pass
traffic through eth0 triggers red_enqueue() -> tcf_qevent_handle() and,
on a redirect verdict, a NULL deref in skb_do_redirect().
Fixes: 401cb7dae813 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260620130749.226642-1-jhs%40mojatatu.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
net/core/dev.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 4b3d5cfdf6e0..8c214bfff8aa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4527,14 +4527,11 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
{
struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
- struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
int sch_ret;
if (!entry)
return skb;
- bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
-
/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
* already set by the caller.
*/
@@ -4550,12 +4547,10 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
/* No need to push/pop skb's mac_header here on egress! */
skb_do_redirect(skb);
*ret = NET_XMIT_SUCCESS;
- bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
case TC_ACT_SHOT:
kfree_skb_reason(skb, drop_reason);
*ret = NET_XMIT_DROP;
- bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
/* used by tc_run */
case TC_ACT_STOLEN:
@@ -4565,10 +4560,8 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
fallthrough;
case TC_ACT_CONSUMED:
*ret = NET_XMIT_SUCCESS;
- bpf_net_ctx_clear(bpf_net_ctx);
return NULL;
}
- bpf_net_ctx_clear(bpf_net_ctx);
return skb;
}
@@ -4767,6 +4760,9 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
*/
int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
+#ifdef CONFIG_NET_XGRESS
+ struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx = NULL;
+#endif
struct net_device *dev = skb->dev;
struct netdev_queue *txq = NULL;
enum skb_drop_reason reason;
@@ -4795,6 +4791,9 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
skb_update_prio(skb);
tcx_set_ingress(skb, false);
+#ifdef CONFIG_NET_XGRESS
+ bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+#endif
#ifdef CONFIG_NET_EGRESS
if (static_branch_unlikely(&egress_needed_key)) {
if (nf_hook_egress_active()) {
@@ -4898,12 +4897,18 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
reason = SKB_DROP_REASON_RECURSION_LIMIT;
drop:
+#ifdef CONFIG_NET_XGRESS
+ bpf_net_ctx_clear(bpf_net_ctx);
+#endif
rcu_read_unlock_bh();
dev_core_stats_tx_dropped_inc(dev);
kfree_skb_list_reason(skb, reason);
return rc;
out:
+#ifdef CONFIG_NET_XGRESS
+ bpf_net_ctx_clear(bpf_net_ctx);
+#endif
rcu_read_unlock_bh();
return rc;
}
--
2.54.0
^ permalink raw reply related
* [PATCH net 0/3] Fix broken TC_ACT_REDIRECT
From: Jamal Hadi Salim @ 2026-06-26 16:51 UTC (permalink / raw)
To: netdev, bpf
Cc: davem, edumazet, kuba, pabeni, horms, toke, jiri, bigeasy,
clrkwllms, rostedt, kuniyu, sdf.kernel, skhawaja, liuhangbin,
krikku, mkarsten, victor, ast, hawk, john.fastabend, daniel,
Jamal Hadi Salim
When sashiko-gemini[1] reviewed commit a8a02897f2b4
("net/sched: cls_api: Handle TC_ACT_CONSUMED in tcf_qevent_handle") it
correctly pointed out the following:
"
This is a pre-existing issue, but does executing a redirect via a qevent
filter cause a NULL pointer dereference?
When tcf_qevent_handle() processes a TC_ACT_REDIRECT, it calls
skb_do_redirect(). This eventually calls bpf_net_ctx_get_ri() which
dereferences the task bpf_net_context:
include/linux/filter.h:bpf_net_ctx_get_ri() {
...
struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
if (!(bpf_net_ctx->ri.kern_flags & BPF_RI_F_RI_INIT)) {
...
}
Since qevents are evaluated during enqueue, which runs within
__dev_queue_xmit() after sch_handle_egress() has already executed and
cleared the bpf_net_context pointer, will this dereference a NULL pointer?
"
That issue is fixed in patch 1. See the commit log for details.
Upon further investigation it turns out that TC_ACT_REDIRECT being returned
from the egress qdiscs never actually worked. When an action returns that
code we would silently loose it and the packet will never be redirected.
After all those years, if nobody complained, my gut feel is it was never
used by anyone with serious need for it.
Patch 2 fixes it by 1) putting a warning out when someone does and 2) asking
the core to drop the packet. At least this would help whoever is
misconfiguring to diagnose the issue much faster.
I had initially attempted to "fix" this and make it work, but unfortunately
it's a bit ugly so i left i didnt think it was worth fixing
Apologies for the shotgun Cc - its what get_maintainer.pl told me to use.
[1] https://sashiko.dev/#/patchset/20260620130749.226642-1-jhs%40mojatatu.com
^ permalink raw reply
* Re: [PATCH net v3] net: wwan: iosm: bound device offsets in the MUX downlink decoder
From: Simon Horman @ 2026-06-26 16:37 UTC (permalink / raw)
To: Maoyi Xie
Cc: Loic Poulain, Sergey Ryazanov, Johannes Berg, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev, linux-kernel, stable
In-Reply-To: <178236824878.3259367.5389624724479864947@maoyixie.com>
On Thu, Jun 25, 2026 at 02:17:28PM +0800, Maoyi Xie wrote:
> mux_dl_adb_decode() walks a chain of aggregated datagram tables using
> offsets and lengths taken from the modem. first_table_index,
> next_table_index, table_length, datagram_index and datagram_length are
> all device supplied le values. Only first_table_index was checked, and
> only for being non zero. The decoder then formed adth = block +
> adth_index and read the table header and the datagram entries with no
> bound against the received skb. A modem that reports an index or a
> length past the downlink buffer makes the decoder read out of bounds.
>
> The buffer is IPC_MEM_MAX_DL_MUX_LITE_BUF_SIZE and skb->len is at most
> that, so skb->len is the real limit, but none of these in band offsets
> were checked against it.
>
> The table chain is also followed with no forward progress check. The loop
> takes the next table from adth->next_table_index and stops only when that
> reaches zero. A modem can stage two tables that point at each other, so
> the loop never ends. It runs in softirq and clones the skb on every pass.
>
> Validate every device offset and length against skb->len before use.
> The block header must fit. Each table header, on entry and after every
> next_table_index, must lie inside the skb. The datagram table must fit.
> Each datagram index and length must stay inside the skb. The header
> padding must not exceed the datagram length so the receive length does
> not wrap. Require each next_table_index to move forward so the chain
> cannot cycle.
>
> This was reproduced under KASAN as a slab out of bounds read on a normal
> downlink receive once the iosm net device is up.
>
> Fixes: 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support")
> Suggested-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* Re: [PATCH net v4 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Petr Wozniak @ 2026-06-26 16:35 UTC (permalink / raw)
To: maxime.chevallier, olek2, linux, andrew, hkallweit1
Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
bjorn, kabel, Petr Wozniak
In-Reply-To: <20260624084814.20972-3-petr.wozniak@gmail.com>
Maxime Chevallier wrote:
> I finally got time to test this with a RollBall module, and I
> confirm what Aleksander says, the RollBall module's PHY doesn't
> get detected even with this patch. It does work on v7.0 though,
> so before the bridge probing was introduced.
Thanks a lot for taking the time to test, Maxime - and Aleksander
for the original report.
That settles it: the deferred-probe approach in patch 2/2 doesn't
actually restore genuine RollBall PHY detection, and as you both
confirm it worked before 8fe125892f40 introduced the bridge probing.
Sashiko's static review flagged the same thing (the probe bypasses the
PHY discovery retry loop for slow-initializing modules), so the static
analysis and the two hardware reports all point at the same flaw.
I only have RTL8261BE-based copper modules here, not a genuine RollBall
one, so I can't develop and verify a proper slow-init timing fix
(module_t_wait / a retry that waits for the bridge) without the
hardware to test against.
Given that, my suggestion:
- Please drop patch 2/2 from the series.
- Since 8fe125892f40 regressed genuine RollBall detection and the
deferred probe doesn't restore it, I think the cleanest fix is to
revert 8fe125892f40. I'm happy to send that revert if you'd prefer.
The 5-minute RTL8261BE probe loop it was addressing is handled in our
downstream tree, so reverting it upstream is fine on our side.
- Patch 1/2 (the mii_bus leak fix) is independent of all this and
already has Reviewed-by from Maxime and Larysa - it would be good to
take that one regardless. I can resend it standalone if that's easier.
A proper fix covering slow-firmware modules really needs a genuine
RollBall module to validate, so it's better owned by someone who has
that hardware - happy to help review.
Thanks again,
Petr
^ permalink raw reply
* Re: [RFC net-next 00/17] MPTCP KTLS support
From: Jakub Kicinski @ 2026-06-26 16:32 UTC (permalink / raw)
To: Geliang Tang
Cc: Matthieu Baerts, Mat Martineau, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Neal Cardwell, Kuniyuki Iwashima,
John Fastabend, Sabrina Dubroca, Hannes Reinecke, Geliang Tang,
netdev, mptcp, Gang Yan, Zqiang
In-Reply-To: <a0358bb3aa8e2d45bd1e74851416728c3f08ddff.camel@kernel.org>
On Fri, 26 Jun 2026 15:56:53 +0800 Geliang Tang wrote:
> On Mon, 2026-06-22 at 09:00 -0700, Jakub Kicinski wrote:
> > On Mon, 22 Jun 2026 18:43:20 +0800 Geliang Tang wrote:
> > > Subject: [RFC net-next 00/17] MPTCP KTLS support
> >
> > Please no. We have a ton of unfixed bugs and may have to revert some
> > of
> > the features we dropped back in. I'd prefer to avoid large new bug
> > surfaces until we reach an LTS release.
>
> Sure, I can wait. During this time, I'll go over the implementation
> more carefully to make sure there are no issues on the MPTCP side.
The two things that I'd be most interested in understanding better are
1) perf impact on all the pointer chasing and indirect calls you're
adding
2) what's your plan on implementing HW offload? I understand that _you_
may not care, but if you follow the ML you'll discover that sooner
or later some well meaning person will slop code such support.
We have pretty solid (by kernel standards) documentation for the
offload. Please read if you haven't already. And then you should
hopefully realize the huge issue with the layering you chose...
^ permalink raw reply
* [PATCH] net: lan743x: Initialize eth_syslock spinlock before use
From: Andrea Righi @ 2026-06-26 16:32 UTC (permalink / raw)
To: Bryan Whitehead, UNGLinuxDriver
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Raju Lakkaraju, David Thompson, netdev, linux-kernel
lan743x_hardware_init() calls pci11x1x_strap_get_status() during the
PCI11x1x probe sequence. That helper acquires the Ethernet subsystem
hardware lock via lan743x_hs_syslock_acquire(), which relies on
adapter->eth_syslock_spinlock to serialize access.
The spinlock is currently initialized only after the strap status is
read. With CONFIG_DEBUG_SPINLOCK enabled, taking the zeroed initialized
spinlock can trip the spinlock debug check.
Fix by initializing adapter->eth_syslock_spinlock before reading the
strap status so the probe path never attempts to lock an uninitialized
spinlock.
Fixes: 46b777ad9a8c ("net: lan743x: Add support to SGMII 1G and 2.5G")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
drivers/net/ethernet/microchip/lan743x_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index 1cdce35e14239..e759171bfd766 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -3541,8 +3541,8 @@ static int lan743x_hardware_init(struct lan743x_adapter *adapter,
adapter->max_tx_channels = PCI11X1X_MAX_TX_CHANNELS;
adapter->used_tx_channels = PCI11X1X_USED_TX_CHANNELS;
adapter->max_vector_count = PCI11X1X_MAX_VECTOR_COUNT;
- pci11x1x_strap_get_status(adapter);
spin_lock_init(&adapter->eth_syslock_spinlock);
+ pci11x1x_strap_get_status(adapter);
mutex_init(&adapter->sgmii_rw_lock);
pci11x1x_set_rfe_rd_fifo_threshold(adapter);
sgmii_ctl = lan743x_csr_read(adapter, SGMII_CTL);
--
2.54.0
^ permalink raw reply related
* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Stanislav Fomichev @ 2026-06-26 16:30 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Jason Xing, netdev, bpf, magnus.karlsson, stfomichev, kuba,
pabeni, horms, bjorn
In-Reply-To: <aj4tUMqAXmQskO6r@boxer>
On 06/26, Maciej Fijalkowski wrote:
> On Thu, Jun 25, 2026 at 09:05:28AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Jason Xing wrote:
> > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > >
> > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > make userspace lose track of those buffers.
> > > > > >
> > > > > > The generic path fixes cover three related cases:
> > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > > continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > > reports overflow;
> > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > > packet, including the offending invalid descriptor itself.
> > > > > >
> > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > address area and published in completion order, after any earlier
> > > > > > driver-visible Tx descriptors.
> > > > > >
> > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > state local to the singular batched path, the series prevents a second
> > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > >
> > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > still not expecting those invalid packets on the Rx side.
> > > > > >
> > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > against ice driver.
> > > > >
> > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > question from me is: is it time to scrap our current approach where
> > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > half-consumed descriptor queues)
> > > > >
> > > > > Should we:
> > > > >
> > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > and do a full stop here)
> > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > the cq space (if not -> EAGAIN)
> > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > locally, don't ever create semi-initialized skb)
> > > > > 4. construct the skb
> > > > > 5. xmit
> > > >
> > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > reviews yet, but bare in mind this set also aligns zc side to what was
> > > > previously being addressed by Jason.
> > > >
> > > > I believe planned logistics were to get these fixes onto net and then
> > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > -next and that's where we could address current flow.
> > >
> > > Agreed. That's what I'm hoping for. There would be much more
> > > discussion on how to do batch xmit in an elegant way, I believe.
> >
> > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > this non-zc in net, this is still technically fixes, not feature work..
> >
> > There was already a couple of revisions with this drain_cont approach
> > and every time I look at it feels like the cure is worse than the
> > decease :-( Obviously not gonna stop you from going with the current approach,
> > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > coming and we are piling more complexity).
>
> Well this is my fault as I took Jason's patches as-is and did not realize
> Sashiko had issues with it. I *think* I got ZC side almost right so I'd
> like to have at least one last round with trying to make the generic side
> right...
You can have my ack on the series since I already did a few revisions
on the original series from Jason, but it is still a very low confidence
ack:
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
^ permalink raw reply
* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Stanislav Fomichev @ 2026-06-26 16:25 UTC (permalink / raw)
To: Jason Xing
Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
kuba, pabeni, horms, bjorn
In-Reply-To: <CAL+tcoDOFtTKavz8TWpU6BPrHCcL6TjY8xrmFA5y3P9Wn0T4ZA@mail.gmail.com>
On 06/26, Jason Xing wrote:
> On Fri, Jun 26, 2026 at 12:05 AM Stanislav Fomichev
> <sdf.kernel@gmail.com> wrote:
> >
> > On 06/25, Jason Xing wrote:
> > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > >
> > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > make userspace lose track of those buffers.
> > > > > >
> > > > > > The generic path fixes cover three related cases:
> > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > > continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > > reports overflow;
> > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > > packet, including the offending invalid descriptor itself.
> > > > > >
> > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > address area and published in completion order, after any earlier
> > > > > > driver-visible Tx descriptors.
> > > > > >
> > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > state local to the singular batched path, the series prevents a second
> > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > >
> > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > still not expecting those invalid packets on the Rx side.
> > > > > >
> > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > against ice driver.
> > > > >
> > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > question from me is: is it time to scrap our current approach where
> > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > half-consumed descriptor queues)
> > > > >
> > > > > Should we:
> > > > >
> > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > and do a full stop here)
> > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > the cq space (if not -> EAGAIN)
> > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > locally, don't ever create semi-initialized skb)
> > > > > 4. construct the skb
> > > > > 5. xmit
> > > >
> > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > reviews yet, but bare in mind this set also aligns zc side to what was
> > > > previously being addressed by Jason.
> > > >
> > > > I believe planned logistics were to get these fixes onto net and then
> > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > -next and that's where we could address current flow.
> > >
> > > Agreed. That's what I'm hoping for. There would be much more
> > > discussion on how to do batch xmit in an elegant way, I believe.
> >
> > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > this non-zc in net, this is still technically fixes, not feature work..
> >
> > There was already a couple of revisions with this drain_cont approach
> > and every time I look at it feels like the cure is worse than the
> > decease :-( Obviously not gonna stop you from going with the current approach,
> > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > coming and we are piling more complexity).
>
> I see your point, but rewriting is something that cannot be easily
> applied to the stable branches? Until now, we fix issues one by one
> which have an explicit target branch (because of the fixes tag). Cross
> fingers :(
>
> Sashiko has the magic to find out the hidden bugs more than ever and
> AF_XDP is not the only place where a pile of reports are coming in.
net vs net-next is fixes vs feature work. If we can't fix the current
code, I think we can justify a rewrite using a better approach and
route it via net. This series is 7 patches anyway, it's not like
it is a quick short fix :-) But I'm ok with pushing it as it, I'm just
trying to see if someone on your side is fed up with that part as well
and wants to fix it "properly" :-p
> My take is that batch xmit has been appending too long and at least so
> far less and less bugs are found by sashiko. I believe if the mode is
> changed to batch xmit, there are likely to be new and challenging
> problems to discuss. I prefer to solve questions of the batch xmit
> series.
We can redo this part separately, without batching. Move from "read
one chunk at a time" to "pre-read all chunks". Batching vs current issue
are separate.
> BTW, would you both come to Netdev 0x1a next month? I believe we could
> sit around the table and discuss some future plans there (in xdp
> workshop?).
> https://netdevconf.info/0x1A/sessions/workshop/xdp-workshop.html
Yes, I plan to be there in person.
^ permalink raw reply
* Re: [PATCH bpf-next v5 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: David Ahern @ 2026-06-26 16:25 UTC (permalink / raw)
To: Avinash Duduskar, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
toke
In-Reply-To: <20260624030530.3342884-2-avinash.duduskar@gmail.com>
On 6/23/26 9:05 PM, Avinash Duduskar wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 89b36de5fdbb..e00f0392e728 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3532,6 +3532,29 @@ union bpf_attr {
> * Use the mark present in *params*->mark for the fib lookup.
> * This option should not be used with BPF_FIB_LOOKUP_DIRECT,
> * as it only has meaning for full lookups.
> + * **BPF_FIB_LOOKUP_VLAN**
This flag should not be needed. Patches for vlan support were never
submitted (I have them in some old branch). Since the vlan params are
initialized to 0, no new flag should be needed. Besides, these are
output parameters.
> + * If the fib lookup resolves to a VLAN device whose
> + * parent is a real (non-VLAN) device, set
> + * *params*->h_vlan_proto and *params*->h_vlan_TCI from
> + * the VLAN device and replace *params*->ifindex with the
> + * parent's ifindex. *params*->h_vlan_TCI carries the VID
> + * only, with PCP and DEI bits zero; a consumer wanting to
> + * set egress priority writes PCP itself. *params*->smac is
> + * the VLAN device's own address, which can differ from the
> + * parent's. Only the immediate parent is resolved; if it
> + * is itself a VLAN device (QinQ) or in another namespace,
> + * the egress cannot be reduced to a physical device plus
> + * one tag and the lookup returns
> + * **BPF_FIB_LKUP_RET_VLAN_FAILURE** with *params*->ifindex
> + * left at the input. Re-issue without
> + * **BPF_FIB_LOOKUP_VLAN** to obtain the VLAN device's own
> + * ifindex. The swap and the vlan fields
> + * are written only on success; other output fields keep
> + * the helper's existing behaviour, so a frag-needed result
> + * still reports the route mtu in *params*->mtu_result.
> + * This flag is only valid for XDP programs; tc programs
> + * receive -EINVAL since they can redirect to the VLAN
> + * device directly.
^ permalink raw reply
* Re: [PATCH v3] net: fman: fix use-after-free on IRQF_SHARED handler after probe failure
From: Simon Horman @ 2026-06-26 16:23 UTC (permalink / raw)
To: 赵金明
Cc: Andrew Lunn, andrew+netdev, davem, edumazet, kuba, linux-kernel,
madalin.bucur, netdev, pabeni, sean.anderson
In-Reply-To: <A2160E1BFD1B78E6+202606261753007787241@uniontech.com>
On Fri, Jun 26, 2026 at 05:53:02PM +0800, 赵金明 wrote:
> Hi,
>
> The analysis is logically correct. Since fman is zero-initialized by
> kzalloc_obj(), both fman->cfg and fman->fpm_regs are NULL when
> devm_request_irq() registers the shared IRQ handler. The guard in
> fman_irq():
>
> if (!is_init_done(fman->cfg))
> return IRQ_NONE;
>
> does not protect against this case because is_init_done(NULL) returns
> true, so the handler would proceed to dereference the NULL
> fpm_regs pointer via ioread32be().
>
> However, this is a pre-existing issue unrelated to the UAF fix in this
> patch. The window is very short -- between devm_request_irq() and the
> completion of fman_init() -- and would require another device on the
> same shared IRQ line to fire an interrupt during that interval.
>
> If this should be addressed, I will send a separate patch for it.
> The current patch is focused solely on the post-IRQ-registration UAF
> on error paths.
>
> Please let me know if you would like me to handle this separately.
Thanks, I agree this can be handled separately.
^ permalink raw reply
* Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
From: Stanislav Fomichev @ 2026-06-26 16:18 UTC (permalink / raw)
To: Mahe Tardy
Cc: bpf, andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms
In-Reply-To: <aj1b_z6h5xn42Hxe@gmail.com>
On 06/25, Mahe Tardy wrote:
> On Thu, Jun 25, 2026 at 09:24:59AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Mahe Tardy wrote:
>
> [...]
>
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > + struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > + struct sk_buff *nskb;
> > > + struct sock *sk;
> > > +
> > > + sk = skb_to_full_sk(skb);
> > > + if (sk && sk->sk_kern_sock &&
> > > + (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> > > + return -EBUSY;
> > > +
> > > + switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > + case htons(ETH_P_IP): {
> > > + if (type != ICMP_DEST_UNREACH)
> > > + return -EOPNOTSUPP;
> > > + if (code < 0 || code > NR_ICMP_UNREACH ||
> > > + code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> > > + return -EINVAL;
> > > +
> > > + /* icmp_send expects skb_dst to be a real rtable. */
> > > + if (!skb_valid_dst(skb))
> > > + return -ENETUNREACH;
> > > +
> > > + nskb = skb_clone(skb, GFP_ATOMIC);
> > > + if (!nskb)
> > > + return -ENOMEM;
> > > +
> > > + memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> > > + icmp_send(nskb, type, code, 0);
> > > + consume_skb(nskb);
> > > + break;
> > > + }
> > > +#endif
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > + case htons(ETH_P_IPV6):
> > > + if (type != ICMPV6_DEST_UNREACH)
> > > + return -EOPNOTSUPP;
> > > + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > > + return -EINVAL;
> >
> > [..]
> >
> > > + /* icmpv6_send may treat skb_dst as rt6_info. */
> > > + if (skb_metadata_dst(skb))
> > > + return -ENETUNREACH;
> >
> > A bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
> > (I see the original sashiko report about dst, but icmp6 seems to be not
> > requiring it)
>
> Yeah I was also a bit confused because this came out of nowhere as soon
> as I put the skb_valid_dst only on the IPv4 path (for different
> reasons), but there is actually a potential trace in which we have type
> confusion indeed:
>
> - icmp6_send() checks scoped source addresses and calls icmp6_iif() at net/ipv6/icmp.c:702
> - icmp6_iif() calls icmp6_dev() at net/ipv6/icmp.c:441
> - icmp6_dev() does skb_rt6_info(skb) for loopback/L3 master devices at net/ipv6/icmp.c:428
> - skb_rt6_info() casts any non-NULL dst to struct rt6_info at include/net/ip6_route.h:233
> - rt6->rt6i_idev is then dereferenced at net/ipv6/icmp.c:434
>
> When checking with pahole, we can find this on my local kernel:
>
> struct rt6_info {
> struct dst_entry dst; /* 0 136 */
> /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> struct fib6_info * from; /* 136 8 */
> int sernum; /* 144 4 */
> struct rt6key rt6i_dst; /* 148 20 */
> struct rt6key rt6i_src; /* 168 20 */
> struct in6_addr rt6i_gateway; /* 188 16 */
>
> /* XXX 4 bytes hole, try to pack */
>
> /* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
> struct inet6_dev * rt6i_idev; /* 208 8 */ <--- we dereference this
> u32 rt6i_flags; /* 216 4 */
> short unsigned int rt6i_nfheader_len; /* 220 2 */
>
> /* size: 224, cachelines: 4, members: 9 */
> /* sum members: 218, holes: 1, sum holes: 4 */
> /* padding: 2 */
> /* last cacheline: 32 bytes */
> };
>
> And the metadata_dst would look like this:
>
> struct metadata_dst {
> struct dst_entry dst; /* 0 136 */
> /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> enum metadata_type type; /* 136 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> union {
> struct ip_tunnel_info tun_info; /* 144 96 */
> struct hw_port_info port_info; /* 144 16 */
> struct macsec_info macsec_info; /* 144 8 */
> struct xfrm_md_info xfrm_info; /* 144 16 */
> } u; /* 144 96 */ <--- we land on this union
>
> /* size: 240, cachelines: 4, members: 3 */
> /* sum members: 236, holes: 1, sum holes: 4 */
> /* last cacheline: 48 bytes */
> };
>
> Let's say it's a struct ip_tunnel_info:
>
> struct ip_tunnel_info {
> struct ip_tunnel_key key; /* 0 64 */
>
> /* XXX last struct has 7 bytes of padding */
>
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct ip_tunnel_encap encap; /* 64 8 */ <--- 144 + 64 = 208 we land here
> struct dst_cache dst_cache; /* 72 16 */
> u8 options_len; /* 88 1 */
> u8 mode; /* 89 1 */
>
> /* size: 96, cachelines: 2, members: 5 */
> /* padding: 6 */
> /* paddings: 1, sum paddings: 7 */
> /* last cacheline: 32 bytes */
> };
>
> So I imagine this is fairly tricky to trigger but still a case of type
> confusion. I have actually no idea how likely this can happen from my
> call but the trace makes sense at least.
That logic seems to exist for the icmp6_send to find the input device
(since the expected use-case for calling icmp6_send is to the incoming
skb). And since you're mainly doing egress, I don't think this path will
ever trigger (iow the check is not needed)?
Maybe you can add cgroup_ingress test case? Looks like this rt6_info
path might trigger for ipv6 lo? I don't see any ingress test in your
series, so might be good to have one regardless?
^ permalink raw reply
* Re: [PATCH] fix: net: mediatek: mtk_star_mdio_init: fix double of_node_put after devm_of_mdiobus_register
From: Andrew Lunn @ 2026-06-26 16:12 UTC (permalink / raw)
To: WenTao Liang
Cc: Daniel Borkmann, netdev, David S . Miller, Jakub Kicinski,
Paolo Abeni, stable, linux-kernel
In-Reply-To: <20260626152009.51599-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:20:09PM +0800, WenTao Liang wrote:
> After devm_of_mdiobus_register succeeds, the mdio_node reference
> ownership is transferred to the mii_bus device (released via
> mdiobus_release on device teardown). However, the function
> unconditionally calls of_node_put(mdio_node) after registration, causing
> a double put.
>
> Only call of_node_put when devm_of_mdiobus_register fails (i.e., when
> ownership was not transferred). On success, the bus driver manages the
> reference lifecycle.
>
> Cc: stable@vger.kernel.org
> Fixes: 9ed0a3fac08b ("net: ethernet: mtk-star-emac: use devm_of_mdiobus_register()")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH] fix: net: ti: cpsw_probe_dt: fix phy_node reference leak on error paths
From: Andrew Lunn @ 2026-06-26 16:12 UTC (permalink / raw)
To: WenTao Liang
Cc: Siddharth Vadapalli, Roger Quadros, netdev, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-omap, stable, linux-kernel
In-Reply-To: <20260626152906.52112-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:29:06PM +0800, WenTao Liang wrote:
> After slave_data->phy_node is assigned (via of_node_get or
> of_parse_phandle), if subsequent calls like of_get_phy_mode or
> ti_cm_get_macid fail, the error path jumps to err_node_put which only
> releases the loop's port_np reference but not the phy_node reference.
> This causes a device_node reference leak.
>
> Release slave_data->phy_node via of_node_put before jumping to
> err_node_put on error paths after phy_node has been acquired.
>
> Cc: stable@vger.kernel.org
> Fixes: ed3525eda4c4 ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH] fix: net: ti: cpsw_init_common: fix excess of_node_put on parent node when cpts child not found
From: Andrew Lunn @ 2026-06-26 16:12 UTC (permalink / raw)
To: WenTao Liang
Cc: Siddharth Vadapalli, Roger Quadros, netdev, Andrew Lunn,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-omap, stable, linux-kernel
In-Reply-To: <20260626152945.52192-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:29:45PM +0800, WenTao Liang wrote:
> When no "cpts" child node exists in the device tree, cpts_node is
> assigned cpsw->dev->of_node without taking a reference via of_node_get.
> The function then unconditionally calls of_node_put(cpts_node) at the
> end, causing an excess put on the parent device node which can lead to a
> refcount underflow.
>
> Use of_node_get when falling back to the parent node to ensure the
> reference count is properly balanced with the subsequent of_node_put.
>
> Cc: stable@vger.kernel.org
> Fixes: ed3525eda4c4 ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH net v2 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-26 16:11 UTC (permalink / raw)
To: Simon Horman
Cc: netdev, davem, edumazet, kuba, pabeni, jiri, victor, security,
zdi-disclosures, stable, kernel test robot
In-Reply-To: <20260626141547.GA1310988@horms.kernel.org>
Hi Simon,
On Fri, Jun 26, 2026 at 10:15 AM Simon Horman <horms@kernel.org> wrote:
>
> On Fri, Jun 26, 2026 at 06:16:43AM -0400, Jamal Hadi Salim wrote:
> > "
> >
> > On Wed, Jun 24, 2026 at 6:40 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > The teql master->slaves singly linked list is not protected against
> > > multiple writes. It can be mod'ed concurently from teql_master_xmit(),
> > > teql_dequeue(), teql_init() and teql_destroy() without holding any list
> > > lock or RCU protection.
> > >
> > > zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
> > > after an RCU grace period, but teql_master_xmit() running on another
> > > CPU can still hold a stale pointer into the list, resulting in a
> > > slab-use-after-free:
> > >
> > > BUG: KASAN: slab-use-after-free in teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
> > > Read of size 8 at addr ffff88802923aa80 by task ip/10024
> > >
> > > The zdi-disclosures@trendmicro.com repro created concurrent AF_PACKET
> > > senders on a teql device against a thread that repeatedly adds/deletes the
> > > slave qdisc, together with a SLUB spray that reclaims the freed slot; the
> > > resulting UAF is controllable enough to be turned into a read/write
> > > primitive against the freed qdisc object.
> > >
> > > The fix?
> > > Add a per-master slaves_lock spinlock that serializes all mutations of
> > > master->slaves and the NEXT_SLAVE() links in teql_destroy() and
> > > teql_qdisc_init(). teql_master_xmit() also takes the same slaves_lock
> > > around those updates.
> > > Annotate master->slaves and the per-slave ->next pointer with __rcu and
> > > use the appropriate RCU accessors everywhere they are touched:
> > > rcu_assign_pointer() on the writer side (under slaves_lock),
> > > rcu_dereference_protected() for the writer-side loads (also under
> > > slaves_lock), rcu_dereference_bh() for the loads in teql_master_xmit() and
> > > rtnl_dereference() for the loads in teql_master_open()/teql_master_mtu(),
> > > which run under RTNL.
> > > Pair this with rcu_read_lock_bh()/rcu_read_unlock_bh() around the list
> > > traversal in teql_master_xmit(), so that readers either observe a fully
> > > linked list or are deferred until the in-flight mutation completes. The two
> > > early-return paths in teql_master_xmit() are updated to release the RCU-bh
> > > read-side critical section before returning, since leaving it held would
> > > disable BH on that CPU for good.
> > >
> >
> > sashiko-gemini's complaints:
> > https://sashiko.dev/#/patchset/20260624224016.24018-1-jhs%40mojatatu.com
> > seem bogus to me (someone correct me if i am wrong). I am only going
> > to address the first claim of "TOCTOU / "resurrection" race in
> > teql_master_xmit()"
> > teql_master_xmit() holds rcu_read_lock_bh() across the entire
> > traversal. teql_destroy() freeing can only proceed once the qdisc's
> > RCU grace period has elapsed - so where is this TOCTOU? Let's say this
> > were true: both calls hold the slaves_lock.
> > The other issues are of similar nature.
>
> Hi Jamal,
>
> I think the central question here is about the protection offered by RCU
> in these code paths. And while I agree it protects the use of elements
> of the list, I think the problem flagged relates to the management of
> the list itself.
>
> The example AI gave me when I asked is like this:
>
> Assume a TEQL master has one slave, `q`.
> The list is circular: `q->next == q`.
>
> 1. CPU A (Transmitting): Enters `teql_master_xmit()`.
> It reads `master->sla ves` and gets a local pointer to `q`.
>
> 2. CPU B (Destroying): Calls `teql_destroy(q)`.
> It takes the lock, unlinks `q`, and sets `master->slaves = NULL`.
> The list is now logically empty.
>
> 3. CPU A: Finishes its work and prepares to rotate the list head
> to the next slave.
> It takes the lock.
>
> 4. CPU A (The "Use" / The Resurrection):
> It executes: `rcu_assign_pointer(master->slaves, NEXT_SLAVE(q));`
> Because `q` was circular, `NEXT_SLAVE(q)` is still `q`.
>
> 5. CPU A: Releases the lock.
> **The global `master->slaves` is now `q` again.**
>
> 6. The System: The RCU grace period expires. CPU B finishes
> `teql_destroy()` and the memory for `q` is freed.
>
> The global `master->slaves` pointer is now a **dangling pointer**
> pointing to freed memory.
>
Yeah, thats the same earlier claim of TOCTOU (what sashiko-gemini
claimed was "resurrecting the freed q")
My view is rcu read lock blocks the subsequent call_rcu free - and
destroy() and xmit() already serialize on slaves_lock.
I could be totaly wrong, but it's almost like sashiko-gemini thinks
that the list-mutation lock _alone_ governs the object lifetime.
The rcu read-side critical section prevents the UAF, not just the
slaves_lock alone
Only reason i added slaves_lock was to prevent corrupting the list
state (whereas the RCU read lock prevents premature free).
In step #4 above this thing somehow leaves out any mention of the rcu
read lock entirely and places the free in step 6 as if it was
independent of CPU A's critical section.
I am not sure how to improve it.
> > OTOH, sashiko-claude
> > (https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260624224016.24018-1-jhs%40mojatatu.com)
> > does make some valid claims which are low value, so not sure a resend
> > is worth it.
> > For example in claim 1 it says "Should the changelog mention this
> > teql_dequeue() site too?" Sure I can - but just because I provided
> > extra information in the commit log, which I could have omitted, now I
> > have to add more info? ;->
>
> FWIIW, I think there is a value in tightening up the commit message.
> E.g. so it's accurate when we look at again in two years time.
> But I also lean towards it not being necessary to post an update
> only to address this.
>
>
> > The second claim is "rcu_dereference_bh()
> > should be rcu_dereference_protected() on writer side". Sparse didnt
> > complain and i dont see this as breakage rather a consistency measure.
>
> I think it would be good to address in the long run. But as per my comment
> immediately above, I also lean towards it not being necessary to post an
> update only to address this.
I can resend with these two taken care of - but i am skeptical of what
sashiko-gemini is claiming (and i admit as a human the AI may see
something i am totally missing).
cheers,
jamal
>
> > Unless I am missing something ..
> >
> > cheers,
> > jamal
^ permalink raw reply
* Re: [PATCH] fix: net: ti: cpts_of_mux_clk_setup: fix device_node reference leak on success path
From: Andrew Lunn @ 2026-06-26 16:11 UTC (permalink / raw)
To: WenTao Liang
Cc: Andrew Lunn, netdev, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Grygorii Strashko, linux-omap,
stable, linux-kernel
In-Reply-To: <20260626153148.52612-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:31:48PM +0800, WenTao Liang wrote:
> of_get_child_by_name acquires a device_node reference for refclk_np, and
> of_clk_add_hw_provider additionally takes an extra reference internally.
> On the success path, the function returns 0 without calling
> of_node_put(refclk_np), leaking the initial reference.
>
> Add of_node_put(refclk_np) before returning success to properly release
> the acquired reference.
>
> Cc: stable@vger.kernel.org
> Fixes: a3047a81ba13 ("net: ethernet: ti: cpts: add support for ext rftclk selection")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH] fix: net: mdio: of_phy_register_fixed_link: fix phy_device reference leak via discarded pointer
From: Andrew Lunn @ 2026-06-26 16:11 UTC (permalink / raw)
To: WenTao Liang
Cc: Andrew Lunn, netdev, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, stable, linux-kernel
In-Reply-To: <20260626153330.52741-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:33:30PM +0800, WenTao Liang wrote:
> fixed_phy_register returns a struct phy_device pointer with a held
> reference on success. However, the register_phy label discards the
> pointer via PTR_ERR_OR_ZERO, and the phy_device's fwnode is not set, so
> of_phy_deregister_fixed_link cannot find the device via
> bus_find_device_by_fwnode to clean it up. This permanently leaks the
> phy_device and its device_node reference.
>
> Store the returned phy_device pointer and set dev->fwnode so the
> deregister path can properly locate and clean it up.
>
> Cc: stable@vger.kernel.org
> Fixes: 24c30dbbcdda ("of/mdio: Add support function for Ethernet fixed-link property")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH] fix: net: phy: phy_sfp_probe: fix kref leak when phy_setup_sfp_port fails after sfp_bus_add_upstream
From: Andrew Lunn @ 2026-06-26 16:11 UTC (permalink / raw)
To: WenTao Liang
Cc: Andrew Lunn, netdev, Heiner Kallweit, Russell King,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
stable, linux-kernel
In-Reply-To: <20260626153443.52842-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:34:43PM +0800, WenTao Liang wrote:
> sfp_bus_add_upstream unconditionally acquires a kref on the SFP bus. When
> this call succeeds but the subsequent phy_setup_sfp_port fails, the error
> path returns without releasing the upstream reference. Since probe fails,
> the device's remove function (which would normally clean this up) will
> never be called, permanently leaking the kref.
>
> Call sfp_bus_del_upstream on the error path after a successful
> sfp_bus_add_upstream to properly release the upstream reference.
>
> Cc: stable@vger.kernel.org
> Fixes: 298e54fa810e ("net: phy: add core phylib sfp support")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Andrew
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH v2] fix: net: renesas: rswitch_mii_register: fix double of_node_put after of_mdiobus_register
From: Andrew Lunn @ 2026-06-26 16:10 UTC (permalink / raw)
To: WenTao Liang
Cc: netdev, Yoshihiro Shimoda, David S . Miller, Jakub Kicinski,
Paolo Abeni, stable, linux-kernel
In-Reply-To: <20260626152550.51911-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:25:50PM +0800, WenTao Liang wrote:
> After of_mdiobus_register succeeds, the mdio_np reference ownership is
> transferred to the mii_bus device (released via fwnode_handle_put during
> mdiobus_release). The success path calls of_node_put(mdio_np) which,
> combined with the automatic release via bus teardown, results in a double
> put and refcount underflow.
>
> Move of_node_put so it is only called in the error path where
> of_mdiobus_register failed. On success, the bus driver manages the
> reference lifecycle.
Please stop with these patches.
First please read:
https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html
and
https://docs.kernel.org/process/submitting-patches.html
You are getting a lot of things wrong.
* don’t repost your patches within one 24h period
* Don't thread new versions of a patch to the old one
* Include version history, how is v2 different to v1
* When you see your own patch is broken, reply with NACK, and explain
what is wrong with it.
Until you learn how to correctly submit patches, please only submit
them one at a time, get it accepted, and move onto the next. Otherwise
you are wasting peoples time, and getting yourself a bad reputation.
Andrew
^ permalink raw reply
* Re: [PATCH] fix: net: cadence: macb_mii_init: fix double of_node_put on mdio_np after macb_mdiobus_register
From: Andrew Lunn @ 2026-06-26 16:02 UTC (permalink / raw)
To: WenTao Liang
Cc: Nicolas Ferre, netdev, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, stable, linux-kernel
In-Reply-To: <20260626151449.50969-1-vulab@iscas.ac.cn>
On Fri, Jun 26, 2026 at 11:14:49PM +0800, WenTao Liang wrote:
> After macb_mdiobus_register succeeds, the mdio_np reference ownership is
> transferred to the mii_bus device (stored in mii_bus->dev.of_node). When
> the subsequent macb_mii_probe fails, the error path jumps to
> err_out_unregister_bus which calls mdiobus_free (releasing the node via
> fwnode_handle_put) and then falls through to err_out which calls
> of_node_put(mdio_np) again, causing a double put.
>
> Move the of_node_put to only execute on paths where the reference was not
> transferred (i.e., before successful macb_mdiobus_register).
>
> Cc: stable@vger.kernel.org
> Fixes: ef8a2e27289e ("net: macb: fix probing of PHY not described in the dt")
> Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
> ---
> drivers/net/ethernet/cadence/macb_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index a12aa21244e8..c58e089e5888 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -1170,6 +1170,9 @@ static int macb_mii_init(struct macb *bp)
> mdiobus_unregister(bp->mii_bus);
> err_out_free_mdiobus:
> mdiobus_free(bp->mii_bus);
> + of_node_put(mdio_np);
> + return err;
> +
> err_out:
> of_node_put(mdio_np);
This does not look correct.
You say if mdiobus_register() is successful, mdiobus_free() will
release the reference.
If macb_mii_probe() fails, we have successfully done
mdiobus_register(). It does a goto err_out_unregister_bus, which calls
mdiobus_unregister(), mdiobus_free() releasing the reference, and then
with your patch of_node_put(). This looks like a double put to me.
I think i already said this once, consider the risk of your patches,
particularly if you cannot test them.
Andrew
^ permalink raw reply
* [PATCH net v2 2/2] net: ethernet: oa_tc6: Improvement in buffer overflow handling
From: Selvamani Rajagopal via B4 Relay @ 2026-06-26 15:35 UTC (permalink / raw)
To: Parthiban Veerasooran, Andrew Lunn, Piergiorgio Beruto,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: netdev, linux-kernel, Andrew Lunn, Parthiban Veerasooran,
Selvamani Rajagopal
In-Reply-To: <20260626-fix-race-condition-and-crash-v2-0-b6c5c10e604f@onsemi.com>
From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
When oversubscribed traffic causes lot of buffer overflow errors,
probably due to loss of data chunks, driver fails to find a
data chunk with end_valid bit set, before it runs out of sk buffer
space. As a result, assert is seen during skb_put.
Now check is made if tail + len > end, driver abandons the current
data and starts look for a data chunk with start_valid bit,
that is a new frame.
Fixes: d70a0d8f2f2d ("net: ethernet: oa_tc6: implement receive path to receive rx ethernet frames")
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
changes in v2
- Check rx_skb pointer before new allocation and NULL before use.
---
drivers/net/ethernet/oa_tc6.c | 69 +++++++++++++++++++++++++++++++------------
1 file changed, 50 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/oa_tc6.c b/drivers/net/ethernet/oa_tc6.c
index 3fd4851ee66d..c59daa032e70 100644
--- a/drivers/net/ethernet/oa_tc6.c
+++ b/drivers/net/ethernet/oa_tc6.c
@@ -692,6 +692,12 @@ static void oa_tc6_free_pending_skbs(struct oa_tc6 *tc6)
oa_tc6_cleanup_waiting_tx_skb(tc6);
}
+static void oa_tc6_look_for_new_frame(struct oa_tc6 *tc6)
+{
+ tc6->rx_buf_overflow = true;
+ oa_tc6_cleanup_ongoing_rx_skb(tc6);
+}
+
/* If the failure is at SPI interface level, masking and clearing
* the interrupt of the device won't work. Since SPI interrupt is
* disabled, it should stop the repeated interrupts.
@@ -729,8 +735,7 @@ static int oa_tc6_process_extended_status(struct oa_tc6 *tc6)
}
if (FIELD_GET(STATUS0_RX_BUFFER_OVERFLOW_ERROR, value)) {
- tc6->rx_buf_overflow = true;
- oa_tc6_cleanup_ongoing_rx_skb(tc6);
+ oa_tc6_look_for_new_frame(tc6);
net_err_ratelimited("%s: Receive buffer overflow error\n",
tc6->netdev->name);
return -EAGAIN;
@@ -811,13 +816,35 @@ static void oa_tc6_submit_rx_skb(struct oa_tc6 *tc6)
tc6->rx_skb = NULL;
}
-static void oa_tc6_update_rx_skb(struct oa_tc6 *tc6, u8 *payload, u8 length)
+/* On oversubscribed traffic condition, particularly with overwhelming rx
+ * buffer overflow errors, there could be data chunk loss. If tail + length
+ * goes beyond end pointer, that is an indication that the data chunk with
+ * end_valid bit is lost. Time to look for a data chunk with start_valid bit.
+ *
+ * If rx_skb is NULL, it is time to start looking for data chunk with
+ * start_bit.
+ */
+static int oa_tc6_update_rx_skb(struct oa_tc6 *tc6, u8 *payload, u8 length)
{
+ if (!tc6->rx_skb ||
+ (tc6->rx_skb->tail + length) > tc6->rx_skb->end) {
+ oa_tc6_look_for_new_frame(tc6);
+ return -EAGAIN;
+ }
+
memcpy(skb_put(tc6->rx_skb, length), payload, length);
+ return 0;
}
+/* On overwhelming rx buffer overflow errors, due to data chunk loss, it is
+ * possible that we get two data chunks with start_valid bit set, without
+ * end_valid bit set in between. In this case, rx_skb would have a valid
+ * buffer pointer. We should release, if a valid pointer is found before
+ * allocating a new one.
+ */
static int oa_tc6_allocate_rx_skb(struct oa_tc6 *tc6)
{
+ oa_tc6_cleanup_ongoing_rx_skb(tc6);
tc6->rx_skb = netdev_alloc_skb_ip_align(tc6->netdev, tc6->netdev->mtu +
ETH_HLEN + ETH_FCS_LEN);
if (!tc6->rx_skb) {
@@ -837,7 +864,9 @@ static int oa_tc6_prcs_complete_rx_frame(struct oa_tc6 *tc6, u8 *payload,
if (ret)
return ret;
- oa_tc6_update_rx_skb(tc6, payload, size);
+ ret = oa_tc6_update_rx_skb(tc6, payload, size);
+ if (ret)
+ return ret;
oa_tc6_submit_rx_skb(tc6);
@@ -852,22 +881,24 @@ static int oa_tc6_prcs_rx_frame_start(struct oa_tc6 *tc6, u8 *payload, u16 size)
if (ret)
return ret;
- oa_tc6_update_rx_skb(tc6, payload, size);
-
- return 0;
+ return oa_tc6_update_rx_skb(tc6, payload, size);
}
-static void oa_tc6_prcs_rx_frame_end(struct oa_tc6 *tc6, u8 *payload, u16 size)
+static int oa_tc6_prcs_rx_frame_end(struct oa_tc6 *tc6, u8 *payload, u16 size)
{
- oa_tc6_update_rx_skb(tc6, payload, size);
+ int ret;
- oa_tc6_submit_rx_skb(tc6);
+ ret = oa_tc6_update_rx_skb(tc6, payload, size);
+ if (!ret)
+ oa_tc6_submit_rx_skb(tc6);
+ return ret;
}
-static void oa_tc6_prcs_ongoing_rx_frame(struct oa_tc6 *tc6, u8 *payload,
- u32 footer)
+static int oa_tc6_prcs_ongoing_rx_frame(struct oa_tc6 *tc6, u8 *payload,
+ u32 footer)
{
- oa_tc6_update_rx_skb(tc6, payload, OA_TC6_CHUNK_PAYLOAD_SIZE);
+ return oa_tc6_update_rx_skb(tc6, payload,
+ OA_TC6_CHUNK_PAYLOAD_SIZE);
}
static int oa_tc6_prcs_rx_chunk_payload(struct oa_tc6 *tc6, u8 *data,
@@ -880,6 +911,7 @@ static int oa_tc6_prcs_rx_chunk_payload(struct oa_tc6 *tc6, u8 *data,
bool start_valid = FIELD_GET(OA_TC6_DATA_FOOTER_START_VALID, footer);
bool end_valid = FIELD_GET(OA_TC6_DATA_FOOTER_END_VALID, footer);
u16 size;
+ int ret;
/* Restart the new rx frame after receiving rx buffer overflow error */
if (start_valid && tc6->rx_buf_overflow)
@@ -907,8 +939,7 @@ static int oa_tc6_prcs_rx_chunk_payload(struct oa_tc6 *tc6, u8 *data,
/* Process the chunk with only rx frame end */
if (end_valid && !start_valid) {
size = end_byte_offset + 1;
- oa_tc6_prcs_rx_frame_end(tc6, data, size);
- return 0;
+ return oa_tc6_prcs_rx_frame_end(tc6, data, size);
}
/* Process the chunk with previous rx frame end and next rx frame
@@ -921,7 +952,9 @@ static int oa_tc6_prcs_rx_chunk_payload(struct oa_tc6 *tc6, u8 *data,
*/
if (tc6->rx_skb) {
size = end_byte_offset + 1;
- oa_tc6_prcs_rx_frame_end(tc6, data, size);
+ ret = oa_tc6_prcs_rx_frame_end(tc6, data, size);
+ if (ret)
+ return ret;
}
size = OA_TC6_CHUNK_PAYLOAD_SIZE - start_byte_offset;
return oa_tc6_prcs_rx_frame_start(tc6,
@@ -930,9 +963,7 @@ static int oa_tc6_prcs_rx_chunk_payload(struct oa_tc6 *tc6, u8 *data,
}
/* Process the chunk with ongoing rx frame data */
- oa_tc6_prcs_ongoing_rx_frame(tc6, data, footer);
-
- return 0;
+ return oa_tc6_prcs_ongoing_rx_frame(tc6, data, footer);
}
static u32 oa_tc6_get_rx_chunk_footer(struct oa_tc6 *tc6, u16 footer_offset)
--
2.43.0
^ permalink raw reply related
* [PATCH net v2 1/2] net: ethernet: oa_tc6: Protect skb pointer used by two different kernel instances
From: Selvamani Rajagopal via B4 Relay @ 2026-06-26 15:35 UTC (permalink / raw)
To: Parthiban Veerasooran, Andrew Lunn, Piergiorgio Beruto,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: netdev, linux-kernel, Andrew Lunn, Parthiban Veerasooran,
Selvamani Rajagopal
In-Reply-To: <20260626-fix-race-condition-and-crash-v2-0-b6c5c10e604f@onsemi.com>
From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
Threaded IRQ uses waiting_tx_skb. Transmit path also uses
this pointer without any mutual exclusion protection. As a
result, it might leak skb buffer, particularly threaded IRQ
runs in the middle of tranmsmit path, near skb_linearize.
Fixes: b542d13fab0f ("net: ethernet: oa_tc6: Interrupt is active low, level triggered.")
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
changes in v2:
added the missing prefix to the title
---
drivers/net/ethernet/oa_tc6.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/oa_tc6.c b/drivers/net/ethernet/oa_tc6.c
index 0727d53345a3..3fd4851ee66d 100644
--- a/drivers/net/ethernet/oa_tc6.c
+++ b/drivers/net/ethernet/oa_tc6.c
@@ -672,10 +672,16 @@ static void oa_tc6_cleanup_ongoing_tx_skb(struct oa_tc6 *tc6)
static void oa_tc6_cleanup_waiting_tx_skb(struct oa_tc6 *tc6)
{
- if (tc6->waiting_tx_skb) {
+ struct sk_buff *skb;
+
+ spin_lock_bh(&tc6->tx_skb_lock);
+ skb = tc6->waiting_tx_skb;
+ tc6->waiting_tx_skb = NULL;
+ spin_unlock_bh(&tc6->tx_skb_lock);
+
+ if (skb) {
tc6->netdev->stats.tx_dropped++;
- kfree_skb(tc6->waiting_tx_skb);
- tc6->waiting_tx_skb = NULL;
+ kfree_skb(skb);
}
}
@@ -1250,11 +1256,6 @@ EXPORT_SYMBOL_GPL(oa_tc6_zero_align_receive_frame_enable);
*/
netdev_tx_t oa_tc6_start_xmit(struct oa_tc6 *tc6, struct sk_buff *skb)
{
- if (tc6->disable_traffic || tc6->waiting_tx_skb) {
- netif_stop_queue(tc6->netdev);
- return NETDEV_TX_BUSY;
- }
-
if (skb_linearize(skb)) {
dev_kfree_skb_any(skb);
tc6->netdev->stats.tx_dropped++;
@@ -1262,6 +1263,11 @@ netdev_tx_t oa_tc6_start_xmit(struct oa_tc6 *tc6, struct sk_buff *skb)
}
spin_lock_bh(&tc6->tx_skb_lock);
+ if (tc6->disable_traffic || tc6->waiting_tx_skb) {
+ netif_stop_queue(tc6->netdev);
+ spin_unlock_bh(&tc6->tx_skb_lock);
+ return NETDEV_TX_BUSY;
+ }
tc6->waiting_tx_skb = skb;
spin_unlock_bh(&tc6->tx_skb_lock);
--
2.43.0
^ permalink raw reply related
* [PATCH net v2 0/2] Fix to possible skb leak due to race condtion in tx path
From: Selvamani Rajagopal via B4 Relay @ 2026-06-26 15:35 UTC (permalink / raw)
To: Parthiban Veerasooran, Andrew Lunn, Piergiorgio Beruto,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: netdev, linux-kernel, Andrew Lunn, Parthiban Veerasooran,
Selvamani Rajagopal
Now the traffic is handled in threaded IRQ, and the
disable_traffic flag is checked before handling the
data, new race condition is exposed, in which
buffer may leak, if threaded IRQ interrupts the
trasmit path midway.
With this change, disable_traffic and waiting_tx_skb
pointer are protected by spin lock/unlock pair.
This is highlighted in Sashiko review
https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260611-level-trigger-v5-0-4533a9e85ce2%40onsemi.com
Also on buffer overrun condition, probably due to loss of
SPI data chunks, receive path doesn't see the expected
data chunk with end_valid bit set. As a result, driver
keeps adding data chunks to the skb before running out
of space and kernel panic is seen.
With this change, before adding data to the skb, if there
is no space, skb is freed and driver starts looking for
new frame by looking for a data chunk with start_valid
bit set.
[ 705.405490] skbuff: skb_over_panic: text:ffffffd2eb72a264 len:1600 put:64 head:ffffff804e5cdc40 data:ffffff804e5cdc80 tail:0x680 end:0x640 dev:eth1
[ 705.405569] ------------[ cut here ]------------
[ 705.405575] kernel BUG at net/core/skbuff.c:214!
[ 705.405589] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 6703.427690] Call trace:
[ 705.925157] skb_panic+0x58/0x68 (P)
[ 705.928726] skb_put+0x74/0x80
[ 705.931772] oa_tc6_update_rx_skb+0x44/0x98 [oa_tc6_mod]
[ 705.937084] oa_tc6_macphy_threaded_irq+0x3f4/0x900 [oa_tc6_mod]
[ 705.943084] irq_thread_fn+0x34/0xb8
[ 705.946654] irq_thread+0x1a0/0x300
[ 705.950134] kthread+0x138/0x150
[ 705.953356] ret_from_fork+0x10/0x20
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
---
Changes in v2:
- Improvment to how error -EAGAIN is handled. Took care of
couple of use cases where start_bit and end_bit may be missing or
repeated due to lost data chunks.
- Protected handling of waiting_tx_skb pointer with spin lock
- Link to v1: https://lore.kernel.org/r/20260621-fix-race-condition-and-crash-v1-0-87e290d9357f@onsemi.com
---
Selvamani Rajagopal (2):
net: ethernet: oa_tc6: Protect skb pointer used by two different kernel instances
net: ethernet: oa_tc6: Improvement in buffer overflow handling
drivers/net/ethernet/oa_tc6.c | 91 ++++++++++++++++++++++++++++++-------------
1 file changed, 64 insertions(+), 27 deletions(-)
---
base-commit: 805185b7c7a1069e407b6f7b3bc98e44d415f484
change-id: 20260621-fix-race-condition-and-crash-94d055a665c4
Best regards,
--
Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>
^ permalink raw reply
* [PATCH] fix: net: phy: phy_sfp_probe: fix kref leak when phy_setup_sfp_port fails after sfp_bus_add_upstream
From: WenTao Liang @ 2026-06-26 15:34 UTC (permalink / raw)
To: Andrew Lunn, netdev
Cc: Heiner Kallweit, Russell King, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, stable, linux-kernel, WenTao Liang
sfp_bus_add_upstream unconditionally acquires a kref on the SFP bus. When
this call succeeds but the subsequent phy_setup_sfp_port fails, the error
path returns without releasing the upstream reference. Since probe fails,
the device's remove function (which would normally clean this up) will
never be called, permanently leaking the kref.
Call sfp_bus_del_upstream on the error path after a successful
sfp_bus_add_upstream to properly release the upstream reference.
Cc: stable@vger.kernel.org
Fixes: 298e54fa810e ("net: phy: add core phylib sfp support")
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
---
drivers/net/phy/phy_device.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3370eb822017..cd62c46de017 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1723,6 +1723,11 @@ static int phy_sfp_probe(struct phy_device *phydev)
if (!ret && phydev->sfp_bus)
ret = phy_setup_sfp_port(phydev);
+ if (ret && phydev->sfp_bus) {
+ sfp_bus_del_upstream(phydev->sfp_bus);
+ phydev->sfp_bus = NULL;
+ }
+
return ret;
}
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox