public inbox for netdev@vger.kernel.org
* [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
@ 2026-03-07 21:20 Jamal Hadi Salim
  2026-03-11  1:47 ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-07 21:20 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, jiri, toke, vinicius.gomes,
	stephen, vladbu, cake, bpf, ghandatmanas, km.kim1503, security,
	Jamal Hadi Salim, Victor Nogueira

In tc_new_tfilter(), __tcf_qdisc_find() is always called without holding
the rtnl_lock. This causes a bug in the following scenario.

- The user attaches a multiqueue qdisc to the root with handle ffff0000
- The user attaches a child multiqueue qdisc under class ffff0002 with
  handle 0x10000
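
For reference, the setup above corresponds roughly to the following tc
invocations (a sketch only: the device name is an assumption, and the
handles are the maj:min spelling of the raw u32 values quoted above):

```shell
# Assumed multiqueue device; substitute your own.
DEV=eth0

# Root multiqueue qdisc with handle ffff0000 (tc maj:min syntax: ffff:)
tc qdisc add dev "$DEV" root handle ffff: multiq

# Child multiqueue qdisc under class ffff0002 (ffff:2), handle 0x10000 (1:)
tc qdisc add dev "$DEV" parent ffff:2 handle 1: multiq
```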

After that parallel threads execute the following:

time x: On cpux: "qdisc del" enters for the root qdisc, holding rtnl
(example refcnt = 1), see tc_get_qdisc()
time x+1: On cpuy: "filter add" enters and increments the refcnt to 2 after
finding the root qdisc (see __tcf_qdisc_find)
time x+2: On cpux: "qdisc del" calls qdisc_put, which decrements the refcnt
to 1 and doesn't delete the qdisc because the refcnt is not 0; "qdisc del"
exits
time x+3: On cpuz: "qdisc add" enters for htb, reusing handle ffff0000 (the
deleted root qdisc's handle), and adds it with parent 0x10001 (which is the
major of the deleted root qdisc's child) whilst "filter add" is still
executing.
time x+4: On cpuz: "qdisc add" calls qdisc_tree_reduce_backlog.
Before qdisc_tree_reduce_backlog is called, the newly created qdisc is added
to the qdisc hashtable. Since its parent is 0x10001,
qdisc_tree_reduce_backlog will find the previous child qdisc (which had
0xffff0002 as its parent). So when qdisc_tree_reduce_backlog looks up
0x10000's parent, it will find the newly created qdisc (from step x+3). This
results in the notify callback being invoked for the newly created qdisc.
Since the class it passes as a parameter (0xffff0002) to the notify
callback isn't a child of the newly created qdisc, this causes a
segfault (at time x+4):

[   89.555574][  T337] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000035: 0000 [#1] SMP KASAN NOPTI
[   89.556410][  T337] KASAN: null-ptr-deref in range [0x00000000000001a8-0x00000000000001af]
[   89.556737][  T337] CPU: 5 UID: 0 PID: 337 Comm: poc_manas_null_ Not tainted 7.0.0-rc1-00147-g9439a661c2e8 #604 PREEMPT(full)
...
[   89.557404][  T337] RIP: 0010:htb_qlen_notify (net/sched/sch_htb.c:613 net/sched/sch_htb.c:1490)
[   89.557614][  T337] Code: 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8d 96 a8 01 00 00 48 89 f9 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 d7 48 c1 ef 03 <0f> b6 04 07 84 c0 74 04 3c 03 7e 61 8b 86 a8 01 00 00 85 c0 75 09
[   89.558291][  T337] RSP: 0018:ffff8880200df308 EFLAGS: 00010216
[   89.558530][  T337] RAX: dffffc0000000000 RBX: ffff888004484000 RCX: ffff888004484000
[   89.558818][  T337] RDX: 00000000000001a8 RSI: 0000000000000000 RDI: 0000000000000035
[   89.559096][  T337] RBP: dffffc0000000000 R08: 1ffff11000890850 R09: 00000000a1c8e417
[   89.559374][  T337] R10: 0000000000000001 R11: 00000000d51664f0 R12: 0000000000000000
[   89.559657][  T337] R13: 0000000000000000 R14: 00000000ffff0002 R15: ffffffff9e5c6cc0
[   89.559942][  T337] FS:  00007ff004df66c0(0000) GS:ffff88809543b000(0000) knlGS:0000000000000000
[   89.560273][  T337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   89.560507][  T337] CR2: 000055ce40d39158 CR3: 0000000016275000 CR4: 0000000000750ef0
[   89.560797][  T337] PKRU: 55555554
[   80.405129] Call Trace:
[   80.405430]  <TASK>
[   89.561184][  T337]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[   89.561382][  T337]  ? __pfx_htb_qlen_notify (net/sched/sch_htb.c:1487)
[   89.561570][  T337]  qdisc_tree_reduce_backlog (net/sched/sch_api.c:806)
[   89.561770][  T337]  multiq_graft (./include/net/sch_generic.h:1007 ./include/net/sch_generic.h:1283 net/sched/sch_multiq.c:289)
[   89.561959][  T337]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[   89.562140][  T337]  ? __pfx_multiq_graft (net/sched/sch_multiq.c:282)
[   89.562294][  T337]  ? qdisc_alloc (./include/linux/refcount.h:134 net/sched/sch_generic.c:984)
[   89.562461][  T337]  ? qdisc_create (net/sched/sch_api.c:1263)
[   89.562613][  T337]  ? __pfx_htb_init (net/sched/sch_htb.c:1052)
[   89.562786][  T337]  ? netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
[   89.562941][  T337]  qdisc_graft (net/sched/sch_api.c:1197)
[   89.563098][  T337]  ? __pfx_qdisc_graft (net/sched/sch_api.c:1091)
[   89.563293][  T337]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[   89.563459][  T337]  ? qdisc_hash_add (net/sched/sch_api.c:285 (discriminator 2) net/sched/sch_api.c:282 (discriminator 2))
[   89.563618][  T337]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[   89.563780][  T337]  ? qdisc_create (./include/trace/events/qdisc.h:127 (discriminator 33) net/sched/sch_api.c:1344 (discriminator 33))
[   89.563950][  T337]  tc_modify_qdisc (net/sched/sch_api.c:1762 net/sched/sch_api.c:1817)

Fix this by putting the qdisc in an intermediate state whenever qdisc_graft
cannot delete it because it is being accessed in parallel by a classifier.
We accomplish this by creating a new qdisc op (mark_for_del), which sets
the TCQ_F_MARK_FOR_DEL flag on the qdisc and all of its descendants in
the tree. This op is mandatory for all qdiscs that perform grafting.
While the qdisc is in this intermediate state, any lookup of it returns
ERR_PTR(-EBUSY).

A similar issue was fixed for the ingress side in
https://lore.kernel.org/netdev/c1f67078dc8a3fd7b3c8ed65896c726d1e9b261e.1686355297.git.peilin.ye@bytedance.com/

Note: We tried a couple of different approaches with a smaller code
footprint, but they were a bit fugly. The first approach was to recurse
over the qdisc hash table to iterate the descendants of the qdisc; the
challenge here is that if the graph depth is "high" we may overflow the
stack. The second approach was a breadth-first search to achieve the
same goal; the challenge there was that the algorithm was quadratic.

Fixes: 470502de5bdb ("net: sched: unlock rules update API")
Reported-by: Manas Ghandat <ghandatmanas@gmail.com>
Reported-by: GangMin Kim <km.kim1503@gmail.com>
Closes: https://lore.kernel.org/netdev/CAGfirffHzSjmjNGx1ZU+JNLmrUYWEukrPAP5nTbJdemn6MZGyQ@mail.gmail.com/
Closes: https://lore.kernel.org/netdev/CAGfirfcsAFODycGarmqY8v6HSiHVaBgyKSUucV0DvPDjVX0U8Q@mail.gmail.com/
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/sch_generic.h | 14 ++++++-
 include/net/sch_priv.h    |  1 +
 net/sched/bpf_qdisc.c     |  5 +++
 net/sched/cls_api.c       | 11 ++++--
 net/sched/sch_api.c       | 79 +++++++++++++++++++++++++++++++++++----
 net/sched/sch_cake.c      |  1 +
 net/sched/sch_cbs.c       |  8 ++++
 net/sched/sch_drr.c       | 15 ++++++++
 net/sched/sch_ets.c       | 10 +++++
 net/sched/sch_hfsc.c      | 14 +++++++
 net/sched/sch_htb.c       | 22 +++++++++++
 net/sched/sch_mq.c        | 14 +++++++
 net/sched/sch_mqprio.c    | 15 ++++++++
 net/sched/sch_multiq.c    | 11 ++++++
 net/sched/sch_netem.c     |  8 ++++
 net/sched/sch_prio.c      | 11 ++++++
 net/sched/sch_qfq.c       | 14 +++++++
 net/sched/sch_red.c       |  8 ++++
 net/sched/sch_sfb.c       |  8 ++++
 net/sched/sch_taprio.c    | 12 ++++++
 net/sched/sch_tbf.c       |  8 ++++
 21 files changed, 277 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d5d55cb21686..5f0ec8e02d2b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -89,6 +89,7 @@ struct Qdisc {
 #define TCQ_F_NOLOCK		0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED		0x200 /* qdisc is offloaded to HW */
 #define TCQ_F_DEQUEUE_DROPS	0x400 /* ->dequeue() can drop packets in q->to_free */
+#define TCQ_F_MARK_FOR_DEL	0x800 /* Is marked for deletion */
 
 	u32			limit;
 	const struct Qdisc_ops	*ops;
@@ -328,7 +329,7 @@ struct Qdisc_ops {
 
 	int			(*dump)(struct Qdisc *, struct sk_buff *);
 	int			(*dump_stats)(struct Qdisc *, struct gnet_dump *);
-
+	void			(*mark_for_del)(struct Qdisc *sch);
 	void			(*ingress_block_set)(struct Qdisc *sch,
 						     u32 block_index);
 	void			(*egress_block_set)(struct Qdisc *sch,
@@ -740,6 +741,17 @@ qdisc_offload_graft_helper(struct net_device *dev, struct Qdisc *sch,
 {
 }
 #endif
+
+static inline void qdisc_mark_for_del(struct Qdisc *sch)
+{
+	if (!sch)
+		return;
+
+	sch->flags |= TCQ_F_MARK_FOR_DEL;
+	if (sch->ops->mark_for_del)
+		sch->ops->mark_for_del(sch);
+}
+
 void qdisc_offload_query_caps(struct net_device *dev,
 			      enum tc_setup_type type,
 			      void *caps, size_t caps_len);
diff --git a/include/net/sch_priv.h b/include/net/sch_priv.h
index 4789f668ae87..4c874d1d5e7c 100644
--- a/include/net/sch_priv.h
+++ b/include/net/sch_priv.h
@@ -12,6 +12,7 @@ int mq_init_common(struct Qdisc *sch, struct nlattr *opt,
 		   struct netlink_ext_ack *extack,
 		   const struct Qdisc_ops *qdisc_ops);
 void mq_destroy_common(struct Qdisc *sch);
+void mq_mark_for_del_common(struct Qdisc *sch);
 void mq_attach(struct Qdisc *sch);
 void mq_dump_common(struct Qdisc *sch, struct sk_buff *skb);
 struct netdev_queue *mq_select_queue(struct Qdisc *sch,
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 098ca02aed89..e40d84ddf4a8 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -246,6 +246,11 @@ __bpf_kfunc int bpf_qdisc_init_prologue(struct Qdisc *sch,
 		 * has not been added to qdisc_hash yet.
 		 */
 		p = qdisc_lookup(dev, TC_H_MAJ(sch->parent));
+		if (IS_ERR(p)) {
+			NL_SET_ERR_MSG(extack,
+				       "BPF Qdisc is being deleted in parallel");
+			return PTR_ERR(p);
+		}
 		if (p && !(p->flags & TCQ_F_MQROOT)) {
 			NL_SET_ERR_MSG(extack, "BPF qdisc only supported on root or mq");
 			return -EINVAL;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4829c27446e3..fc0c0d94d7ac 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1203,6 +1203,12 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
 		*parent = (*q)->handle;
 	} else {
 		*q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
+		if (IS_ERR(*q)) {
+			NL_SET_ERR_MSG(extack,
+				       "Parent Qdisc is being deleted in parallel");
+			err = PTR_ERR(*q);
+			goto errout_rcu;
+		}
 		if (!*q) {
 			NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
 			err = -EINVAL;
@@ -2895,7 +2901,7 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 			q = rtnl_dereference(dev->qdisc);
 		else
 			q = qdisc_lookup(dev, TC_H_MAJ(tcm->tcm_parent));
-		if (!q)
+		if (IS_ERR_OR_NULL(q))
 			goto out;
 		cops = q->ops->cl_ops;
 		if (!cops)
@@ -3278,8 +3284,7 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 			q = rtnl_dereference(dev->qdisc);
 		else
 			q = qdisc_lookup(dev, TC_H_MAJ(tcm->tcm_parent));
-
-		if (!q)
+		if (IS_ERR_OR_NULL(q))
 			goto out;
 		cops = q->ops->cl_ops;
 		if (!cops)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index cc43e3f7574f..fa126309c06b 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -159,6 +159,12 @@ int register_qdisc(struct Qdisc_ops *qops)
 
 		if (cops->tcf_block && !(cops->bind_tcf && cops->unbind_tcf))
 			goto out_einval;
+
+		/* If a qdisc can have children, it might need to mark them
+		 * in case there are parallel classifier ops
+		 */
+		if (cops->graft && !(qops->mark_for_del))
+			goto out_einval;
 	}
 
 	qops->next = NULL;
@@ -264,17 +270,30 @@ static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
 {
 	struct Qdisc *q;
 
-	if (!qdisc_dev(root))
-		return (root->handle == handle ? root : NULL);
+	if (!qdisc_dev(root)) {
+		if (root->handle == handle) {
+			if (root->flags & TCQ_F_MARK_FOR_DEL)
+				return ERR_PTR(-EBUSY);
+			return root;
+		}
+		return NULL;
+	}
 
 	if (!(root->flags & TCQ_F_BUILTIN) &&
-	    root->handle == handle)
+	    root->handle == handle) {
+		if (root->flags & TCQ_F_MARK_FOR_DEL)
+			return ERR_PTR(-EBUSY);
+
 		return root;
+	}
 
 	hash_for_each_possible_rcu(qdisc_dev(root)->qdisc_hash, q, hash, handle,
 				   lockdep_rtnl_is_held()) {
-		if (q->handle == handle)
+		if (q->handle == handle) {
+			if (q->flags & TCQ_F_MARK_FOR_DEL)
+				return ERR_PTR(-EBUSY);
 			return q;
+		}
 	}
 	return NULL;
 }
@@ -793,8 +812,8 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
 		notify = !sch->q.qlen;
 		/* TODO: perform the search on a per txq basis */
 		sch = qdisc_lookup_rcu(qdisc_dev(sch), TC_H_MAJ(parentid));
-		if (sch == NULL) {
-			WARN_ON_ONCE(parentid != TC_H_ROOT);
+		if (sch == NULL || IS_ERR(sch)) {
+			WARN_ON_ONCE(parentid != TC_H_ROOT && sch == NULL);
 			break;
 		}
 		cops = sch->ops->cl_ops;
@@ -985,6 +1004,8 @@ static bool tc_qdisc_dump_ignore(struct Qdisc *q, bool dump_invisible)
 		return true;
 	if ((q->flags & TCQ_F_INVISIBLE) && !dump_invisible)
 		return true;
+	if ((q->flags & TCQ_F_MARK_FOR_DEL))
+		return true;
 
 	return false;
 }
@@ -1050,6 +1071,17 @@ static int qdisc_notify(struct net *net, struct sk_buff *oskb,
 	return -EINVAL;
 }
 
+static void qdisc_put_mark_for_del(struct Qdisc *sch)
+{
+	if (sch->flags & TCQ_F_BUILTIN)
+		return;
+
+	if (refcount_dec_and_test(&sch->refcnt))
+		qdisc_destroy(sch);
+	else
+		qdisc_mark_for_del(sch);
+}
+
 static void notify_and_destroy(struct net *net, struct sk_buff *skb,
 			       struct nlmsghdr *n, u32 clid,
 			       struct Qdisc *old, struct Qdisc *new,
@@ -1059,7 +1091,7 @@ static void notify_and_destroy(struct net *net, struct sk_buff *skb,
 		qdisc_notify(net, skb, n, clid, old, new, extack);
 
 	if (old)
-		qdisc_put(old);
+		qdisc_put_mark_for_del(old);
 }
 
 static void qdisc_clear_nolock(struct Qdisc *sch)
@@ -1157,7 +1189,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			rcu_assign_pointer(dev->qdisc, new ? : &noop_qdisc);
 
 			notify_and_destroy(net, skb, n, classid, old, new, extack);
-
 			if (new && new->ops->attach)
 				new->ops->attach(new);
 		}
@@ -1481,6 +1512,11 @@ static int __tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 		if (clid != TC_H_ROOT) {
 			if (TC_H_MAJ(clid) != TC_H_MAJ(TC_H_INGRESS)) {
 				p = qdisc_lookup(dev, TC_H_MAJ(clid));
+				if (IS_ERR(p)) {
+					NL_SET_ERR_MSG(extack,
+						       "Qdisc is being deleted in parallel");
+					return PTR_ERR(p);
+				}
 				if (!p) {
 					NL_SET_ERR_MSG(extack, "Failed to find qdisc with specified classid");
 					return -ENOENT;
@@ -1505,6 +1541,11 @@ static int __tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 		}
 	} else {
 		q = qdisc_lookup(dev, tcm->tcm_handle);
+		if (IS_ERR(q)) {
+			NL_SET_ERR_MSG(extack,
+				       "Qdisc is being deleted in parallel");
+			return PTR_ERR(q);
+		}
 		if (!q) {
 			NL_SET_ERR_MSG(extack, "Failed to find qdisc with specified handle");
 			return -ENOENT;
@@ -1595,6 +1636,11 @@ static int __tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 		if (clid != TC_H_ROOT) {
 			if (clid != TC_H_INGRESS) {
 				p = qdisc_lookup(dev, TC_H_MAJ(clid));
+				if (IS_ERR(p)) {
+					NL_SET_ERR_MSG(extack,
+						       "Qdisc is being deleted in parallel");
+					return PTR_ERR(p);
+				}
 				if (!p) {
 					NL_SET_ERR_MSG(extack, "Failed to find specified qdisc");
 					return -ENOENT;
@@ -1629,6 +1675,11 @@ static int __tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 					return -EINVAL;
 				}
 				q = qdisc_lookup(dev, tcm->tcm_handle);
+				if (IS_ERR(q)) {
+					NL_SET_ERR_MSG(extack,
+						       "Qdisc is being deleted in parallel");
+					return PTR_ERR(q);
+				}
 				if (!q)
 					goto create_n_graft;
 				if (q->parent != tcm->tcm_parent) {
@@ -1705,6 +1756,11 @@ static int __tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -EINVAL;
 		}
 		q = qdisc_lookup(dev, tcm->tcm_handle);
+		if (IS_ERR(q)) {
+			NL_SET_ERR_MSG(extack,
+				       "Qdisc is being deleted in parallel");
+			return PTR_ERR(q);
+		}
 	}
 
 	/* Change qdisc parameters */
@@ -2218,6 +2274,11 @@ static int __tc_ctl_tclass(struct sk_buff *skb, struct nlmsghdr *n,
 
 	/* OK. Locate qdisc */
 	q = qdisc_lookup(dev, qid);
+	if (IS_ERR(q)) {
+		NL_SET_ERR_MSG(extack,
+			       "Qdisc is being deleted in parallel");
+		return PTR_ERR(q);
+	}
 	if (!q)
 		return -ENOENT;
 
@@ -2375,6 +2436,8 @@ static int tc_dump_tclass_root(struct Qdisc *root, struct sk_buff *skb,
 
 	if (tcm->tcm_parent) {
 		q = qdisc_match_from_root(root, TC_H_MAJ(tcm->tcm_parent));
+		if (IS_ERR(q))
+			return 0;
 		if (q && q != root &&
 		    tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
 			return -1;
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 9efe23f8371b..a92ce78c733f 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -3341,6 +3341,7 @@ static struct Qdisc_ops cake_mq_qdisc_ops __read_mostly = {
 	.change		=	cake_mq_change,
 	.change_real_num_tx = mq_change_real_num_tx,
 	.dump		=	cake_mq_dump,
+	.mark_for_del	=	mq_mark_for_del_common,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("cake_mq");
diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
index 8c9a0400c862..23edb020a0de 100644
--- a/net/sched/sch_cbs.c
+++ b/net/sched/sch_cbs.c
@@ -449,6 +449,13 @@ static void cbs_destroy(struct Qdisc *sch)
 	qdisc_put(q->qdisc);
 }
 
+static void cbs_mark_for_del(struct Qdisc *sch)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	qdisc_mark_for_del(q->qdisc);
+}
+
 static int cbs_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct cbs_sched_data *q = qdisc_priv(sch);
@@ -544,6 +551,7 @@ static struct Qdisc_ops cbs_qdisc_ops __read_mostly = {
 	.destroy	=	cbs_destroy,
 	.change		=	cbs_change,
 	.dump		=	cbs_dump,
+	.mark_for_del	=	cbs_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("cbs");
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 01335a49e091..8d261d2c6548 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -458,6 +458,20 @@ static void drr_destroy_qdisc(struct Qdisc *sch)
 	qdisc_class_hash_destroy(&q->clhash);
 }
 
+static void drr_mark_qdisc_for_del(struct Qdisc *sch)
+{
+	struct drr_sched *q = qdisc_priv(sch);
+	struct hlist_node *next;
+	struct drr_class *cl;
+	int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
+					  common.hnode)
+			qdisc_mark_for_del(cl->qdisc);
+	}
+}
+
 static const struct Qdisc_class_ops drr_class_ops = {
 	.change		= drr_change_class,
 	.delete		= drr_delete_class,
@@ -483,6 +497,7 @@ static struct Qdisc_ops drr_qdisc_ops __read_mostly = {
 	.init		= drr_init_qdisc,
 	.reset		= drr_reset_qdisc,
 	.destroy	= drr_destroy_qdisc,
+	.mark_for_del	= drr_mark_qdisc_for_del,
 	.owner		= THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("drr");
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index a4b07b661b77..93a88e181ffb 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -742,6 +742,15 @@ static void ets_qdisc_destroy(struct Qdisc *sch)
 		qdisc_put(q->classes[band].qdisc);
 }
 
+static void ets_qdisc_mark_for_del(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band;
+
+	for (band = 0; band < q->nbands; band++)
+		qdisc_mark_for_del(q->classes[band].qdisc);
+}
+
 static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct ets_sched *q = qdisc_priv(sch);
@@ -827,6 +836,7 @@ static struct Qdisc_ops ets_qdisc_ops __read_mostly = {
 	.reset		= ets_qdisc_reset,
 	.destroy	= ets_qdisc_destroy,
 	.dump		= ets_qdisc_dump,
+	.mark_for_del	= ets_qdisc_mark_for_del,
 	.owner		= THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("ets");
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index b5657ffbbf84..a008c8db532c 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1517,6 +1517,19 @@ hfsc_destroy_qdisc(struct Qdisc *sch)
 	qdisc_watchdog_cancel(&q->watchdog);
 }
 
+static void
+hfsc_mark_qdisc_for_del(struct Qdisc *sch)
+{
+	struct hfsc_sched *q = qdisc_priv(sch);
+	struct hfsc_class *cl;
+	unsigned int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], cl_common.hnode)
+			qdisc_mark_for_del(cl->qdisc);
+	}
+}
+
 static int
 hfsc_dump_qdisc(struct Qdisc *sch, struct sk_buff *skb)
 {
@@ -1681,6 +1694,7 @@ static struct Qdisc_ops hfsc_qdisc_ops __read_mostly = {
 	.dequeue	= hfsc_dequeue,
 	.peek		= qdisc_peek_dequeued,
 	.cl_ops		= &hfsc_class_ops,
+	.mark_for_del	= hfsc_mark_qdisc_for_del,
 	.priv_size	= sizeof(struct hfsc_sched),
 	.owner		= THIS_MODULE
 };
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index cf6cd4ccfa20..6f740c82721e 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1689,6 +1689,27 @@ static void htb_destroy(struct Qdisc *sch)
 	kfree(q->direct_qdiscs);
 }
 
+static void htb_mark_for_del(struct Qdisc *sch)
+{
+	struct htb_sched *q = qdisc_priv(sch);
+	struct hlist_node *next;
+	struct htb_class *cl;
+	unsigned int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
+					  common.hnode)
+			if (!cl->level)
+				qdisc_mark_for_del(cl->leaf.q);
+	}
+
+	if (q->direct_qdiscs) {
+		for (i = 0; i < q->num_direct_qdiscs && q->direct_qdiscs[i];
+		     i++)
+			qdisc_mark_for_del(q->direct_qdiscs[i]);
+	}
+}
+
 static int htb_delete(struct Qdisc *sch, unsigned long arg,
 		      struct netlink_ext_ack *extack)
 {
@@ -2148,6 +2169,7 @@ static struct Qdisc_ops htb_qdisc_ops __read_mostly = {
 	.reset		=	htb_reset,
 	.destroy	=	htb_destroy,
 	.dump		=	htb_dump,
+	.mark_for_del	=	htb_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("htb");
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index 0ed199fa18f0..1f9369ebc1dc 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -59,6 +59,19 @@ void mq_destroy_common(struct Qdisc *sch)
 }
 EXPORT_SYMBOL_NS_GPL(mq_destroy_common, "NET_SCHED_INTERNAL");
 
+void mq_mark_for_del_common(struct Qdisc *sch)
+{
+	struct mq_sched *priv = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx;
+
+	if (!priv->qdiscs)
+		return;
+	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+		qdisc_mark_for_del(priv->qdiscs[ntx]);
+}
+EXPORT_SYMBOL_NS_GPL(mq_mark_for_del_common, "NET_SCHED_INTERNAL");
+
 static void mq_destroy(struct Qdisc *sch)
 {
 	mq_offload(sch, TC_MQ_DESTROY);
@@ -297,5 +310,6 @@ struct Qdisc_ops mq_qdisc_ops __read_mostly = {
 	.attach		= mq_attach,
 	.change_real_num_tx = mq_change_real_num_tx,
 	.dump		= mq_dump,
+	.mark_for_del	= mq_mark_for_del_common,
 	.owner		= THIS_MODULE,
 };
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index b83276409416..0566e854e156 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -112,6 +112,20 @@ static void mqprio_destroy(struct Qdisc *sch)
 		netdev_set_num_tc(dev, 0);
 }
 
+static void mqprio_mark_for_del(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned int ntx;
+
+	if (priv->qdiscs) {
+		for (ntx = 0;
+		     ntx < dev->num_tx_queues && priv->qdiscs[ntx];
+		     ntx++)
+			qdisc_mark_for_del(priv->qdiscs[ntx]);
+	}
+}
+
 static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
 			    const struct tc_mqprio_caps *caps,
 			    struct netlink_ext_ack *extack)
@@ -769,6 +783,7 @@ static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
 	.attach		= mqprio_attach,
 	.change_real_num_tx = mq_change_real_num_tx,
 	.dump		= mqprio_dump,
+	.mark_for_del	= mqprio_mark_for_del,
 	.owner		= THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("mqprio");
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 9f822fee113d..edc60ddc0b12 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -168,6 +168,16 @@ multiq_destroy(struct Qdisc *sch)
 	kfree(q->queues);
 }
 
+static void
+multiq_mark_for_del(struct Qdisc *sch)
+{
+	struct multiq_sched_data *q = qdisc_priv(sch);
+	int band;
+
+	for (band = 0; band < q->bands; band++)
+		qdisc_mark_for_del(q->queues[band]);
+}
+
 static int multiq_tune(struct Qdisc *sch, struct nlattr *opt,
 		       struct netlink_ext_ack *extack)
 {
@@ -393,6 +403,7 @@ static struct Qdisc_ops multiq_qdisc_ops __read_mostly = {
 	.destroy	=	multiq_destroy,
 	.change		=	multiq_tune,
 	.dump		=	multiq_dump,
+	.mark_for_del	=	multiq_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("multiq");
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 5de1c932944a..134716ec507f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -1155,6 +1155,13 @@ static void netem_destroy(struct Qdisc *sch)
 	dist_free(q->slot_dist);
 }
 
+static void netem_mark_for_del(struct Qdisc *sch)
+{
+	struct netem_sched_data *q = qdisc_priv(sch);
+
+	qdisc_mark_for_del(q->qdisc);
+}
+
 static int dump_loss_model(const struct netem_sched_data *q,
 			   struct sk_buff *skb)
 {
@@ -1353,6 +1360,7 @@ static struct Qdisc_ops netem_qdisc_ops __read_mostly = {
 	.destroy	=	netem_destroy,
 	.change		=	netem_change,
 	.dump		=	netem_dump,
+	.mark_for_del	=	netem_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("netem");
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 9e2b9a490db2..3a8cc47cd6a4 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -173,6 +173,16 @@ prio_destroy(struct Qdisc *sch)
 		qdisc_put(q->queues[prio]);
 }
 
+static void
+prio_mark_for_del(struct Qdisc *sch)
+{
+	struct prio_sched_data *q = qdisc_priv(sch);
+	int prio;
+
+	for (prio = 0; prio < q->bands; prio++)
+		qdisc_mark_for_del(q->queues[prio]);
+}
+
 static int prio_tune(struct Qdisc *sch, struct nlattr *opt,
 		     struct netlink_ext_ack *extack)
 {
@@ -416,6 +426,7 @@ static struct Qdisc_ops prio_qdisc_ops __read_mostly = {
 	.destroy	=	prio_destroy,
 	.change		=	prio_tune,
 	.dump		=	prio_dump,
+	.mark_for_del	=	prio_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("prio");
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 699e45873f86..6b72a626dc74 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1510,6 +1510,19 @@ static void qfq_destroy_qdisc(struct Qdisc *sch)
 	qdisc_class_hash_destroy(&q->clhash);
 }
 
+static void qfq_mark_qdisc_for_del(struct Qdisc *sch)
+{
+	struct qfq_sched *q = qdisc_priv(sch);
+	struct qfq_class *cl;
+	unsigned int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i],
+				     common.hnode)
+			qdisc_mark_for_del(cl->qdisc);
+	}
+}
+
 static const struct Qdisc_class_ops qfq_class_ops = {
 	.change		= qfq_change_class,
 	.delete		= qfq_delete_class,
@@ -1535,6 +1548,7 @@ static struct Qdisc_ops qfq_qdisc_ops __read_mostly = {
 	.init		= qfq_init_qdisc,
 	.reset		= qfq_reset_qdisc,
 	.destroy	= qfq_destroy_qdisc,
+	.mark_for_del	= qfq_mark_qdisc_for_del,
 	.owner		= THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("qfq");
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 479c42d11083..e5df29ca5f7c 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -223,6 +223,13 @@ static void red_destroy(struct Qdisc *sch)
 	qdisc_put(q->qdisc);
 }
 
+static void red_mark_for_del(struct Qdisc *sch)
+{
+	struct red_sched_data *q = qdisc_priv(sch);
+
+	qdisc_mark_for_del(q->qdisc);
+}
+
 static const struct nla_policy red_policy[TCA_RED_MAX + 1] = {
 	[TCA_RED_UNSPEC] = { .strict_start_type = TCA_RED_FLAGS },
 	[TCA_RED_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
@@ -548,6 +555,7 @@ static struct Qdisc_ops red_qdisc_ops __read_mostly = {
 	.change		=	red_change,
 	.dump		=	red_dump,
 	.dump_stats	=	red_dump_stats,
+	.mark_for_del	=	red_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("red");
diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index d2835f1168e1..e4151e39a92f 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -473,6 +473,13 @@ static void sfb_destroy(struct Qdisc *sch)
 	qdisc_put(q->qdisc);
 }
 
+static void sfb_mark_for_del(struct Qdisc *sch)
+{
+	struct sfb_sched_data *q = qdisc_priv(sch);
+
+	qdisc_mark_for_del(q->qdisc);
+}
+
 static const struct nla_policy sfb_policy[TCA_SFB_MAX + 1] = {
 	[TCA_SFB_PARMS]	= { .len = sizeof(struct tc_sfb_qopt) },
 };
@@ -709,6 +716,7 @@ static struct Qdisc_ops sfb_qdisc_ops __read_mostly = {
 	.change		=	sfb_change,
 	.dump		=	sfb_dump,
 	.dump_stats	=	sfb_dump_stats,
+	.mark_for_del	=	sfb_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("sfb");
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index f721c03514f6..54bb5814f1e9 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2059,6 +2059,17 @@ static void taprio_destroy(struct Qdisc *sch)
 	taprio_cleanup_broken_mqprio(q);
 }
 
+static void taprio_mark_for_del(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int i;
+
+	if (q->qdiscs)
+		for (i = 0; i < dev->num_tx_queues; i++)
+			qdisc_mark_for_del(q->qdiscs[i]);
+}
+
 static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 		       struct netlink_ext_ack *extack)
 {
@@ -2542,6 +2553,7 @@ static struct Qdisc_ops taprio_qdisc_ops __read_mostly = {
 	.enqueue	= taprio_enqueue,
 	.dump		= taprio_dump,
 	.dump_stats	= taprio_dump_stats,
+	.mark_for_del	= taprio_mark_for_del,
 	.owner		= THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("taprio");
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index f2340164f579..ef378d5dc636 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -507,6 +507,13 @@ static void tbf_destroy(struct Qdisc *sch)
 	qdisc_put(q->qdisc);
 }
 
+static void tbf_mark_for_del(struct Qdisc *sch)
+{
+	struct tbf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_mark_for_del(q->qdisc);
+}
+
 static int tbf_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
@@ -613,6 +620,7 @@ static struct Qdisc_ops tbf_qdisc_ops __read_mostly = {
 	.destroy	=	tbf_destroy,
 	.change		=	tbf_change,
 	.dump		=	tbf_dump,
+	.mark_for_del	=	tbf_mark_for_del,
 	.owner		=	THIS_MODULE,
 };
 MODULE_ALIAS_NET_SCH("tbf");
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-07 21:20 [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete Jamal Hadi Salim
@ 2026-03-11  1:47 ` Jakub Kicinski
  2026-03-11 16:22   ` Jamal Hadi Salim
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-03-11  1:47 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Sat,  7 Mar 2026 16:20:58 -0500 Jamal Hadi Salim wrote:
> Note: We tried a couple of different approaches that had smaller code
> footprint but were a bit fugly. The first approach was to use recursion
> on the qdisc hash table to iterate the descendants of the qdisc; however,
> the challenge here is if the graph depth is "high" - we may overflow the
> stack. The second approach was to use a breadth first search to achieve
> the same goal; the challenge here was it was a quadratic algorithm.

Lots of complexity when realistically only ingress/clsact support 
the unlocked operations. Can we not just take rtnl before the
references and not bother all the real qdiscs with this @#%$ ?
(diff just to illustrate the point not even compiled)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4829c27446e3..21b461f3323d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2255,6 +2255,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	int err;
 	int tp_created;
 	bool rtnl_held = false;
+	bool rtnl_take = false;
 	u32 flags;
 
 replay:
@@ -2290,11 +2291,17 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		}
 	}
 
+	/* Realistically only INGRESS supports unlocked ops */
+	if (parent != TC_H_INGRESS) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
 	/* Find head of filter chain. */
 
 	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
 	if (err)
-		return err;
+		goto errout;
 
 	if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
 		NL_SET_ERR_MSG(extack, "Specified TC filter name too long");
@@ -2306,11 +2313,12 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	 * block is shared (no qdisc found), qdisc is not unlocked, classifier
 	 * type is not specified, classifier is not unlocked.
 	 */
-	if (rtnl_held ||
+	if (rtnl_take ||
 	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
 	    !tcf_proto_is_unlocked(name)) {
+		if (!rtnl_held)
+			rtnl_lock();
 		rtnl_held = true;
-		rtnl_lock();
 	}
 
 	err = __tcf_qdisc_cl_find(q, parent, &cl, t->tcm_ifindex, extack);
@@ -2451,17 +2459,16 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 	tcf_block_release(q, block, rtnl_held);
 
-	if (rtnl_held)
-		rtnl_unlock();
-
 	if (err == -EAGAIN) {
 		/* Take rtnl lock in case EAGAIN is caused by concurrent flush
 		 * of target chain.
 		 */
-		rtnl_held = true;
+		rtnl_take = true;
 		/* Replay the request. */
 		goto replay;
 	}
+	if (rtnl_held)
+		rtnl_unlock();
 	return err;
 
 errout_locked:
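
The locking decision above can be reduced to a small userspace sketch (the helper name and the simulation are illustrative; only the TC_H_INGRESS constant matches the kernel UAPI - this models the diff's intent, it is not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define TC_H_INGRESS 0xFFFFFFF1U	/* include/uapi/linux/pkt_sched.h */

/* Model of the control flow in the diff: rtnl is taken up front for
 * every parent except ingress (the only case that realistically
 * supports unlocked ops), and the rtnl_take flag forces the lock when
 * the request is replayed after -EAGAIN. Returns whether the request
 * would run with rtnl held. */
static bool runs_with_rtnl(unsigned int parent, bool rtnl_take)
{
	bool rtnl_held = false;

	if (parent != TC_H_INGRESS)
		rtnl_held = true;	/* rtnl_lock() before the lookup */
	if (rtnl_take && !rtnl_held)
		rtnl_held = true;	/* rtnl_lock() on the replay path */
	return rtnl_held;
}
```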

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-11  1:47 ` Jakub Kicinski
@ 2026-03-11 16:22   ` Jamal Hadi Salim
  2026-03-12  0:52     ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-11 16:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Tue, Mar 10, 2026 at 9:47 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat,  7 Mar 2026 16:20:58 -0500 Jamal Hadi Salim wrote:
> > Note: We tried a couple of different approaches that had smaller code
> > footprint but were a bit fugly. The first approach was to use recursion
> > on the qdisc hash table to iterate the descendants of the qdisc; however,
> > the challenge here is if the graph depth is "high" - we may overflow the
> > stack. The second approach was to use a breadth first search to achieve
> > the same goal; the challenge here was it was a quadratic algorithm.
>
> Lots of complexity when realistically only ingress/clsact support
> the unlocked operations. Can we not just take rtnl before the
> references and not bother all the real qdiscs with this @#%$ ?
> (diff just to illustrate the point not even compiled)
>

Two of the several (I think 4!) patches we had took a similar path. I
am trying to remember; at least one variant was bad for performance and
the other was unstable. Let's see if we can revive it and take a
closer look. BTW - none were pretty, it was maybe half the lines of
code but touched many things.

cheers,
jamal

> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 4829c27446e3..21b461f3323d 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -2255,6 +2255,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>         int err;
>         int tp_created;
>         bool rtnl_held = false;
> +       bool rtnl_take = false
>         u32 flags;
>
>  replay:
> @@ -2290,11 +2291,17 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>                 }
>         }
>
> +       /* Realistically only INGRESS supports unlocked ops */
> +       if (parent != TC_H_INGRESS) {
> +               rtnl_held = true;
> +               rtnl_lock();
> +       }
> +
>         /* Find head of filter chain. */
>
>         err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
>         if (err)
> -               return err;
> +               goto errout;
>
>         if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
>                 NL_SET_ERR_MSG(extack, "Specified TC filter name too long");
> @@ -2306,11 +2313,12 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>          * block is shared (no qdisc found), qdisc is not unlocked, classifier
>          * type is not specified, classifier is not unlocked.
>          */
> -       if (rtnl_held ||
> +       if (rtnl_take ||
>             (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
>             !tcf_proto_is_unlocked(name)) {
> +               if (!rtnl_held)
> +                       rtnl_lock();
>                 rtnl_held = true;
> -               rtnl_lock();
>         }
>
>         err = __tcf_qdisc_cl_find(q, parent, &cl, t->tcm_ifindex, extack);
> @@ -2451,17 +2459,16 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>         }
>         tcf_block_release(q, block, rtnl_held);
>
> -       if (rtnl_held)
> -               rtnl_unlock();
> -
>         if (err == -EAGAIN) {
>                 /* Take rtnl lock in case EAGAIN is caused by concurrent flush
>                  * of target chain.
>                  */
> -               rtnl_held = true;
> +               rtnl_take = true;
>                 /* Replay the request. */
>                 goto replay;
>         }
> +       if (rtnl_held)
> +               rtnl_unlock();
>         return err;
>
>  errout_locked:

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-11 16:22   ` Jamal Hadi Salim
@ 2026-03-12  0:52     ` Jakub Kicinski
  2026-03-12 20:36       ` Jamal Hadi Salim
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-03-12  0:52 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Wed, 11 Mar 2026 12:22:41 -0400 Jamal Hadi Salim wrote:
> > On Sat,  7 Mar 2026 16:20:58 -0500 Jamal Hadi Salim wrote:  
> > > Note: We tried a couple of different approaches that had smaller code
> > > footprint but were a bit fugly. The first approach was to use recursion
> > > on the qdisc hash table to iterate the descendants of the qdisc; however,
> > > the challenge here is if the graph depth is "high" - we may overflow the
> > > stack. The second approach was to use a breadth first search to achieve
> > > the same goal; the challenge here was it was a quadratic algorithm.  
> >
> > Lots of complexity when realistically only ingress/clsact support
> > the unlocked operations. Can we not just take rtnl before the
> > references and not bother all the real qdiscs with this @#%$ ?
> > (diff just to illustrate the point not even compiled)
> 
> Two of the several (I think 4!) patches we had took a similar path. I
> am trying to remember at least one variant was bad for performance and
> the other was unstable. Let's see if we can revive it and take a
> closer look. BTW - none were pretty, it was maybe half the lines of
> code but touched many things.

FWIW / of course, we have to apply similar change to all(?) callers of
__tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty long.
And it's not going to help the already spaghetti-looking locking. But
even if it's more LoC I quite like the idea of containing the poopy
code to where problems originate which is the lockless filter handling.
Fingers crossed..

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-12  0:52     ` Jakub Kicinski
@ 2026-03-12 20:36       ` Jamal Hadi Salim
  2026-03-12 23:51         ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-12 20:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

[-- Attachment #1: Type: text/plain, Size: 2148 bytes --]

On Wed, Mar 11, 2026 at 8:52 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 11 Mar 2026 12:22:41 -0400 Jamal Hadi Salim wrote:
> > > On Sat,  7 Mar 2026 16:20:58 -0500 Jamal Hadi Salim wrote:
> > > > Note: We tried a couple of different approaches that had smaller code
> > > > footprint but were a bit fugly. The first approach was to use recursion
> > > > on the qdisc hash table to iterate the descendants of the qdisc; however,
> > > > the challenge here is if the graph depth is "high" - we may overflow the
> > > > stack. The second approach was to use a breadth first search to achieve
> > > > the same goal; the challenge here was it was a quadratic algorithm.
> > >
> > > Lots of complexity when realistically only ingress/clsact support
> > > the unlocked operations. Can we not just take rtnl before the
> > > references and not bother all the real qdiscs with this @#%$ ?
> > > (diff just to illustrate the point not even compiled)
> >
> > Two of the several (I think 4!) patches we had took a similar path. I
> > am trying to remember at least one variant was bad for performance and
> > the other was unstable. Let's see if we can revive it and take a
> > closer look. BTW - none were pretty, it was maybe half the lines of
> > code but touched many things.
>
> FWIW / of course, we have to apply similar change to all(?) callers of
> __tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty long.
> And it's not going to help the already spaghetti-looking locking. But
> even if it's more LoC I quite like the idea of containing the poopy
> code to where problems originate which is the lockless filter handling.
> Fingers crossed..

Something like attached.
Unfortunately after running it for a few hours it reproduced.
The action code path (entered by virtue of filter code path execution)
releases the rtnl when attempting to load an action module. A parallel
qdisc operation waiting for the lock then grabs it and we hit the same
issue...

So now we have to be more invasive and start coordinating the action
code etc, which is not appealing. Thoughts?

cheers,
jamal

[-- Attachment #2: acquire_rtnl_before_qdisc_find.patch --]
[-- Type: text/x-patch, Size: 6487 bytes --]

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4829c27446e3..ee5f6fff57cb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -349,7 +349,7 @@ static bool tcf_proto_check_kind(struct nlattr *kind, char *name)
 	return false;
 }
 
-static bool tcf_proto_is_unlocked(const char *kind)
+static bool tcf_proto_is_unlocked(const char *kind, bool rtnl_held)
 {
 	const struct tcf_proto_ops *ops;
 	bool ret;
@@ -357,7 +357,7 @@ static bool tcf_proto_is_unlocked(const char *kind)
 	if (strlen(kind) == 0)
 		return false;
 
-	ops = tcf_proto_lookup_ops(kind, false, NULL);
+	ops = tcf_proto_lookup_ops(kind, rtnl_held, NULL);
 	/* On error return false to take rtnl lock. Proto lookup/create
 	 * functions will perform lookup again and properly handle errors.
 	 */
@@ -2228,6 +2228,12 @@ static bool is_qdisc_ingress(__u32 classid)
 	return (TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS));
 }
 
+static bool parent_is_ingress_or_egress(__u32 parentid)
+{
+	return (TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS) == parentid) ||
+		(TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_EGRESS) == parentid);
+}
+
 static bool is_ingress_or_clsact(struct tcf_block *block, struct Qdisc *q)
 {
 	return tcf_block_shared(block) || (q && !!(q->flags & TCQ_F_INGRESS));
@@ -2255,6 +2261,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	int err;
 	int tp_created;
 	bool rtnl_held = false;
+	bool proto_is_unlocked;
 	u32 flags;
 
 replay:
@@ -2286,29 +2293,38 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			prio_allocate = true;
 		} else {
 			NL_SET_ERR_MSG(extack, "Invalid filter command with priority of zero");
+			if (rtnl_held)
+				rtnl_unlock();
 			return -ENOENT;
 		}
 	}
 
-	/* Find head of filter chain. */
-
-	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
-	if (err)
-		return err;
-
 	if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
 		NL_SET_ERR_MSG(extack, "Specified TC filter name too long");
-		err = -EINVAL;
-		goto errout;
+		if (rtnl_held)
+			rtnl_unlock();
+		return -EINVAL;
 	}
+	proto_is_unlocked = tcf_proto_is_unlocked(name, rtnl_held);
 
-	/* Take rtnl mutex if rtnl_held was set to true on previous iteration,
-	 * block is shared (no qdisc found), qdisc is not unlocked, classifier
-	 * type is not specified, classifier is not unlocked.
-	 */
-	if (rtnl_held ||
-	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
-	    !tcf_proto_is_unlocked(name)) {
+	if (!rtnl_held && !parent_is_ingress_or_egress(parent)) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	/* Find head of filter chain. */
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, rtnl_held,
+			       extack);
+	if (err) {
+		if (rtnl_held)
+			rtnl_unlock();
+		return err;
+	}
+
+       /* Take rtnl mutex if classifier type is not specified, or classifier
+	* is not unlocked.
+	*/
+	if (!rtnl_held && !proto_is_unlocked) {
 		rtnl_held = true;
 		rtnl_lock();
 	}
@@ -2451,17 +2467,20 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 	tcf_block_release(q, block, rtnl_held);
 
-	if (rtnl_held)
-		rtnl_unlock();
-
 	if (err == -EAGAIN) {
 		/* Take rtnl lock in case EAGAIN is caused by concurrent flush
 		 * of target chain.
 		 */
+		if (!rtnl_held)
+			rtnl_lock();
 		rtnl_held = true;
 		/* Replay the request. */
 		goto replay;
 	}
+
+	if (rtnl_held)
+		rtnl_unlock();
+
 	return err;
 
 errout_locked:
@@ -2488,6 +2507,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	unsigned long cl = 0;
 	void *fh = NULL;
 	int err;
+	bool proto_is_unlocked = true;
 	bool rtnl_held = false;
 
 	err = nlmsg_parse_deprecated(n, sizeof(*t), tca, TCA_MAX,
@@ -2505,24 +2525,29 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		return -ENOENT;
 	}
 
-	/* Find head of filter chain. */
-
-	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
-	if (err)
-		return err;
-
 	if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
 		NL_SET_ERR_MSG(extack, "Specified TC filter name too long");
-		err = -EINVAL;
-		goto errout;
+		return -EINVAL;
+	}
+	proto_is_unlocked = tcf_proto_is_unlocked(name, rtnl_held);
+
+	if (!parent_is_ingress_or_egress(parent)) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	/* Find head of filter chain. */
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, rtnl_held,
+			       extack);
+	if (err) {
+		if (rtnl_held)
+			rtnl_unlock();
+		return err;
 	}
-	/* Take rtnl mutex if flushing whole chain, block is shared (no qdisc
-	 * found), qdisc is not unlocked, classifier type is not specified,
-	 * classifier is not unlocked.
+	/* Take rtnl mutex if flushing whole chain, classifier type is not
+	 * specified, classifier is not unlocked.
 	 */
-	if (!prio ||
-	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
-	    !tcf_proto_is_unlocked(name)) {
+	if (!rtnl_held && (!prio || !proto_is_unlocked)) {
 		rtnl_held = true;
 		rtnl_lock();
 	}
@@ -2648,6 +2673,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	void *fh = NULL;
 	int err;
 	bool rtnl_held = false;
+	bool proto_is_unlocked;
 
 	err = nlmsg_parse_deprecated(n, sizeof(*t), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
@@ -2664,23 +2690,29 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		return -ENOENT;
 	}
 
-	/* Find head of filter chain. */
-
-	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
-	if (err)
-		return err;
-
 	if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
 		NL_SET_ERR_MSG(extack, "Specified TC filter name too long");
-		err = -EINVAL;
-		goto errout;
+		return -EINVAL;
 	}
-	/* Take rtnl mutex if block is shared (no qdisc found), qdisc is not
-	 * unlocked, classifier type is not specified, classifier is not
-	 * unlocked.
-	 */
-	if ((q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
-	    !tcf_proto_is_unlocked(name)) {
+	proto_is_unlocked = tcf_proto_is_unlocked(name, false);
+
+	if (!parent_is_ingress_or_egress(parent)) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	/* Find head of filter chain. */
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, rtnl_held,
+			       extack);
+	if (err) {
+		if (rtnl_held)
+			rtnl_unlock();
+		return err;
+	}
+       /* Take rtnl mutex if classifier type is not specified, or classifier is
+	* not unlocked.
+	*/
+	if (!rtnl_held && !proto_is_unlocked) {
 		rtnl_held = true;
 		rtnl_lock();
 	}

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-12 20:36       ` Jamal Hadi Salim
@ 2026-03-12 23:51         ` Jakub Kicinski
  2026-03-13 15:56           ` Jamal Hadi Salim
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-03-12 23:51 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Thu, 12 Mar 2026 16:36:48 -0400 Jamal Hadi Salim wrote:
> > > Two of the several (I think 4!) patches we had took a similar path. I
> > > am trying to remember at least one variant was bad for performance and
> > > the other was unstable. Let's see if we can revive it and take a
> > > closer look. BTW - none were pretty, it was maybe half the lines of
> > > code but touched many things.  
> >
> > FWIW / of course, we have to apply similar change to all(?) callers of
> > __tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty long.
> > And it's not going to help the already spaghetti-looking locking. But
> > even if it's more LoC I quite like the idea of containing the poopy
> > code to where problems originate which is the lockless filter handling.
> > Fingers crossed..  
> 
> Something like attached.
> Unfortunately after running it for a few hours it reproduced.
> The action code path (entered by virtue of filter code path execution)
> releases the rtnl when attempting to load an action module. A parallel
> qdisc operation waiting for the lock then grabs it and we hit the same
> issue...
> 
> So now we have to be more invasive and start coordinating the action
> code etc, which is not appealing. Thoughts?

I see. Doesn't seem entirely crazy to let tcf_proto_lookup_ops()
return -EAGAIN without actually loading the module, and have its
call path (of which there are only 2?) do the module loading once
all the locks are released. The call paths already handle the EAGAIN
and retry; they just assume tcf_proto_lookup_ops() has loaded
the module so they don't have to.
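
The deferral described here can be sketched in userspace (all names - pending_mod, lookup_ops_sim, etc. - are illustrative stand-ins, not the kernel interface): the lookup records the missing module name and bails with -EAGAIN, and the caller loads the module with no locks held before replaying:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static char pending_mod[64];	/* stands in for a deferred-load slot */
static bool mod_loaded;		/* stands in for the classifier registry */

/* Lookup that never loads a module itself: it records what is missing
 * and asks the caller to replay. */
static int lookup_ops_sim(const char *kind)
{
	if (mod_loaded)
		return 0;		/* ops found */
	snprintf(pending_mod, sizeof(pending_mod), "net-cls-%s", kind);
	return -EAGAIN;
}

/* Run by the caller once all locks are released. */
static void load_mod_check_sim(void)
{
	if (pending_mod[0]) {
		mod_loaded = true;	/* request_module() equivalent */
		pending_mod[0] = '\0';
	}
}

/* Caller pattern: look up "under the lock"; on -EAGAIN, "unlock",
 * load the module, and replay the request. */
static int new_tfilter_sim(const char *kind)
{
	int err;

	while ((err = lookup_ops_sim(kind)) == -EAGAIN)
		load_mod_check_sim();
	return err;
}
```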

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-12 23:51         ` Jakub Kicinski
@ 2026-03-13 15:56           ` Jamal Hadi Salim
  2026-03-13 19:36             ` Jamal Hadi Salim
  0 siblings, 1 reply; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-13 15:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Thu, Mar 12, 2026 at 7:51 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 12 Mar 2026 16:36:48 -0400 Jamal Hadi Salim wrote:
> > > > Two of the several (I think 4!) patches we had took a similar path. I
> > > > am trying to remember at least one variant was bad for performance and
> > > > the other was unstable. Let's see if we can revive it and take a
> > > > closer look. BTW - none were pretty, it was maybe half the lines of
> > > > code but touched many things.
> > >
> > > FWIW / of course, we have to apply similar change to all(?) callers of
> > > __tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty long.
> > > And it's not going to help the already spaghetti-looking locking. But
> > > even if it's more LoC I quite like the idea of containing the poopy
> > > code to where problems originate which is the lockless filter handling.
> > > Fingers crossed..
> >
> > Something like attached.
> > Unfortunately after running it for a few hours it reproduced.
> > The action code path (entered by virtue of filter code path execution)
> > releases the rtnl when attempting to load an action module. A parallel
> > qdisc operation waiting for the lock then grabs it and we hit the same
> > issue...
> >
> > So now we have to be more invasive and start coordinating the action
> > code etc, which is not appealing. Thoughts?
>
> I see. Doesn't seem entirely crazy to let tcf_proto_lookup_ops()
> return -EAGAIN without actually loading the module, and have its
> call path (of which there are only 2?) do the module loading once
> all the locks are released. The call paths handle the EAGAIN and
> retry already they just assume tcf_proto_lookup_ops() has loaded
> the module so they don't have to.

I might not be entirely following what you are saying. The core issue
is not module loading per se. We can avoid module loading in
tcf_proto_lookup_ops while the rtnl lock is released and the qdisc
refcnt is greater than 1 (which is what the patch I sent attempted).
The new quirk is when we have actions associated with the filter.

I will send you a much lengthier description of the issue in an hour
or two to get us on the same page.

cheers,
jamal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-13 15:56           ` Jamal Hadi Salim
@ 2026-03-13 19:36             ` Jamal Hadi Salim
  2026-03-14 15:00               ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-13 19:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Fri, Mar 13, 2026 at 11:56 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Mar 12, 2026 at 7:51 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 12 Mar 2026 16:36:48 -0400 Jamal Hadi Salim wrote:
> > > > > Two of the several (I think 4!) patches we had took a similar path. I
> > > > > am trying to remember at least one variant was bad for performance and
> > > > > the other was unstable. Let's see if we can revive it and take a
> > > > > closer look. BTW - none were pretty, it was maybe half the lines of
> > > > > code but touched many things.
> > > >
> > > > FWIW / of course, we have to apply similar change to all(?) callers of
> > > > __tcf_qdisc_find in cls_api. So LOC-wise it may end up also pretty long.
> > > > And it's not going to help the already spaghetti-looking locking. But
> > > > even if it's more LoC I quite like the idea of containing the poopy
> > > > code to where problems originate which is the lockless filter handling.
> > > > Fingers crossed..
> > >
> > > Something like attached.
> > > Unfortunately after running it for a few hours it reproduced.
> > > The action code path (entered by virtue of filter code path execution)
> > > releases the rtnl when attempting to load an action module. A parallel
> > > qdisc operation waiting for the lock then grabs it and we hit the same
> > > issue...
> > >
> > > So now we have to be more invasive and start coordinating the action
> > > code etc, which is not appealing. Thoughts?
> >
> > I see. Doesn't seem entirely crazy to let tcf_proto_lookup_ops()
> > return -EAGAIN without actually loading the module, and have it's
> > call path (of which there are only 2?) do the module loading once
> > all the locks are released. The call paths handle the EAGAIN and
> > retry already they just assume tcf_proto_lookup_ops() has loaded
> > the module so they don't have to.
>
> I might not be entirely following what you are saying. The core issue
> is not module loading per se. We can avoid module loading in
> tcf_proto_lookup_ops while the rtnl lock is released and the qdisc
> refcnt is greater than 1 (which is what the patch I sent attempted).
> The new quirk is when we have actions associated with the filter.
>
> I will send you a much lengthier description of the issue in an hour
> or two  to get us on the same page.

Note, I am going to describe the scenario we saw with the action quirk
where "filter add" enters before "qdisc del"; the one described in the
commit is the reverse order. We see the issue in both scenarios.

Sorry, this is a lot of text ;->

Let me explain the "expected" behavior
=================================================================

time x:   On cpu1: "filter add" enters and increments refcnt to 2 after
          finding the root qdisc (see __tcf_qdisc_find)
time x+1: On cpu1: "filter add" in the action code releases the rtnl_lock
          before calling request_module.
time x+1: On cpu2: "qdisc del" enters for root qdisc, holding rtnl
          (example refcnt = 1), see tc_get_qdisc().
time x+2: On cpu2: "qdisc del" calls graft which calls qdisc_put() which
          decrements refcnt to 1 and can't delete the qdisc because refcnt is not 0;
          "qdisc del" exits
          ***We are in an inconsistent state at this point - IOW, the qdisc
          needs to be deleted but can't be because the refcnt != 0.
time x+3: On cpu1: "filter add" acquires the rtnl_lock again in the action
          code (see time x+1).
time x+4: On cpu1: "filter add" eventually completes; see tcf_block_release()
          invocation in tc_new_tfilter() which eventually calls qdisc_put()
          which manages to decrement refcnt to 0. Qdisc is deleted..
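
The refcount choreography in the timeline above can be condensed into a toy model (userspace sketch; qdisc_get_sim/qdisc_put_sim are illustrative names, not the kernel helpers):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the lifecycle above: the root qdisc starts with one
 * reference, "filter add" takes a second, and "qdisc del" can only
 * defer the actual free to whoever drops the last reference. */
struct qdisc_sim {
	int refcnt;
	bool freed;
};

static void qdisc_get_sim(struct qdisc_sim *q)
{
	q->refcnt++;			/* __tcf_qdisc_find() equivalent */
}

/* Returns true only when the final reference is dropped. */
static bool qdisc_put_sim(struct qdisc_sim *q)
{
	if (--q->refcnt == 0) {
		q->freed = true;
		return true;
	}
	return false;			/* deletion deferred; qdisc lingers */
}
```

In the "expected" flow, the second put (from tcf_block_release()) is what finally frees the qdisc.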

Reproducing the issue requires 3 threads - which is what the repro does.
Note: the sequence from time x to x+2 is the same as the "expected" flow above.
===============================================================

time x:   On cpu1: "filter add" enters and increments refcnt to 2 after
          finding the root qdisc (see __tcf_qdisc_find)
time x+1: On cpu1: "filter add" in the action code releases the rtnl_lock
          before calling request_module. Note: the action code path will be
          momentarily rescheduled because of the request_module call.
time x+1: On cpu2: "qdisc del" enters for root qdisc, holding rtnl
          (example refcnt = 1), see tc_get_qdisc().
time x+2: On cpu2: "qdisc del" calls graft which calls qdisc_put() which
          decrements refcnt to 1 and can't delete the qdisc because refcnt is not 0;
          "qdisc del" exits
          ***We are in an inconsistent state at this point - IOW, the qdisc
          needs to be deleted but can't be because the refcnt != 0.
time x+3: On cpu3: "qdisc add" comes in and finds the deleted qdisc still
          on the qdisc hash table since it hasn't been deleted (see the
          inconsistent state at time x+2). And then shit hits the fan (as
          described in the commit, qdisc_tree_reduce_backlog() etc).

In this specific example, the issue is that the classifier code path
can't release the rtnl_lock while the qdisc's refcnt is bigger than 1.

Does this make more sense?
The reason we went with the "mark for delete" approach is that at time x+3
the "qdisc add" won't be able to find this qdisc. This is the commonly
observed pattern - for example, the one described in the commit message,
where we have a slightly different flow with "qdisc del" before "filter
add".
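
A minimal sketch of the "mark for delete" idea (userspace model with a single slot standing in for the qdisc hash table; every name here is illustrative, not the patch's interface): once graft cannot free the qdisc because of the elevated refcnt, it is flagged, and a later lookup by handle - the "qdisc add" at time x+3 - refuses to return it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One slot standing in for the qdisc hash table. */
struct sim_qdisc {
	unsigned int handle;
	int refcnt;
	bool marked_del;	/* set when graft could not free it */
};

static struct sim_qdisc *slot;

/* Lookup by handle, skipping qdiscs already marked for deletion. */
static struct sim_qdisc *qdisc_lookup_sim(unsigned int handle)
{
	if (slot && slot->handle == handle && !slot->marked_del)
		return slot;
	return NULL;
}

/* Graft-time deletion: free if this was the last ref, otherwise mark
 * the qdisc so no new user can find it while it lingers. */
static void qdisc_del_sim(struct sim_qdisc *q)
{
	if (--q->refcnt == 0)
		slot = NULL;		/* freed immediately */
	else
		q->marked_del = true;	/* deferred: hide from lookups */
}
```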

cheers,
jamal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-13 19:36             ` Jamal Hadi Salim
@ 2026-03-14 15:00               ` Jakub Kicinski
  2026-03-15 15:56                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-03-14 15:00 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Fri, 13 Mar 2026 15:36:28 -0400 Jamal Hadi Salim wrote:
> In this specific example, the issue is that the classifier code path
> can't release the rtnl_lock while the qdisc's refcnt is bigger than 1.
> 
> Does this make more sense?
> The reason we went with the "mark for delete" approach is at time x+3
> the "qdisc add" wont be able to find this qdisc. This is the common
> observed pattern - for example described in the commit message where
> we get have a slightly different flow with "qdisc del" before "filter
> add".

Maybe a (completely untested) diff will help illustrate my thinking 
better than words:

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 99ac747b7906..c13c9e8619e4 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -27,6 +27,9 @@ void unregister_tcf_proto_ops(struct tcf_proto_ops *ops);
 #define NET_CLS_ALIAS_PREFIX "net-cls-"
 #define MODULE_ALIAS_NET_CLS(kind)	MODULE_ALIAS(NET_CLS_ALIAS_PREFIX kind)
 
+extern char *rtnl_load_mod;
+void rtnl_load_mod_check(void);
+
 struct tcf_block_ext_info {
 	enum flow_block_binder_type binder_type;
 	tcf_chain_head_change_t *chain_head_change;
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 332fd9695e54..c21dd2e36592 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1368,11 +1368,15 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
 #ifdef CONFIG_MODULES
 		bool rtnl_held = !(flags & TCA_ACT_FLAGS_NO_RTNL);
 
-		if (rtnl_held)
-			rtnl_unlock();
+		if (rtnl_held) {
+			if (WARN_ON_ONCE(rtnl_load_mod))
+				return ERR_PTR(-EINVAL);
+			rtnl_load_mod = kasprintf(GFP_KERNEL,
+						  NET_ACT_ALIAS_PREFIX "%s",
+						  act_name);
+			return ERR_PTR(-EAGAIN);
+		}
 		request_module(NET_ACT_ALIAS_PREFIX "%s", act_name);
-		if (rtnl_held)
-			rtnl_lock();
 
 		a_o = tc_lookup_action_n(act_name);
 
@@ -2107,6 +2111,9 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 				      &attr_size, flags, 0, extack);
 		if (ret != -EAGAIN)
 			break;
+		rtnl_unlock();
+		rtnl_load_mod_check();
+		rtnl_lock();
 	}
 
 	if (ret < 0)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 4829c27446e3..1b0f762d6e4b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -46,6 +46,19 @@
 /* The list of all installed classifier types */
 static LIST_HEAD(tcf_proto_base);
 
+char *rtnl_load_mod;
+
+void rtnl_load_mod_check(void)
+{
+	char *mod = rtnl_load_mod;
+
+	if (mod) {
+		rtnl_load_mod = NULL;
+		request_module("%s", mod);
+		kfree(mod);
+	}
+}
+
 /* Protects list of registered TC modules. It is pure SMP lock. */
 static DEFINE_RWLOCK(cls_mod_lock);
 
@@ -255,17 +268,15 @@ tcf_proto_lookup_ops(const char *kind, bool rtnl_held,
 	if (ops)
 		return ops;
 #ifdef CONFIG_MODULES
-	if (rtnl_held)
-		rtnl_unlock();
+	if (rtnl_held) {
+		if (WARN_ON_ONCE(rtnl_load_mod))
+			return ERR_PTR(-EINVAL);
+		rtnl_load_mod = kasprintf(GFP_KERNEL,
+					  NET_CLS_ALIAS_PREFIX "%s", kind);
+		return ERR_PTR(-EAGAIN);
+	}
 	request_module(NET_CLS_ALIAS_PREFIX "%s", kind);
-	if (rtnl_held)
-		rtnl_lock();
 	ops = __tcf_proto_lookup_ops(kind);
-	/* We dropped the RTNL semaphore in order to perform
-	 * the module load. So, even if we succeeded in loading
-	 * the module we have to replay the request. We indicate
-	 * this using -EAGAIN.
-	 */
 	if (ops) {
 		module_put(ops->owner);
 		return ERR_PTR(-EAGAIN);
@@ -2459,6 +2470,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		 * of target chain.
 		 */
 		rtnl_held = true;
+		rtnl_load_mod_check();
 		/* Replay the request. */
 		goto replay;
 	}
@@ -3230,9 +3242,13 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 	tcf_chain_put(chain);
 errout_block:
 	tcf_block_release(q, block, true);
-	if (err == -EAGAIN)
+	if (err == -EAGAIN) {
+		rtnl_unlock();
+		rtnl_load_mod_check();
+		rtnl_lock();
 		/* Replay the request. */
 		goto replay;
+	}
 	return err;
 
 errout_block_locked:

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete
  2026-03-14 15:00               ` Jakub Kicinski
@ 2026-03-15 15:56                 ` Jamal Hadi Salim
  0 siblings, 0 replies; 10+ messages in thread
From: Jamal Hadi Salim @ 2026-03-15 15:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, horms, jiri, toke,
	vinicius.gomes, stephen, vladbu, cake, bpf, ghandatmanas,
	km.kim1503, security, Victor Nogueira

On Sat, Mar 14, 2026 at 11:00 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 13 Mar 2026 15:36:28 -0400 Jamal Hadi Salim wrote:
> > In this specific example, the issue is that the classifier code path
> > can't release the rtnl_lock while the qdisc's refcnt is bigger than 1.
> >
> > Does this make more sense?
> > The reason we went with the "mark for delete" approach is that at time x+3
> > the "qdisc add" won't be able to find this qdisc. This is the commonly
> > observed pattern - for example, as described in the commit message, we
> > have a slightly different flow with "qdisc del" before "filter
> > add".
>
> Maybe a (completely untested) diff will help illustrate my thinking
> better than words:
>

Don't have much time today, but will try in the upcoming days.

cheers,
jamal

> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index 99ac747b7906..c13c9e8619e4 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -27,6 +27,9 @@ void unregister_tcf_proto_ops(struct tcf_proto_ops *ops);
>  #define NET_CLS_ALIAS_PREFIX "net-cls-"
>  #define MODULE_ALIAS_NET_CLS(kind)     MODULE_ALIAS(NET_CLS_ALIAS_PREFIX kind)
>
> +extern char *rtnl_load_mod;
> +void rtnl_load_mod_check(void);
> +
>  struct tcf_block_ext_info {
>         enum flow_block_binder_type binder_type;
>         tcf_chain_head_change_t *chain_head_change;
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 332fd9695e54..c21dd2e36592 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1368,11 +1368,15 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
>  #ifdef CONFIG_MODULES
>                 bool rtnl_held = !(flags & TCA_ACT_FLAGS_NO_RTNL);
>
> -               if (rtnl_held)
> -                       rtnl_unlock();
> +               if (rtnl_held) {
> +                       if (WARN_ON_ONCE(rtnl_load_mod))
> +                               return ERR_PTR(-EINVAL);
> +                       rtnl_load_mod = kasprintf(GFP_KERNEL,
> +                                                 NET_ACT_ALIAS_PREFIX "%s",
> +                                                 act_name);
> +                       return ERR_PTR(-EAGAIN);
> +               }
>                 request_module(NET_ACT_ALIAS_PREFIX "%s", act_name);
> -               if (rtnl_held)
> -                       rtnl_lock();
>
>                 a_o = tc_lookup_action_n(act_name);
>
> @@ -2107,6 +2111,9 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
>                                       &attr_size, flags, 0, extack);
>                 if (ret != -EAGAIN)
>                         break;
> +               rtnl_unlock();
> +               rtnl_load_mod_check();
> +               rtnl_lock();
>         }
>
>         if (ret < 0)
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 4829c27446e3..1b0f762d6e4b 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -46,6 +46,19 @@
>  /* The list of all installed classifier types */
>  static LIST_HEAD(tcf_proto_base);
>
> +char *rtnl_load_mod;
> +
> +void rtnl_load_mod_check(void)
> +{
> +       char *mod = rtnl_load_mod;
> +
> +       if (mod) {
> +               rtnl_load_mod = NULL;
> +               request_module("%s", mod);
> +               kfree(mod);
> +       }
> +}
> +
>  /* Protects list of registered TC modules. It is pure SMP lock. */
>  static DEFINE_RWLOCK(cls_mod_lock);
>
> @@ -255,17 +268,15 @@ tcf_proto_lookup_ops(const char *kind, bool rtnl_held,
>         if (ops)
>                 return ops;
>  #ifdef CONFIG_MODULES
> -       if (rtnl_held)
> -               rtnl_unlock();
> +       if (rtnl_held) {
> +               if (WARN_ON_ONCE(rtnl_load_mod))
> +                       return ERR_PTR(-EINVAL);
> +               rtnl_load_mod = kasprintf(GFP_KERNEL,
> +                                         NET_CLS_ALIAS_PREFIX "%s", kind);
> +               return ERR_PTR(-EAGAIN);
> +       }
>         request_module(NET_CLS_ALIAS_PREFIX "%s", kind);
> -       if (rtnl_held)
> -               rtnl_lock();
>         ops = __tcf_proto_lookup_ops(kind);
> -       /* We dropped the RTNL semaphore in order to perform
> -        * the module load. So, even if we succeeded in loading
> -        * the module we have to replay the request. We indicate
> -        * this using -EAGAIN.
> -        */
>         if (ops) {
>                 module_put(ops->owner);
>                 return ERR_PTR(-EAGAIN);
> @@ -2459,6 +2470,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>                  * of target chain.
>                  */
>                 rtnl_held = true;
> +               rtnl_load_mod_check();
>                 /* Replay the request. */
>                 goto replay;
>         }
> @@ -3230,9 +3242,13 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
>         tcf_chain_put(chain);
>  errout_block:
>         tcf_block_release(q, block, true);
> -       if (err == -EAGAIN)
> +       if (err == -EAGAIN) {
> +               rtnl_unlock();
> +               rtnl_load_mod_check();
> +               rtnl_lock();
>                 /* Replay the request. */
>                 goto replay;
> +       }
>         return err;
>
>  errout_block_locked:

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-03-15 15:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-07 21:20 [PATCH net] net/sched: Mark qdisc for deletion if graft cannot delete Jamal Hadi Salim
2026-03-11  1:47 ` Jakub Kicinski
2026-03-11 16:22   ` Jamal Hadi Salim
2026-03-12  0:52     ` Jakub Kicinski
2026-03-12 20:36       ` Jamal Hadi Salim
2026-03-12 23:51         ` Jakub Kicinski
2026-03-13 15:56           ` Jamal Hadi Salim
2026-03-13 19:36             ` Jamal Hadi Salim
2026-03-14 15:00               ` Jakub Kicinski
2026-03-15 15:56                 ` Jamal Hadi Salim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox