public inbox for stable@vger.kernel.org
* [PATCH 02/12] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree"
       [not found] <20260313211646.12549-1-stephen@networkplumber.org>
@ 2026-03-13 21:15 ` Stephen Hemminger
  2026-03-13 21:15 ` [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication Stephen Hemminger
  2026-03-13 21:15 ` [PATCH 04/12] net/sched: netem: restructure dequeue to avoid re-entrancy with child qdisc Stephen Hemminger
  2 siblings, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2026-03-13 21:15 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Ji-Soo Chung, Gerlinde, stable, Cong Wang

This reverts commit ec8e0e3d7adef940cdf9475e2352c0680189d14e.

The restriction breaks valid uses of netem such as using different
netem values on different branches of HTB. This even broke some
of the examples in the netem documentation.

The intent of blocking unbounded recursion is instead handled in
the next patch.

Fixes: ec8e0e3d7adef ("net/sched: Restrict conditions for adding duplicating netems to qdisc tree")
Reported-by: Ji-Soo Chung <jschung2@proton.me>
Reported-by: Gerlinde <lrGerlinde@mailfence.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
Cc: stable@vger.kernel.org
Originally-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 net/sched/sch_netem.c | 40 ----------------------------------------
 1 file changed, 40 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 5de1c932944a..0ccf74a9cb82 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -974,41 +974,6 @@ static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
 	return 0;
 }
 
-static const struct Qdisc_class_ops netem_class_ops;
-
-static int check_netem_in_tree(struct Qdisc *sch, bool duplicates,
-			       struct netlink_ext_ack *extack)
-{
-	struct Qdisc *root, *q;
-	unsigned int i;
-
-	root = qdisc_root_sleeping(sch);
-
-	if (sch != root && root->ops->cl_ops == &netem_class_ops) {
-		if (duplicates ||
-		    ((struct netem_sched_data *)qdisc_priv(root))->duplicate)
-			goto err;
-	}
-
-	if (!qdisc_dev(root))
-		return 0;
-
-	hash_for_each(qdisc_dev(root)->qdisc_hash, i, q, hash) {
-		if (sch != q && q->ops->cl_ops == &netem_class_ops) {
-			if (duplicates ||
-			    ((struct netem_sched_data *)qdisc_priv(q))->duplicate)
-				goto err;
-		}
-	}
-
-	return 0;
-
-err:
-	NL_SET_ERR_MSG(extack,
-		       "netem: cannot mix duplicating netems with other netems in tree");
-	return -EINVAL;
-}
-
 /* Parse netlink message to set options */
 static int netem_change(struct Qdisc *sch, struct nlattr *opt,
 			struct netlink_ext_ack *extack)
@@ -1067,11 +1032,6 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt,
 	q->gap = qopt->gap;
 	q->counter = 0;
 	q->loss = qopt->loss;
-
-	ret = check_netem_in_tree(sch, qopt->duplicate, extack);
-	if (ret)
-		goto unlock;
-
 	q->duplicate = qopt->duplicate;
 
 	/* for compatibility with earlier versions.
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
       [not found] <20260313211646.12549-1-stephen@networkplumber.org>
  2026-03-13 21:15 ` [PATCH 02/12] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Stephen Hemminger
@ 2026-03-13 21:15 ` Stephen Hemminger
  2026-03-14 19:29   ` William Liu
  2026-03-13 21:15 ` [PATCH 04/12] net/sched: netem: restructure dequeue to avoid re-entrancy with child qdisc Stephen Hemminger
  2 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2026-03-13 21:15 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, stable, William Liu, Savino Dicanosa

Add a per-CPU recursion depth counter to netem_enqueue(). When netem
duplicates a packet, the clone is re-enqueued at the root qdisc. If
the tree contains other netem instances, this can recurse without
bound, causing soft lockups and OOM.

This approach was previously considered but rejected on the grounds
that netem_dequeue calling enqueue on a child netem could bypass the
depth check. That concern does not apply: the child netem's
netem_enqueue() increments the same per-CPU counter, so the total
nesting depth across all netem instances in the call chain is tracked
correctly.

A depth limit of 4 is generous for any legitimate configuration.

Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
Cc: stable@vger.kernel.org
Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 net/sched/sch_netem.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 0ccf74a9cb82..085fa3ad6f83 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -21,6 +21,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/reciprocal_div.h>
 #include <linux/rbtree.h>
+#include <linux/percpu.h>
 
 #include <net/gso.h>
 #include <net/netlink.h>
@@ -29,6 +30,15 @@
 
 #define VERSION "1.3"
 
+/*
+ * Limit for recursion from duplication.
+ * Duplicated packets are re-enqueued at the root qdisc, which may
+ * reach this or another netem instance, causing nested calls to
+ * netem_enqueue(). This per-CPU counter limits the total depth.
+ */
+static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
+#define NETEM_RECURSION_LIMIT	4
+
 /*	Network Emulation Queuing algorithm.
 	====================================
 
@@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	/* Do not fool qdisc_drop_all() */
 	skb->prev = NULL;
 
+	/* Guard against recursion from duplication re-injection. */
+	if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
+		     NETEM_RECURSION_LIMIT)) {
+		this_cpu_dec(netem_enqueue_depth);
+		qdisc_drop(skb, sch, to_free);
+		return NET_XMIT_DROP;
+	}
+
 	/* Random duplication */
 	if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
 		++count;
@@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (count == 0) {
 		qdisc_qstats_drop(sch);
 		__qdisc_drop(skb, to_free);
+		this_cpu_dec(netem_enqueue_depth);
 		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	}
 
@@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		qdisc_drop_all(skb, sch, to_free);
 		if (skb2)
 			__qdisc_drop(skb2, to_free);
+		this_cpu_dec(netem_enqueue_depth);
 		return NET_XMIT_DROP;
 	}
 
@@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		/* Parent qdiscs accounted for 1 skb of size @prev_len */
 		qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
 	} else if (!skb) {
+		this_cpu_dec(netem_enqueue_depth);
 		return NET_XMIT_DROP;
 	}
+	this_cpu_dec(netem_enqueue_depth);
 	return NET_XMIT_SUCCESS;
 }
 
-- 
2.51.0



* [PATCH 04/12] net/sched: netem: restructure dequeue to avoid re-entrancy with child qdisc
       [not found] <20260313211646.12549-1-stephen@networkplumber.org>
  2026-03-13 21:15 ` [PATCH 02/12] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Stephen Hemminger
  2026-03-13 21:15 ` [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication Stephen Hemminger
@ 2026-03-13 21:15 ` Stephen Hemminger
  2 siblings, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2026-03-13 21:15 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, stable

netem_dequeue() currently enqueues time-ready packets into the child
qdisc during the dequeue call path. This creates several problems:

1. Parent qdiscs like HFSC track class active/inactive state based on
   qlen transitions. The child enqueue during netem's dequeue can cause
   qlen to increase while the parent is mid-dequeue, leading to
   double-insertion in HFSC's eltree (CVE-2025-37890, CVE-2025-38001).

2. If the child qdisc is non-work-conserving (e.g., TBF), it may refuse
   to release packets during its dequeue even though they were just
   enqueued. The parent then sees netem returning NULL despite having
   backlog, violating the work-conserving contract and causing stalls
   with parents like DRR that deactivate classes in this case.

Restructure netem_dequeue so that when a child qdisc is present, all
time-ready packets are transferred from the tfifo to the child in a
batch before asking the child for output. This ensures the child only
receives packets whose delay has already elapsed. The no-child path
(tfifo direct dequeue) is unchanged.

Fixes: 50612537e9ab ("netem: fix classful handling")
Cc: stable@vger.kernel.org
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 net/sched/sch_netem.c | 82 +++++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 26 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 085fa3ad6f83..08006a60849e 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -726,7 +726,6 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 	struct netem_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
 
-tfifo_dequeue:
 	skb = __qdisc_dequeue_head(&sch->q);
 	if (skb) {
 deliver:
@@ -734,24 +733,28 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 		qdisc_bstats_update(sch, skb);
 		return skb;
 	}
-	skb = netem_peek(q);
-	if (skb) {
-		u64 time_to_send;
+
+	/* If we have a child qdisc, transfer all time-ready packets
+	 * from the tfifo into the child, then dequeue from the child.
+	 * This avoids enqueueing into the child during the parent's
+	 * dequeue callback, which can confuse parents that track
+	 * active/inactive state based on qlen transitions (HFSC).
+	 */
+	if (q->qdisc) {
 		u64 now = ktime_get_ns();
 
-		/* if more time remaining? */
-		time_to_send = netem_skb_cb(skb)->time_to_send;
-		if (q->slot.slot_next && q->slot.slot_next < time_to_send)
-			get_slot_next(q, now);
+		while ((skb = netem_peek(q)) != NULL) {
+			u64 t = netem_skb_cb(skb)->time_to_send;
+
+			if (t > now)
+				break;
+			if (q->slot.slot_next && q->slot.slot_next > now)
+				break;
 
-		if (time_to_send <= now && q->slot.slot_next <= now) {
 			netem_erase_head(q, skb);
 			q->t_len--;
 			skb->next = NULL;
 			skb->prev = NULL;
-			/* skb->dev shares skb->rbnode area,
-			 * we need to restore its value.
-			 */
 			skb->dev = qdisc_dev(sch);
 
 			if (q->slot.slot_next) {
@@ -762,7 +765,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 					get_slot_next(q, now);
 			}
 
-			if (q->qdisc) {
+			{
 				unsigned int pkt_len = qdisc_pkt_len(skb);
 				struct sk_buff *to_free = NULL;
 				int err;
@@ -774,34 +777,61 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 						qdisc_qstats_drop(sch);
 					sch->qstats.backlog -= pkt_len;
 					sch->q.qlen--;
-					qdisc_tree_reduce_backlog(sch, 1, pkt_len);
+					qdisc_tree_reduce_backlog(sch,
+								  1, pkt_len);
 				}
-				goto tfifo_dequeue;
 			}
+		}
+
+		skb = q->qdisc->ops->dequeue(q->qdisc);
+		if (skb) {
 			sch->q.qlen--;
 			goto deliver;
 		}
-
-		if (q->qdisc) {
-			skb = q->qdisc->ops->dequeue(q->qdisc);
-			if (skb) {
+	} else {
+		/* No child qdisc: dequeue directly from tfifo */
+		skb = netem_peek(q);
+		if (skb) {
+			u64 time_to_send;
+			u64 now = ktime_get_ns();
+
+			time_to_send = netem_skb_cb(skb)->time_to_send;
+			if (q->slot.slot_next &&
+			    q->slot.slot_next < time_to_send)
+				get_slot_next(q, now);
+
+			if (time_to_send <= now &&
+			    q->slot.slot_next <= now) {
+				netem_erase_head(q, skb);
+				q->t_len--;
+				skb->next = NULL;
+				skb->prev = NULL;
+				skb->dev = qdisc_dev(sch);
+
+				if (q->slot.slot_next) {
+					q->slot.packets_left--;
+					q->slot.bytes_left -=
+						qdisc_pkt_len(skb);
+					if (q->slot.packets_left <= 0 ||
+					    q->slot.bytes_left <= 0)
+						get_slot_next(q, now);
+				}
 				sch->q.qlen--;
 				goto deliver;
 			}
 		}
+	}
+
+	/* Schedule watchdog for next time-ready packet */
+	skb = netem_peek(q);
+	if (skb) {
+		u64 time_to_send = netem_skb_cb(skb)->time_to_send;
 
 		qdisc_watchdog_schedule_ns(&q->watchdog,
 					   max(time_to_send,
 					       q->slot.slot_next));
 	}
 
-	if (q->qdisc) {
-		skb = q->qdisc->ops->dequeue(q->qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			goto deliver;
-		}
-	}
 	return NULL;
 }
 
-- 
2.51.0



* Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
  2026-03-13 21:15 ` [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication Stephen Hemminger
@ 2026-03-14 19:29   ` William Liu
  2026-03-15 16:06     ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: William Liu @ 2026-03-14 19:29 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, stable, Savino Dicanosa, Jamal Hadi Salim,
	Victor Nogueira

Looping in Jamal and Victor.

On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:

> Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> duplicates a packet, the clone is re-enqueued at the root qdisc. If
> the tree contains other netem instances, this can recurse without
> bound, causing soft lockups and OOM.
> 
> This approach was previously considered but rejected on the grounds
> that netem_dequeue calling enqueue on a child netem could bypass the
> depth check. That concern does not apply: the child netem's
> netem_enqueue() increments the same per-CPU counter, so the total
> nesting depth across all netem instances in the call chain is tracked
> correctly.

I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because the DoS still happened [2].

If I remember correctly, the issue is less the recursion depth and more being able to differentiate packets that were previously involved in duplication from those that were not.

> 
> A depth limit of 4 is generous for any legitimate configuration.
> 
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> Cc: stable@vger.kernel.org
> Reported-by: William Liu <will@willsroot.io>
> Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  net/sched/sch_netem.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 0ccf74a9cb82..085fa3ad6f83 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -21,6 +21,7 @@
>  #include <linux/rtnetlink.h>
>  #include <linux/reciprocal_div.h>
>  #include <linux/rbtree.h>
> +#include <linux/percpu.h>
> 
>  #include <net/gso.h>
>  #include <net/netlink.h>
> @@ -29,6 +30,15 @@
> 
>  #define VERSION "1.3"
> 
> +/*
> + * Limit for recursion from duplication.
> + * Duplicated packets are re-enqueued at the root qdisc, which may
> + * reach this or another netem instance, causing nested calls to
> + * netem_enqueue(). This per-CPU counter limits the total depth.
> + */
> +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> +#define NETEM_RECURSION_LIMIT	4
> +
>  /*	Network Emulation Queuing algorithm.
>  	====================================
> 
> @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  	/* Do not fool qdisc_drop_all() */
>  	skb->prev = NULL;
> 
> +	/* Guard against recursion from duplication re-injection. */
> +	if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> +		     NETEM_RECURSION_LIMIT)) {
> +		this_cpu_dec(netem_enqueue_depth);
> +		qdisc_drop(skb, sch, to_free);
> +		return NET_XMIT_DROP;
> +	}
> +
>  	/* Random duplication */
>  	if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
>  		++count;
> @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  	if (count == 0) {
>  		qdisc_qstats_drop(sch);
>  		__qdisc_drop(skb, to_free);
> +		this_cpu_dec(netem_enqueue_depth);
>  		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
>  	}
> 
> @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  		qdisc_drop_all(skb, sch, to_free);
>  		if (skb2)
>  			__qdisc_drop(skb2, to_free);
> +		this_cpu_dec(netem_enqueue_depth);
>  		return NET_XMIT_DROP;
>  	}
> 
> @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  		/* Parent qdiscs accounted for 1 skb of size @prev_len */
>  		qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
>  	} else if (!skb) {
> +		this_cpu_dec(netem_enqueue_depth);
>  		return NET_XMIT_DROP;
>  	}
> +	this_cpu_dec(netem_enqueue_depth);
>  	return NET_XMIT_SUCCESS;
>  }
> 
> --
> 2.51.0
> 
> 

What about the last suggestion for a robust fix from [3]?

Best,
Will

[1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
[2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
[3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/


* Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
  2026-03-14 19:29   ` William Liu
@ 2026-03-15 16:06     ` Stephen Hemminger
  2026-03-15 16:19       ` Jamal Hadi Salim
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2026-03-15 16:06 UTC (permalink / raw)
  To: William Liu
  Cc: netdev, stable, Savino Dicanosa, Jamal Hadi Salim,
	Victor Nogueira

On Sat, 14 Mar 2026 19:29:10 +0000
William Liu <will@willsroot.io> wrote:

> Looping in Jamal and Victor.
> 
> On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> > Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> > duplicates a packet, the clone is re-enqueued at the root qdisc. If
> > the tree contains other netem instances, this can recurse without
> > bound, causing soft lockups and OOM.
> > 
> > This approach was previously considered but rejected on the grounds
> > that netem_dequeue calling enqueue on a child netem could bypass the
> > depth check. That concern does not apply: the child netem's
> > netem_enqueue() increments the same per-CPU counter, so the total
> > nesting depth across all netem instances in the call chain is tracked
> > correctly.  
> 
> I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2].
> 
> If I remember correctly,  the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.
> 
> > 
> > A depth limit of 4 is generous for any legitimate configuration.
> > 
> > Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> > Cc: stable@vger.kernel.org
> > Reported-by: William Liu <will@willsroot.io>
> > Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > ---
> >  net/sched/sch_netem.c | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> > 
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index 0ccf74a9cb82..085fa3ad6f83 100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c
> > @@ -21,6 +21,7 @@
> >  #include <linux/rtnetlink.h>
> >  #include <linux/reciprocal_div.h>
> >  #include <linux/rbtree.h>
> > +#include <linux/percpu.h>
> > 
> >  #include <net/gso.h>
> >  #include <net/netlink.h>
> > @@ -29,6 +30,15 @@
> > 
> >  #define VERSION "1.3"
> > 
> > +/*
> > + * Limit for recursion from duplication.
> > + * Duplicated packets are re-enqueued at the root qdisc, which may
> > + * reach this or another netem instance, causing nested calls to
> > + * netem_enqueue(). This per-CPU counter limits the total depth.
> > + */
> > +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> > +#define NETEM_RECURSION_LIMIT	4
> > +
> >  /*	Network Emulation Queuing algorithm.
> >  	====================================
> > 
> > @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >  	/* Do not fool qdisc_drop_all() */
> >  	skb->prev = NULL;
> > 
> > +	/* Guard against recursion from duplication re-injection. */
> > +	if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> > +		     NETEM_RECURSION_LIMIT)) {
> > +		this_cpu_dec(netem_enqueue_depth);
> > +		qdisc_drop(skb, sch, to_free);
> > +		return NET_XMIT_DROP;
> > +	}
> > +
> >  	/* Random duplication */
> >  	if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> >  		++count;
> > @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >  	if (count == 0) {
> >  		qdisc_qstats_drop(sch);
> >  		__qdisc_drop(skb, to_free);
> > +		this_cpu_dec(netem_enqueue_depth);
> >  		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> >  	}
> > 
> > @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >  		qdisc_drop_all(skb, sch, to_free);
> >  		if (skb2)
> >  			__qdisc_drop(skb2, to_free);
> > +		this_cpu_dec(netem_enqueue_depth);
> >  		return NET_XMIT_DROP;
> >  	}
> > 
> > @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >  		/* Parent qdiscs accounted for 1 skb of size @prev_len */
> >  		qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> >  	} else if (!skb) {
> > +		this_cpu_dec(netem_enqueue_depth);
> >  		return NET_XMIT_DROP;
> >  	}
> > +	this_cpu_dec(netem_enqueue_depth);
> >  	return NET_XMIT_SUCCESS;
> >  }
> > 
> > --
> > 2.51.0
> > 
> >   
> 
> What about the last suggestion for a robust fix from [3]?
> 
> Best,
> Will
> 
> [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
> [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
> [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/

Thanks. This is a tricky corner, and not every solution works out.

You're right that the per-CPU guard alone doesn't cover the
dequeue-to-child-enqueue path you described. That's exactly why
the series has two patches working together:

Patch 03 (this patch) adds the per-CPU recursion guard, which
handles the direct enqueue recursion (the duplicated packet
re-enqueued at the root hits another netem_enqueue() in the same
call chain).

Patch 04 restructures netem_dequeue to eliminate the pump. The
old code had "goto tfifo_dequeue" which looped back after each
child enqueue, so packets the child duplicated back to root would
immediately get picked up by the same dequeue iteration. The new
code transfers all currently-ready packets from the tfifo to the
child in one batch, then does a single dequeue from the child and
returns. Packets that the child duplicates back to root land in
the tfifo but are not processed until the parent's next dequeue
call, which breaks the loop you diagrammed.

The original repro is:

  tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
  tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
  ping -f localhost

This is covered by tdc test f2a3 (nested netem config acceptance)
and test 7a07 (nested netem with duplication, traffic via scapy).
More tests will be added in the next version of the series.

Jamal's proposed change with skb ttl would also work but
it was rejected because it required adding ttl field to skb
and skb size is a performance critical. As Cong pointed out
adding a couple of bits for ttl makes it increase.
So considered the idea and decided against it.

Thanks.


* Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
  2026-03-15 16:06     ` Stephen Hemminger
@ 2026-03-15 16:19       ` Jamal Hadi Salim
  2026-03-15 17:18         ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Jamal Hadi Salim @ 2026-03-15 16:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: William Liu, netdev, stable, Savino Dicanosa, Victor Nogueira

On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Sat, 14 Mar 2026 19:29:10 +0000
> William Liu <will@willsroot.io> wrote:
>
> > Looping in Jamal and Victor.
> >
> > On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> >
> > > Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> > > duplicates a packet, the clone is re-enqueued at the root qdisc. If
> > > the tree contains other netem instances, this can recurse without
> > > bound, causing soft lockups and OOM.
> > >
> > > This approach was previously considered but rejected on the grounds
> > > that netem_dequeue calling enqueue on a child netem could bypass the
> > > depth check. That concern does not apply: the child netem's
> > > netem_enqueue() increments the same per-CPU counter, so the total
> > > nesting depth across all netem instances in the call chain is tracked
> > > correctly.
> >
> > I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2].
> >
> > If I remember correctly,  the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.
> >
> > >
> > > A depth limit of 4 is generous for any legitimate configuration.
> > >
> > > Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> > > Cc: stable@vger.kernel.org
> > > Reported-by: William Liu <will@willsroot.io>
> > > Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> > >
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > ---
> > >  net/sched/sch_netem.c | 22 ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > > index 0ccf74a9cb82..085fa3ad6f83 100644
> > > --- a/net/sched/sch_netem.c
> > > +++ b/net/sched/sch_netem.c
> > > @@ -21,6 +21,7 @@
> > >  #include <linux/rtnetlink.h>
> > >  #include <linux/reciprocal_div.h>
> > >  #include <linux/rbtree.h>
> > > +#include <linux/percpu.h>
> > >
> > >  #include <net/gso.h>
> > >  #include <net/netlink.h>
> > > @@ -29,6 +30,15 @@
> > >
> > >  #define VERSION "1.3"
> > >
> > > +/*
> > > + * Limit for recursion from duplication.
> > > + * Duplicated packets are re-enqueued at the root qdisc, which may
> > > + * reach this or another netem instance, causing nested calls to
> > > + * netem_enqueue(). This per-CPU counter limits the total depth.
> > > + */
> > > +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> > > +#define NETEM_RECURSION_LIMIT      4
> > > +
> > >  /* Network Emulation Queuing algorithm.
> > >     ====================================
> > >
> > > @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > >     /* Do not fool qdisc_drop_all() */
> > >     skb->prev = NULL;
> > >
> > > +   /* Guard against recursion from duplication re-injection. */
> > > +   if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> > > +                NETEM_RECURSION_LIMIT)) {
> > > +           this_cpu_dec(netem_enqueue_depth);
> > > +           qdisc_drop(skb, sch, to_free);
> > > +           return NET_XMIT_DROP;
> > > +   }
> > > +
> > >     /* Random duplication */
> > >     if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> > >             ++count;
> > > @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > >     if (count == 0) {
> > >             qdisc_qstats_drop(sch);
> > >             __qdisc_drop(skb, to_free);
> > > +           this_cpu_dec(netem_enqueue_depth);
> > >             return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > >     }
> > >
> > > @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > >             qdisc_drop_all(skb, sch, to_free);
> > >             if (skb2)
> > >                     __qdisc_drop(skb2, to_free);
> > > +           this_cpu_dec(netem_enqueue_depth);
> > >             return NET_XMIT_DROP;
> > >     }
> > >
> > > @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > >             /* Parent qdiscs accounted for 1 skb of size @prev_len */
> > >             qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> > >     } else if (!skb) {
> > > +           this_cpu_dec(netem_enqueue_depth);
> > >             return NET_XMIT_DROP;
> > >     }
> > > +   this_cpu_dec(netem_enqueue_depth);
> > >     return NET_XMIT_SUCCESS;
> > >  }
> > >
> > > --
> > > 2.51.0
> > >
> > >
> >
> > What about the last suggestion for a robust fix from [3]?
> >
> > Best,
> > Will
> >
> > [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
> > [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
> > [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/
>
> Thanks, this is a tight corner here, and not all solutions work out.
>
> You're right that the per-CPU guard alone doesn't cover the
> dequeue-to-child-enqueue path you described. That's exactly why
> the series has two patches working together:
>
> Patch 03 (this patch) adds the per-CPU recursion guard, which
> handles the direct enqueue recursion (the duplicated packet
> re-enqueued at the root hits another netem_enqueue() in the same
> call chain).
>
> Patch 04 restructures netem_dequeue to eliminate the pump. The
> old code had "goto tfifo_dequeue" which looped back after each
> child enqueue, so packets the child duplicated back to root would
> immediately get picked up by the same dequeue iteration. The new
> code transfers all currently-ready packets from the tfifo to the
> child in one batch, then does a single dequeue from the child and
> returns. Packets that the child duplicates back to root land in
> the tfifo but won't be processed until the next dequeue call from
> the parent — breaking the loop you diagrammed.
>
> The original repro is:
>
>   tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
>   tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
>   ping -f localhost
>
> This is covered by tdc test f2a3 (nested netem config acceptance)
> and test 7a07 (nested netem with duplication, traffic via scapy).
> More tests are added in the new version.
>
> Jamal's proposed change with skb ttl would also work but
> it was rejected because it required adding ttl field to skb
> and skb size is a performance critical. As Cong pointed out
> adding a couple of bits for ttl makes it increase.
> So considered the idea and decided against it.

It was not "rejected" - other than Cong making those claims (which I
responded to).
On the last posting I had some feedback from Willem but I dropped the
ball. And the variant I posted had issues which were not caught by the
AI review - it required human knowledge (it didn't consider the GRO
code path).
If you want to go down this path I am fine with it; I will just focus
on the mirred loop.
Also, tell your AI to Cc the stakeholders next time it posts, using
scripts/get_maintainer.pl (and not Cc stable) - then I will be fine
reviewing.
The commit log (or cover letter) would also help if you cited the
"documented examples" you said were broken.

cheers,
jamal


> Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
  2026-03-15 16:19       ` Jamal Hadi Salim
@ 2026-03-15 17:18         ` Stephen Hemminger
  2026-03-16 17:52           ` Jamal Hadi Salim
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2026-03-15 17:18 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: William Liu, netdev, stable, Savino Dicanosa, Victor Nogueira

On Sun, 15 Mar 2026 12:19:02 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Sat, 14 Mar 2026 19:29:10 +0000
> > William Liu <will@willsroot.io> wrote:
> >  
> > > Looping in Jamal and Victor.
> > >
> > > On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> > >  
> > > > Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> > > > duplicates a packet, the clone is re-enqueued at the root qdisc. If
> > > > the tree contains other netem instances, this can recurse without
> > > > bound, causing soft lockups and OOM.
> > > >
> > > > This approach was previously considered but rejected on the grounds
> > > > that netem_dequeue calling enqueue on a child netem could bypass the
> > > > depth check. That concern does not apply: the child netem's
> > > > netem_enqueue() increments the same per-CPU counter, so the total
> > > > nesting depth across all netem instances in the call chain is tracked
> > > > correctly.  
> > >
> > > I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2].
> > >
> > > If I remember correctly,  the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.
> > >  
> > > >
> > > > A depth limit of 4 is generous for any legitimate configuration.
> > > >
> > > > Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: William Liu <will@willsroot.io>
> > > > Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> > > >
> > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > > ---
> > > >  net/sched/sch_netem.c | 22 ++++++++++++++++++++++
> > > >  1 file changed, 22 insertions(+)
> > > >
> > > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > > > index 0ccf74a9cb82..085fa3ad6f83 100644
> > > > --- a/net/sched/sch_netem.c
> > > > +++ b/net/sched/sch_netem.c
> > > > @@ -21,6 +21,7 @@
> > > >  #include <linux/rtnetlink.h>
> > > >  #include <linux/reciprocal_div.h>
> > > >  #include <linux/rbtree.h>
> > > > +#include <linux/percpu.h>
> > > >
> > > >  #include <net/gso.h>
> > > >  #include <net/netlink.h>
> > > > @@ -29,6 +30,15 @@
> > > >
> > > >  #define VERSION "1.3"
> > > >
> > > > +/*
> > > > + * Limit for recursion from duplication.
> > > > + * Duplicated packets are re-enqueued at the root qdisc, which may
> > > > + * reach this or another netem instance, causing nested calls to
> > > > + * netem_enqueue(). This per-CPU counter limits the total depth.
> > > > + */
> > > > +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> > > > +#define NETEM_RECURSION_LIMIT      4
> > > > +
> > > >  /* Network Emulation Queuing algorithm.
> > > >     ====================================
> > > >
> > > > @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > >     /* Do not fool qdisc_drop_all() */
> > > >     skb->prev = NULL;
> > > >
> > > > +   /* Guard against recursion from duplication re-injection. */
> > > > +   if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> > > > +                NETEM_RECURSION_LIMIT)) {
> > > > +           this_cpu_dec(netem_enqueue_depth);
> > > > +           qdisc_drop(skb, sch, to_free);
> > > > +           return NET_XMIT_DROP;
> > > > +   }
> > > > +
> > > >     /* Random duplication */
> > > >     if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> > > >             ++count;
> > > > @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > >     if (count == 0) {
> > > >             qdisc_qstats_drop(sch);
> > > >             __qdisc_drop(skb, to_free);
> > > > +           this_cpu_dec(netem_enqueue_depth);
> > > >             return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > > >     }
> > > >
> > > > @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > >             qdisc_drop_all(skb, sch, to_free);
> > > >             if (skb2)
> > > >                     __qdisc_drop(skb2, to_free);
> > > > +           this_cpu_dec(netem_enqueue_depth);
> > > >             return NET_XMIT_DROP;
> > > >     }
> > > >
> > > > @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > >             /* Parent qdiscs accounted for 1 skb of size @prev_len */
> > > >             qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> > > >     } else if (!skb) {
> > > > +           this_cpu_dec(netem_enqueue_depth);
> > > >             return NET_XMIT_DROP;
> > > >     }
> > > > +   this_cpu_dec(netem_enqueue_depth);
> > > >     return NET_XMIT_SUCCESS;
> > > >  }
> > > >
> > > > --
> > > > 2.51.0
> > > >
> > > >  
> > >
> > > What about the last suggestion for a robust fix from [3]?
> > >
> > > Best,
> > > Will
> > >
> > > [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
> > > [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
> > > [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/  
> >
> > Thanks, this is a tight corner here, and not all solutions work out.
> >
> > You're right that the per-CPU guard alone doesn't cover the
> > dequeue-to-child-enqueue path you described. That's exactly why
> > the series has two patches working together:
> >
> > Patch 02 adds the per-CPU recursion guard, which handles the
> > direct enqueue recursion (rootq->enqueue duplicated packet hits
> > another netem_enqueue in the same call chain).
> >
> > Patch 04 restructures netem_dequeue to eliminate the pump. The
> > old code had "goto tfifo_dequeue" which looped back after each
> > child enqueue, so packets the child duplicated back to root would
> > immediately get picked up by the same dequeue iteration. The new
> > code transfers all currently-ready packets from the tfifo to the
> > child in one batch, then does a single dequeue from the child and
> > returns. Packets that the child duplicates back to root land in
> > the tfifo but won't be processed until the next dequeue call from
> > the parent — breaking the loop you diagrammed.
> >
> > The original repro is:
> >
> >   tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
> >   tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
> >   ping -f localhost
> >
> > This is covered by tdc test f2a3 (nested netem config acceptance)
> > and test 7a07 (nested netem with duplication, traffic via scapy).
> > More tests are added in the new version.
> >
> > Jamal's proposed change with skb ttl would also work, but
> > it was rejected because it required adding a ttl field to the skb,
> > and skb size is performance critical. As Cong pointed out,
> > adding a couple of bits for ttl makes it grow.
> > So I considered the idea and decided against it.
> 
> It was not "rejected" - other than Cong making those claims (which I
> responded to).
> Last posting I had some feedback from Willem but I dropped the ball.
> And the variant I posted had issues that were not caught by the AI
> review - they required human knowledge (it didn't consider the GRO
> code path).
> If you want to go this path, I am fine with it; I will just focus on
> the mirred loop.
> Also tell your AI to Cc the stakeholders via get_maintainers next
> time it posts (and not Cc stable) - then I will be fine reviewing.
> The commit log (or cover letter) would help if you cited the
> "documented examples" you said were broken.
> 
> cheers,
> jamal

The AI never does any posting, I do. It is used for review only.

You are correct that the original skb ttl size objection was a
"red herring"; the size only changed in a minimal config corner
case and alignment padding absorbs it anyway.

Looking deeper at the ttl approach, I noticed that sharing the
skb ttl across multiple subsystems could lead to new problems.
For example: netem with duplication + mirred redirect is a
legitimate combination. Netem increments ttl on the duplicate,
mirred increments it again on redirect — with only 2 bits (0-3),
a valid packet gets dropped after just two hops. Each subsystem
needs its own budget, which is what separate per-CPU counters
give you.

The per-CPU counter approach is simple and proven. Earlier versions
of mirred used the same pattern (tcf_mirred_nest_level). Why did
it get changed?

Regarding the broken documented examples: the netem wiki shows
HTB with netem leaves on different classes, and HFSC with netem
children. check_netem_in_tree() rejects both if any branch has
duplication enabled. I'll add specific citations in the next
version.
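For concreteness, the HTB layout in question looks roughly like this.
A sketch only: device, handles, rates, and netem parameters are
illustrative, not quoted from the wiki:

```shell
# HTB root with two classes, each carrying its own netem leaf.
# check_netem_in_tree() rejected layouts like this whenever any
# netem branch enabled duplication.
tc qdisc add dev eth0 root handle 1: htb default 12
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 60mbit
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 40mbit
tc qdisc add dev eth0 parent 1:11 handle 10: netem delay 50ms
tc qdisc add dev eth0 parent 1:12 handle 20: netem loss 1% duplicate 1%
```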

Lastly, my understanding of Linus's rules on regressions is
that a regression must not be introduced even by a change that
fixes a bug. The "No regressions" rule takes the highest priority
here.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
  2026-03-15 17:18         ` Stephen Hemminger
@ 2026-03-16 17:52           ` Jamal Hadi Salim
  0 siblings, 0 replies; 8+ messages in thread
From: Jamal Hadi Salim @ 2026-03-16 17:52 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: William Liu, netdev, stable, Savino Dicanosa, Victor Nogueira

On Sun, Mar 15, 2026 at 1:18 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Sun, 15 Mar 2026 12:19:02 -0400
> Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> > On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> > >
> > > On Sat, 14 Mar 2026 19:29:10 +0000
> > > William Liu <will@willsroot.io> wrote:
> > >
> > > > Looping in Jamal and Victor.
> > > >
> > > > On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> > > >
> > > > > Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> > > > > duplicates a packet, the clone is re-enqueued at the root qdisc. If
> > > > > the tree contains other netem instances, this can recurse without
> > > > > bound, causing soft lockups and OOM.
> > > > >
> > > > > This approach was previously considered but rejected on the grounds
> > > > > that netem_dequeue calling enqueue on a child netem could bypass the
> > > > > depth check. That concern does not apply: the child netem's
> > > > > netem_enqueue() increments the same per-CPU counter, so the total
> > > > > nesting depth across all netem instances in the call chain is tracked
> > > > > correctly.
> > > >
> > > > I'm assuming you are referring to [1] (and other relevant followup messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally but it failed during testing because DOS still happened [2].
> > > >
> > > > If I remember correctly,  the issue is less so the recursive depth but more so being able to differentiate between packets that are previously involved in duplication or not.
> > > >
> > > > >
> > > > > A depth limit of 4 is generous for any legitimate configuration.
> > > > >
> > > > > Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> > > > > Cc: stable@vger.kernel.org
> > > > > Reported-by: William Liu <will@willsroot.io>
> > > > > Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> > > > >
> > > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > > > ---
> > > > >  net/sched/sch_netem.c | 22 ++++++++++++++++++++++
> > > > >  1 file changed, 22 insertions(+)
> > > > >
> > > > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > > > > index 0ccf74a9cb82..085fa3ad6f83 100644
> > > > > --- a/net/sched/sch_netem.c
> > > > > +++ b/net/sched/sch_netem.c
> > > > > @@ -21,6 +21,7 @@
> > > > >  #include <linux/rtnetlink.h>
> > > > >  #include <linux/reciprocal_div.h>
> > > > >  #include <linux/rbtree.h>
> > > > > +#include <linux/percpu.h>
> > > > >
> > > > >  #include <net/gso.h>
> > > > >  #include <net/netlink.h>
> > > > > @@ -29,6 +30,15 @@
> > > > >
> > > > >  #define VERSION "1.3"
> > > > >
> > > > > +/*
> > > > > + * Limit for recursion from duplication.
> > > > > + * Duplicated packets are re-enqueued at the root qdisc, which may
> > > > > + * reach this or another netem instance, causing nested calls to
> > > > > + * netem_enqueue(). This per-CPU counter limits the total depth.
> > > > > + */
> > > > > +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> > > > > +#define NETEM_RECURSION_LIMIT      4
> > > > > +
> > > > >  /* Network Emulation Queuing algorithm.
> > > > >     ====================================
> > > > >
> > > > > @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > >     /* Do not fool qdisc_drop_all() */
> > > > >     skb->prev = NULL;
> > > > >
> > > > > +   /* Guard against recursion from duplication re-injection. */
> > > > > +   if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> > > > > +                NETEM_RECURSION_LIMIT)) {
> > > > > +           this_cpu_dec(netem_enqueue_depth);
> > > > > +           qdisc_drop(skb, sch, to_free);
> > > > > +           return NET_XMIT_DROP;
> > > > > +   }
> > > > > +
> > > > >     /* Random duplication */
> > > > >     if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> > > > >             ++count;
> > > > > @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > >     if (count == 0) {
> > > > >             qdisc_qstats_drop(sch);
> > > > >             __qdisc_drop(skb, to_free);
> > > > > +           this_cpu_dec(netem_enqueue_depth);
> > > > >             return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > > > >     }
> > > > >
> > > > > @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > >             qdisc_drop_all(skb, sch, to_free);
> > > > >             if (skb2)
> > > > >                     __qdisc_drop(skb2, to_free);
> > > > > +           this_cpu_dec(netem_enqueue_depth);
> > > > >             return NET_XMIT_DROP;
> > > > >     }
> > > > >
> > > > > @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > >             /* Parent qdiscs accounted for 1 skb of size @prev_len */
> > > > >             qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> > > > >     } else if (!skb) {
> > > > > +           this_cpu_dec(netem_enqueue_depth);
> > > > >             return NET_XMIT_DROP;
> > > > >     }
> > > > > +   this_cpu_dec(netem_enqueue_depth);
> > > > >     return NET_XMIT_SUCCESS;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.51.0
> > > > >
> > > > >
> > > >
> > > > What about the last suggestion for a robust fix from [3]?
> > > >
> > > > Best,
> > > > Will
> > > >
> > > > [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
> > > > [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
> > > > [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/
> > >
> > > Thanks, this is a tight corner here, and not all solutions work out.
> > >
> > > You're right that the per-CPU guard alone doesn't cover the
> > > dequeue-to-child-enqueue path you described. That's exactly why
> > > the series has two patches working together:
> > >
> > > Patch 02 adds the per-CPU recursion guard, which handles the
> > > direct enqueue recursion (rootq->enqueue duplicated packet hits
> > > another netem_enqueue in the same call chain).
> > >
> > > Patch 04 restructures netem_dequeue to eliminate the pump. The
> > > old code had "goto tfifo_dequeue" which looped back after each
> > > child enqueue, so packets the child duplicated back to root would
> > > immediately get picked up by the same dequeue iteration. The new
> > > code transfers all currently-ready packets from the tfifo to the
> > > child in one batch, then does a single dequeue from the child and
> > > returns. Packets that the child duplicates back to root land in
> > > the tfifo but won't be processed until the next dequeue call from
> > > the parent — breaking the loop you diagrammed.
> > >
> > > The original repro is:
> > >
> > >   tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
> > >   tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
> > >   ping -f localhost
> > >
> > > This is covered by tdc test f2a3 (nested netem config acceptance)
> > > and test 7a07 (nested netem with duplication, traffic via scapy).
> > > More tests are added in the new version.
> > >
> > > Jamal's proposed change with skb ttl would also work, but
> > > it was rejected because it required adding a ttl field to the skb,
> > > and skb size is performance critical. As Cong pointed out,
> > > adding a couple of bits for ttl makes it grow.
> > > So I considered the idea and decided against it.
> >
> > It was not "rejected" - other than Cong making those claims (which I
> > responded to).
> > Last posting I had some feedback from Willem but I dropped the ball.
> > And the variant I posted had issues that were not caught by the AI
> > review - they required human knowledge (it didn't consider the GRO
> > code path).
> > If you want to go this path, I am fine with it; I will just focus on
> > the mirred loop.
> > Also tell your AI to Cc the stakeholders via get_maintainers next
> > time it posts (and not Cc stable) - then I will be fine reviewing.
> > The commit log (or cover letter) would help if you cited the
> > "documented examples" you said were broken.
> >
> > cheers,
> > jamal
>
> The AI never does any posting, I do. It is used for review only.
>

Ok - just Cc the relevant people, please.

> You are correct that the original skb ttl size objection was a
> "red herring"; the size only changed in a minimal config corner
> case and alignment padding absorbs it anyway.
>

Right. Here's a before and after of pahole on the skb struct.

-------
--- skb-all-config-pahole-before-ttl 2026-03-16 03:40:08.900884717 -0400
+++ skb-all-config-pahole-after-ttl 2026-03-16 04:03:55.970069673 -0400
@@ -79,8 +79,8 @@
  __u8       slow_gro:1;           /*   132: 3  1 */
  __u8       csum_not_inet:1;      /*   132: 4  1 */
  __u8       unreadable:1;         /*   132: 5  1 */
+ __u8       ttl:2;                /*   132: 6  1 */

- /* XXX 2 bits hole, try to pack */
  /* XXX 1 byte hole, try to pack */

  __u16      tc_index;             /*   134     2 */
--------

> Looking deeper at the ttl approach, I noticed that sharing the
> skb ttl across multiple subsystems could lead to new problems.
> For example: netem with duplication + mirred redirect is a
> legitimate combination. Netem increments ttl on the duplicate,
> mirred increments it again on redirect — with only 2 bits (0-3),
> a valid packet gets dropped after just two hops. Each subsystem
> needs its own budget, which is what separate per-CPU counters
> give you.

Yes, this is true. The way I weigh it out is:
Does that config even make sense? It may be a niche case, but should
we introduce extra complexity just to serve this niche case (which
will still work albeit with some constraints)?
I will make some time and post the patches later today - if you think
strongly about going with the approach you took, I will drop the netem
patch.

> The per-CPU counter approach is simple and proven. Earlier versions
> of mirred used the same pattern (tcf_mirred_nest_level). Why did
> it get changed?
>

It's still per-CPU - things just got shifted around (and are more
clever IMO). See:
commit fe946a751d9b52b7c45ca34899723b314b79b249
Author: Eric Dumazet <edumazet@google.com>
The per-CPU counter is useful if your loop stays on the same CPU and
executes back-to-back; Eric's approach will catch all of that. It
fails in two scenarios:
1) If we queue the packet somewhere and then restart processing later.
The per-CPU state can't be maintained in such a case (for example, it
gets wiped out the moment we go egress->ingress and queue the packet
in the backlog, with packets later being pulled from the backlog).
2) If we have XPS/RPS, where the packet came in on one CPU but may
end up on a different CPU.

Also note, as Willem mentioned, we used to have 3 bits for this loop
counter (and I seem to be the guy who took them out).

> Regarding the broken documented examples: the netem wiki shows
> HTB with netem leaves on different classes, and HFSC with netem
> children. check_netem_in_tree() rejects both if any branch has
> duplication enabled. I'll add specific citations in the next
> version.

I don't see it on the netem wiki. Is it this?
https://wiki.linuxfoundation.org/networking/netem
In any case, I was not aware of this setup. A citation would help, and
it would have helped more if you had spoken up at the time.

>
> Lastly, my understanding of Linus's rules on regressions is
> that a regression must not be introduced even by a change that
> fixes a bug. The "No regressions" rule takes the highest priority
> here.
>

I don't believe this view should be taken as dogma. Let me try to make
my case with a gun analogy.
The large majority of the qdisc setups that come up in the bug
reports are based on illegal configurations. IOW, some hierarchies are
just nonsense - but are used to stage a setup which triggers a bug.
When tc, and the qdisc infra in particular, were being implemented,
the philosophy was old skule: "I am giving you the gun for hunting,
but if you want to shoot your big toe, I am not going to stop you".
The general idea is that the gun owner is not into shooting their big
toe; if they do, we'd tell them that was not the gun's intent ;-> The
problem now is there's profit in shooting the big toe (bounty hunting,
for example) - so "we" the gun manufacturers are held responsible
whenever a toe gets shot. I am not sure if that got my point across
;->
The problem is we keep adding hacks to address specific issues for
nonsense setups. You can make it work but it would be equivalent to
this urban legend: https://i.imgur.com/vEwVkrl.jpeg

To make it short, that was the spirit I used to make the call to
accept that patch.

cheers,
jamal

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-16 17:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20260313211646.12549-1-stephen@networkplumber.org>
2026-03-13 21:15 ` [PATCH 02/12] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Stephen Hemminger
2026-03-13 21:15 ` [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication Stephen Hemminger
2026-03-14 19:29   ` William Liu
2026-03-15 16:06     ` Stephen Hemminger
2026-03-15 16:19       ` Jamal Hadi Salim
2026-03-15 17:18         ` Stephen Hemminger
2026-03-16 17:52           ` Jamal Hadi Salim
2026-03-13 21:15 ` [PATCH 04/12] net/sched: netem: restructure dequeue to avoid re-entrancy with child qdisc Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox