From: Stephen Hemminger <stephen@networkplumber.org>
To: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: William Liu <will@willsroot.io>,
netdev@vger.kernel.org, stable@vger.kernel.org,
Savino Dicanosa <savy@syst3mfailure.io>,
Victor Nogueira <victor@mojatatu.com>
Subject: Re: [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication
Date: Sun, 15 Mar 2026 10:18:18 -0700 [thread overview]
Message-ID: <20260315101818.1382d6f6@phoenix.local> (raw)
In-Reply-To: <CAM0EoMnh7gYjEGVB0eqGfannC=i=R5YQZbBfZK1K+CqKJyOMOQ@mail.gmail.com>
On Sun, 15 Mar 2026 12:19:02 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Sun, Mar 15, 2026 at 12:06 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Sat, 14 Mar 2026 19:29:10 +0000
> > William Liu <will@willsroot.io> wrote:
> >
> > > Looping in Jamal and Victor.
> > >
> > > On Friday, March 13th, 2026 at 9:17 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> > >
> > > > Add a per-CPU recursion depth counter to netem_enqueue(). When netem
> > > > duplicates a packet, the clone is re-enqueued at the root qdisc. If
> > > > the tree contains other netem instances, this can recurse without
> > > > bound, causing soft lockups and OOM.
> > > >
> > > > This approach was previously considered but rejected on the grounds
> > > > that netem_dequeue calling enqueue on a child netem could bypass the
> > > > depth check. That concern does not apply: the child netem's
> > > > netem_enqueue() increments the same per-CPU counter, so the total
> > > > nesting depth across all netem instances in the call chain is tracked
> > > > correctly.
> > >
> > > I'm assuming you are referring to [1] (and other relevant follow-up messages), but has this setup been tested against the original repro? I think there was a similar draft fix originally, but it failed during testing because the DoS still happened [2].
> > >
> > > If I remember correctly, the issue is less the recursion depth and more being able to differentiate packets that were previously involved in duplication from those that were not.
> > >
> > > >
> > > > A depth limit of 4 is generous for any legitimate configuration.
> > > >
> > > > Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=220774
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: William Liu <will@willsroot.io>
> > > > Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> > > >
> > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > > ---
> > > > net/sched/sch_netem.c | 22 ++++++++++++++++++++++
> > > > 1 file changed, 22 insertions(+)
> > > >
> > > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > > > index 0ccf74a9cb82..085fa3ad6f83 100644
> > > > --- a/net/sched/sch_netem.c
> > > > +++ b/net/sched/sch_netem.c
> > > > @@ -21,6 +21,7 @@
> > > > #include <linux/rtnetlink.h>
> > > > #include <linux/reciprocal_div.h>
> > > > #include <linux/rbtree.h>
> > > > +#include <linux/percpu.h>
> > > >
> > > > #include <net/gso.h>
> > > > #include <net/netlink.h>
> > > > @@ -29,6 +30,15 @@
> > > >
> > > > #define VERSION "1.3"
> > > >
> > > > +/*
> > > > + * Limit for recursion from duplication.
> > > > + * Duplicated packets are re-enqueued at the root qdisc, which may
> > > > + * reach this or another netem instance, causing nested calls to
> > > > + * netem_enqueue(). This per-CPU counter limits the total depth.
> > > > + */
> > > > +static DEFINE_PER_CPU(unsigned int, netem_enqueue_depth);
> > > > +#define NETEM_RECURSION_LIMIT 4
> > > > +
> > > > /* Network Emulation Queuing algorithm.
> > > > ====================================
> > > >
> > > > @@ -460,6 +470,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > /* Do not fool qdisc_drop_all() */
> > > > skb->prev = NULL;
> > > >
> > > > + /* Guard against recursion from duplication re-injection. */
> > > > + if (unlikely(this_cpu_inc_return(netem_enqueue_depth) >
> > > > + NETEM_RECURSION_LIMIT)) {
> > > > + this_cpu_dec(netem_enqueue_depth);
> > > > + qdisc_drop(skb, sch, to_free);
> > > > + return NET_XMIT_DROP;
> > > > + }
> > > > +
> > > > /* Random duplication */
> > > > if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> > > > ++count;
> > > > @@ -474,6 +492,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > if (count == 0) {
> > > > qdisc_qstats_drop(sch);
> > > > __qdisc_drop(skb, to_free);
> > > > + this_cpu_dec(netem_enqueue_depth);
> > > > return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > > > }
> > > >
> > > > @@ -529,6 +548,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > qdisc_drop_all(skb, sch, to_free);
> > > > if (skb2)
> > > > __qdisc_drop(skb2, to_free);
> > > > + this_cpu_dec(netem_enqueue_depth);
> > > > return NET_XMIT_DROP;
> > > > }
> > > >
> > > > @@ -643,8 +663,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > > > /* Parent qdiscs accounted for 1 skb of size @prev_len */
> > > > qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> > > > } else if (!skb) {
> > > > + this_cpu_dec(netem_enqueue_depth);
> > > > return NET_XMIT_DROP;
> > > > }
> > > > + this_cpu_dec(netem_enqueue_depth);
> > > > return NET_XMIT_SUCCESS;
> > > > }
> > > >
> > > > --
> > > > 2.51.0
> > > >
> > > >
> > >
> > > What about the last suggestion for a robust fix from [3]?
> > >
> > > Best,
> > > Will
> > >
> > > [1] https://lore.kernel.org/netdev/DISZZlS5CdbUKITzkIyT3jki3inTWSMecT6FplNmkpYs9bJizbs0iwRbTGMrnqEXrL3-__IjOQxdULPdZwGdKFSXJ1DZYIj6xmWPBZxerdk=@willsroot.io/
> > > [2] https://lore.kernel.org/netdev/q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io/
> > > [3] https://lore.kernel.org/netdev/20260111163947.811248-6-jhs@mojatatu.com/
> >
> > Thanks - this is a tight corner, and not all solutions work out.
> >
> > You're right that the per-CPU guard alone doesn't cover the
> > dequeue-to-child-enqueue path you described. That's exactly why
> > the series has two patches working together:
> >
> > Patch 02 adds the per-CPU recursion guard, which handles the
> > direct enqueue recursion (rootq->enqueue duplicated packet hits
> > another netem_enqueue in the same call chain).
> >
> > Patch 04 restructures netem_dequeue to eliminate the pump. The
> > old code had "goto tfifo_dequeue" which looped back after each
> > child enqueue, so packets the child duplicated back to root would
> > immediately get picked up by the same dequeue iteration. The new
> > code transfers all currently-ready packets from the tfifo to the
> > child in one batch, then does a single dequeue from the child and
> > returns. Packets that the child duplicates back to root land in
> > the tfifo but won't be processed until the next dequeue call from
> > the parent — breaking the loop you diagrammed.
> >
> > The original repro is:
> >
> > tc qdisc add dev lo root handle 1: netem delay 1ms duplicate 100%
> > tc qdisc add dev lo parent 1:1 handle 2: netem delay 1ms duplicate 100%
> > ping -f localhost
> >
> > This is covered by tdc test f2a3 (nested netem config acceptance)
> > and test 7a07 (nested netem with duplication, traffic via scapy).
> > More tests are added in the new version.
> >
> > Jamal's proposed change with an skb ttl would also work, but
> > it was rejected because it required adding a ttl field to the
> > skb, and skb size is performance critical. As Cong pointed out,
> > adding even a couple of bits for the ttl makes it grow.
> > So the idea was considered and decided against.
>
> It was not "rejected" - other than Cong making those claims (which I
> responded to).
> In my last posting I had some feedback from Willem, but I dropped the
> ball. And the variant I posted had issues that were not caught by the
> AI review - catching them required human knowledge (it didn't consider
> the GRO code path).
> If you want to go down this path, I am fine with it; I will just
> focus on the mirred loop.
> Also, tell your AI to Cc the stakeholders next time it posts via
> get_maintainers (and not to Cc stable) - then I will be fine reviewing.
> The commit log (or cover letter) would help if it cited the
> "documented examples" you said were broken.
>
> cheers,
> jamal
The AI never does any posting; I do. It is used for review only.
You are correct that the original skb ttl size objection was a
"red herring": the size only changed in a minimal-config corner
case, and alignment padding absorbs it anyway.
Looking deeper at the ttl approach, I noticed that sharing the
skb ttl across multiple subsystems could lead to new problems.
For example: netem with duplication + mirred redirect is a
legitimate combination. Netem increments ttl on the duplicate,
mirred increments it again on redirect — with only 2 bits (0-3),
a valid packet gets dropped after just two hops. Each subsystem
needs its own budget, which is what separate per-CPU counters
give you.
The per-CPU counter approach is simple and proven. Earlier versions
of mirred used the same pattern (tcf_mirred_nest_level). Why did
it get changed?
Regarding the broken documented examples: the netem wiki shows
HTB with netem leaves on different classes, and HFSC with netem
children. check_netem_in_tree() rejects both if any branch has
duplication enabled. I'll add specific citations in the next
version.
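For example, the kind of setup the wiki documents, which check_netem_in_tree() now refuses because one branch duplicates (interface, rates and handles below are illustrative, not quoted from the wiki):

```shell
# HTB with netem leaves on different classes, one using duplication
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 10mbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 5mbit
tc qdisc add dev eth0 parent 1:10 handle 10: netem delay 5ms duplicate 1%
tc qdisc add dev eth0 parent 1:20 handle 20: netem delay 20ms
```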
Lastly, my understanding of Linus's rule on regressions is that a
change must not introduce a regression, even if it fixes a bug.
The "No regressions" rule is the highest priority here.
Thread overview: 8+ messages
[not found] <20260313211646.12549-1-stephen@networkplumber.org>
2026-03-13 21:15 ` [PATCH 02/12] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Stephen Hemminger
2026-03-13 21:15 ` [PATCH 03/12] net/sched: netem: add per-CPU recursion guard for duplication Stephen Hemminger
2026-03-14 19:29 ` William Liu
2026-03-15 16:06 ` Stephen Hemminger
2026-03-15 16:19 ` Jamal Hadi Salim
2026-03-15 17:18 ` Stephen Hemminger [this message]
2026-03-16 17:52 ` Jamal Hadi Salim
2026-03-13 21:15 ` [PATCH 04/12] net/sched: netem: restructure dequeue to avoid re-entrancy with child qdisc Stephen Hemminger