* [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops
@ 2025-11-26 19:52 Cong Wang
2025-11-26 19:52 ` [Patch net v5 1/9] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Cong Wang
` (8 more replies)
0 siblings, 9 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
This patchset fixes the infinite loops due to duplication in netem, the
real root cause of this problem is enqueuing to the root qdisc, which is
now changed to enqueuing to the same qdisc. This is more reasonable,
more predictable from users' perspective, less error-proone and more elegant.
Please see more details in patch 1/6 which contains two pages of detailed
explanation including why it is safe and better.
This reverts the offending commits from William which clearly broke
mq+netem use cases.
All the test cases pass with this patchset.
---
v5: Reverted the offending commits
Added a init_user_ns check (4/9)
Rebased to the latest -net branch
v4: Added a fix for qfq qdisc (2/6)
Updated 1/6 patch description
Added a patch to update the enqueue reentrant behaviour tests
v3: Fixed the root cause of enqueuing to root
Switched back to netem_skb_cb safely
Added two more test cases
v2: Fixed a typo
Improved tdc selftest to check sent bytes
Cong Wang (9):
Revert "net/sched: Restrict conditions for adding duplicating netems
to qdisc tree"
Revert "selftests/tc-testing: Add tests for restrictions on netem
duplication"
net_sched: Implement the right netem duplication behavior
net_sched: Prevent using netem duplication in non-initial user
namespace
net_sched: Check the return value of qfq_choose_next_agg()
selftests/tc-testing: Add a nested netem duplicate test
selftests/tc-testing: Add a test case for piro with netem duplicate
selftests/tc-testing: Add a test case for mq with netem duplicate
selftests/tc-testing: Update test cases with netem duplicate
net/sched/sch_netem.c | 72 ++++-------
net/sched/sch_qfq.c | 2 +
.../tc-testing/tc-tests/infra/qdiscs.json | 115 +++++++++++++-----
.../tc-testing/tc-tests/qdiscs/netem.json | 90 +++-----------
4 files changed, 126 insertions(+), 153 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Patch net v5 1/9] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree"
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 19:52 ` [Patch net v5 2/9] Revert "selftests/tc-testing: Add tests for restrictions on netem duplication" Cong Wang
` (7 subsequent siblings)
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang, Ji-Soo Chung, Gerlinde
From: Cong Wang <cwang@multikernel.io>
This reverts commit ec8e0e3d7adef940cdf9475e2352c0680189d14e since it
breaks mq+netem.
Reported-by: Ji-Soo Chung <jschung2@proton.me>
Reported-by: Gerlinde <lrGerlinde@mailfence.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220774
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
net/sched/sch_netem.c | 40 ----------------------------------------
1 file changed, 40 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index eafc316ae319..fdd79d3ccd8c 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -973,41 +973,6 @@ static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
return 0;
}
-static const struct Qdisc_class_ops netem_class_ops;
-
-static int check_netem_in_tree(struct Qdisc *sch, bool duplicates,
- struct netlink_ext_ack *extack)
-{
- struct Qdisc *root, *q;
- unsigned int i;
-
- root = qdisc_root_sleeping(sch);
-
- if (sch != root && root->ops->cl_ops == &netem_class_ops) {
- if (duplicates ||
- ((struct netem_sched_data *)qdisc_priv(root))->duplicate)
- goto err;
- }
-
- if (!qdisc_dev(root))
- return 0;
-
- hash_for_each(qdisc_dev(root)->qdisc_hash, i, q, hash) {
- if (sch != q && q->ops->cl_ops == &netem_class_ops) {
- if (duplicates ||
- ((struct netem_sched_data *)qdisc_priv(q))->duplicate)
- goto err;
- }
- }
-
- return 0;
-
-err:
- NL_SET_ERR_MSG(extack,
- "netem: cannot mix duplicating netems with other netems in tree");
- return -EINVAL;
-}
-
/* Parse netlink message to set options */
static int netem_change(struct Qdisc *sch, struct nlattr *opt,
struct netlink_ext_ack *extack)
@@ -1066,11 +1031,6 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt,
q->gap = qopt->gap;
q->counter = 0;
q->loss = qopt->loss;
-
- ret = check_netem_in_tree(sch, qopt->duplicate, extack);
- if (ret)
- goto unlock;
-
q->duplicate = qopt->duplicate;
/* for compatibility with earlier versions.
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 2/9] Revert "selftests/tc-testing: Add tests for restrictions on netem duplication"
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-11-26 19:52 ` [Patch net v5 1/9] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 19:52 ` [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior Cong Wang
` (6 subsequent siblings)
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
From: Cong Wang <cwang@multikernel.io>
This reverts commit ecdec65ec78d67d3ebd17edc88b88312054abe0d.
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 5 +-
.../tc-testing/tc-tests/qdiscs/netem.json | 81 -------------------
2 files changed, 3 insertions(+), 83 deletions(-)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index 0091bcd91c2c..b5f367fcc150 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -702,6 +702,7 @@
"$TC qdisc add dev $DUMMY parent 1:1 handle 2:0 netem duplicate 100%",
"$TC filter add dev $DUMMY parent 1:0 protocol ip prio 1 u32 match ip dst 10.10.10.1/32 flowid 1:1",
"$TC class add dev $DUMMY parent 1:0 classid 1:2 hfsc ls m2 10Mbit",
+ "$TC qdisc add dev $DUMMY parent 1:2 handle 3:0 netem duplicate 100%",
"$TC filter add dev $DUMMY parent 1:0 protocol ip prio 2 u32 match ip dst 10.10.10.2/32 flowid 1:2",
"ping -c 1 10.10.10.1 -I$DUMMY > /dev/null || true",
"$TC filter del dev $DUMMY parent 1:0 protocol ip prio 1",
@@ -714,8 +715,8 @@
{
"kind": "hfsc",
"handle": "1:",
- "bytes": 294,
- "packets": 3
+ "bytes": 392,
+ "packets": 4
}
],
"matchCount": "1",
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
index 718d2df2aafa..3c4444961488 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
@@ -336,86 +336,5 @@
"teardown": [
"$TC qdisc del dev $DUMMY handle 1: root"
]
- },
- {
- "id": "d34d",
- "name": "NETEM test qdisc duplication restriction in qdisc tree in netem_change root",
- "category": ["qdisc", "netem"],
- "plugins": {
- "requires": "nsPlugin"
- },
- "setup": [
- "$TC qdisc add dev $DUMMY root handle 1: netem limit 1",
- "$TC qdisc add dev $DUMMY parent 1: handle 2: netem limit 1"
- ],
- "cmdUnderTest": "$TC qdisc change dev $DUMMY handle 1: netem duplicate 50%",
- "expExitCode": "2",
- "verifyCmd": "$TC -s qdisc show dev $DUMMY",
- "matchPattern": "qdisc netem",
- "matchCount": "2",
- "teardown": [
- "$TC qdisc del dev $DUMMY handle 1:0 root"
- ]
- },
- {
- "id": "b33f",
- "name": "NETEM test qdisc duplication restriction in qdisc tree in netem_change non-root",
- "category": ["qdisc", "netem"],
- "plugins": {
- "requires": "nsPlugin"
- },
- "setup": [
- "$TC qdisc add dev $DUMMY root handle 1: netem limit 1",
- "$TC qdisc add dev $DUMMY parent 1: handle 2: netem limit 1"
- ],
- "cmdUnderTest": "$TC qdisc change dev $DUMMY handle 2: netem duplicate 50%",
- "expExitCode": "2",
- "verifyCmd": "$TC -s qdisc show dev $DUMMY",
- "matchPattern": "qdisc netem",
- "matchCount": "2",
- "teardown": [
- "$TC qdisc del dev $DUMMY handle 1:0 root"
- ]
- },
- {
- "id": "cafe",
- "name": "NETEM test qdisc duplication restriction in qdisc tree",
- "category": ["qdisc", "netem"],
- "plugins": {
- "requires": "nsPlugin"
- },
- "setup": [
- "$TC qdisc add dev $DUMMY root handle 1: netem limit 1 duplicate 100%"
- ],
- "cmdUnderTest": "$TC qdisc add dev $DUMMY parent 1: handle 2: netem duplicate 100%",
- "expExitCode": "2",
- "verifyCmd": "$TC -s qdisc show dev $DUMMY",
- "matchPattern": "qdisc netem",
- "matchCount": "1",
- "teardown": [
- "$TC qdisc del dev $DUMMY handle 1:0 root"
- ]
- },
- {
- "id": "1337",
- "name": "NETEM test qdisc duplication restriction in qdisc tree across branches",
- "category": ["qdisc", "netem"],
- "plugins": {
- "requires": "nsPlugin"
- },
- "setup": [
- "$TC qdisc add dev $DUMMY parent root handle 1:0 hfsc",
- "$TC class add dev $DUMMY parent 1:0 classid 1:1 hfsc rt m2 10Mbit",
- "$TC qdisc add dev $DUMMY parent 1:1 handle 2:0 netem",
- "$TC class add dev $DUMMY parent 1:0 classid 1:2 hfsc rt m2 10Mbit"
- ],
- "cmdUnderTest": "$TC qdisc add dev $DUMMY parent 1:2 handle 3:0 netem duplicate 100%",
- "expExitCode": "2",
- "verifyCmd": "$TC -s qdisc show dev $DUMMY",
- "matchPattern": "qdisc netem",
- "matchCount": "1",
- "teardown": [
- "$TC qdisc del dev $DUMMY handle 1:0 root"
- ]
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-11-26 19:52 ` [Patch net v5 1/9] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Cong Wang
2025-11-26 19:52 ` [Patch net v5 2/9] Revert "selftests/tc-testing: Add tests for restrictions on netem duplication" Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 20:30 ` William Liu
2025-12-03 15:05 ` Stephen Hemminger
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
` (5 subsequent siblings)
8 siblings, 2 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang, William Liu, Savino Dicanosa
In the old behavior, duplicated packets were sent back to the root qdisc,
which could create dangerous infinite loops in hierarchical setups -
imagine a scenario where each level of a multi-stage netem hierarchy kept
feeding duplicates back to the top, potentially causing system instability
or resource exhaustion.
The new behavior elegantly solves this by enqueueing duplicates to the same
qdisc that created them, ensuring that packet duplication occurs exactly
once per netem stage in a controlled, predictable manner. This change
enables users to safely construct complex network emulation scenarios using
netem hierarchies (like the 4x multiplication demonstrated in testing)
without worrying about runaway packet generation, while still preserving
the intended duplication effects.
Another advantage of this approach is that it eliminates the enqueue reentrant
behaviour which triggered many vulnerabilities. See the last patch in this
patchset which updates the test cases for such vulnerabilities.
Now users can confidently chain multiple netem qdiscs together to achieve
sophisticated network impairment combinations, knowing that each stage will
apply its effects exactly once to the packet flow, making network testing
scenarios more reliable and results more deterministic.
I tested netem packet duplication in two configurations:
1. Nest netem-to-netem hierarchy using parent/child attachment
2. Single netem using prio qdisc with netem leaf
Setup commands and results:
Single netem hierarchy (prio + netem):
tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
Result: 2x packet multiplication (1→2 packets)
2 echo requests + 4 echo replies = 6 total packets
Expected behavior: Only one netem stage exists in this hierarchy, so
1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
2 echo replies, which also get duplicated to 4 replies, yielding the
predictable total of 6 packets (2 requests + 4 replies).
Nest netem hierarchy (netem + netem):
tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
Result: 4x packet multiplication (1→2→4 packets)
4 echo requests + 16 echo replies = 20 total packets
Expected behavior: Root netem duplicates 1 ping to 2 packets, child netem
receives 2 packets and duplicates each to create 4 total packets. Since
ping operates bidirectionally, 4 echo requests generate 4 echo replies,
which also get duplicated through the same hierarchy (4→8→16), resulting
in the predictable total of 20 packets (4 requests + 16 replies).
The new netem duplication behavior does not break the documented
semantics of "creates a copy of the packet before queuing." The man page
description remains true since duplication occurs before the queuing
process, creating both original and duplicate packets that are then
enqueued. The documentation does not specify which qdisc should receive
the duplicates, only that copying happens before queuing. The implementation
choice to enqueue duplicates to the same qdisc (rather than root) is an
internal detail that maintains the documented behavior while preventing
infinite loops in hierarchical configurations.
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
net/sched/sch_netem.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fdd79d3ccd8c..191f64bd68ff 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -165,6 +165,7 @@ struct netem_sched_data {
*/
struct netem_skb_cb {
u64 time_to_send;
+ u8 duplicate : 1;
};
static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -460,8 +461,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
skb->prev = NULL;
/* Random duplication */
- if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
- ++count;
+ if (q->duplicate) {
+ bool dup = true;
+
+ if (netem_skb_cb(skb)->duplicate) {
+ netem_skb_cb(skb)->duplicate = 0;
+ dup = false;
+ }
+ if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
+ ++count;
+ }
/* Drop packet? */
if (loss_event(q)) {
@@ -532,17 +541,12 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
}
/*
- * If doing duplication then re-insert at top of the
- * qdisc tree, since parent queuer expects that only one
- * skb will be queued.
+ * If doing duplication then re-insert at the same qdisc,
+ * as going back to the root would induce loops.
*/
if (skb2) {
- struct Qdisc *rootq = qdisc_root_bh(sch);
- u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
-
- q->duplicate = 0;
- rootq->enqueue(skb2, rootq, to_free);
- q->duplicate = dupsave;
+ netem_skb_cb(skb2)->duplicate = 1;
+ qdisc_enqueue(skb2, sch, to_free);
skb2 = NULL;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (2 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-12-02 0:25 ` Jakub Kicinski
` (2 more replies)
2025-11-26 19:52 ` [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg() Cong Wang
` (4 subsequent siblings)
8 siblings, 3 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
From: Cong Wang <cwang@multikernel.io>
The netem qdisc has a known security issue with packet duplication
that makes it unsafe to use in unprivileged contexts. While netem
typically requires CAP_NET_ADMIN to load, users with "root" privileges
inside a user namespace also have CAP_NET_ADMIN within that namespace,
allowing them to potentially exploit this feature.
To address this, we need to restrict the netem duplication to only the
initial user namespace.
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
net/sched/sch_netem.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 191f64bd68ff..f87b862c769a 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -991,6 +991,12 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt,
int ret;
qopt = nla_data(opt);
+
+ if (dev_net(qdisc_dev(sch))->user_ns != &init_user_ns && qopt->duplicate) {
+ NL_SET_ERR_MSG(extack, "Duplication is not allowed in unprivileged namespaces");
+ return -EINVAL;
+ }
+
ret = parse_attr(tb, TCA_NETEM_MAX, opt, netem_policy, sizeof(*qopt));
if (ret < 0)
return ret;
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg()
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (3 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-12-02 9:20 ` Paolo Abeni
2025-12-02 21:18 ` Xiang Mei
2025-11-26 19:52 ` [Patch net v5 6/9] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
` (3 subsequent siblings)
8 siblings, 2 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang, Xiang Mei
qfq_choose_next_agg() could return NULL so its return value should be
properly checked unless NULL is acceptable.
There are two cases we need to deal with:
1) q->in_serv_agg, which is okay with NULL since it is either checked or
just compared with other pointer without dereferencing. In fact, it
is even intentionally set to NULL in one of the cases.
2) in_serv_agg, which is a temporary local variable, which is not okay
with NULL, since it is dereferenced immediately, hence must be checked.
This fix corrects one of the 2nd cases, and leaving the 1st case as they are.
Although this bug is triggered with the netem duplicate change, the root
cause is still within qfq qdisc.
Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Cc: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
net/sched/sch_qfq.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 2255355e51d3..a438ebaf53e0 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1145,6 +1145,8 @@ static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
* choose the new aggregate to serve.
*/
in_serv_agg = q->in_serv_agg = qfq_choose_next_agg(q);
+ if (!in_serv_agg)
+ return NULL;
skb = qfq_peek_skb(in_serv_agg, &cl, &len);
}
if (!skb)
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 6/9] selftests/tc-testing: Add a nested netem duplicate test
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (4 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg() Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 19:52 ` [Patch net v5 7/9] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
` (2 subsequent siblings)
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang, William Liu
Integrate the reproducer from William into tc-testing and use scapy
to generate packets for testing:
# ./tdc.py -e 1676
-- ns/SubPlugin.__init__
-- scapy/SubPlugin.__init__
Test 1676: NETEM nested duplicate 100%
[ 1154.071135] v0p0id1676: entered promiscuous mode
[ 1154.101066] virtio_net virtio0 enp1s0: entered promiscuous mode
[ 1154.146301] virtio_net virtio0 enp1s0: left promiscuous mode
.
Sent 1 packets.
[ 1154.173695] v0p0id1676: left promiscuous mode
[ 1154.216159] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.238398] v0p0id1676: left promiscuous mode
[ 1154.260909] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.282708] v0p0id1676: left promiscuous mode
[ 1154.309399] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.337949] v0p0id1676: left promiscuous mode
[ 1154.360924] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.383522] v0p0id1676: left promiscuous mode
All test results:
1..1
ok 1 1676 - NETEM nested duplicate 100%
Reported-by: William Liu <will@willsroot.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/qdiscs/netem.json | 25 +++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
index 3c4444961488..03c4ceb22990 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
@@ -336,5 +336,30 @@
"teardown": [
"$TC qdisc del dev $DUMMY handle 1: root"
]
+ },
+ {
+ "id": "1676",
+ "name": "NETEM nested duplicate 100%",
+ "category": ["qdisc", "netem"],
+ "setup": [
+ "$TC qdisc add dev $DEV1 root handle 1: netem limit 1000 duplicate 100%",
+ "$TC qdisc add dev $DEV1 parent 1: handle 2: netem limit 1000 duplicate 100%"
+ ],
+ "teardown": [
+ "$TC qdisc del dev $DEV1 root"
+ ],
+ "plugins": {
+ "requires": ["nsPlugin", "scapyPlugin"]
+ },
+ "scapy": {
+ "iface": "$DEV0",
+ "count": 5,
+ "packet": "Ether()/IP(dst='10.10.10.1', src='10.10.10.10')/TCP(sport=12345, dport=80)"
+ },
+ "cmdUnderTest": "$TC -s qdisc show dev $DEV1",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -s qdisc show dev $DEV1",
+ "matchPattern": "Sent [0-9]+ bytes [0-9]+ pkt",
+ "matchCount": "2"
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 7/9] selftests/tc-testing: Add a test case for piro with netem duplicate
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (5 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 6/9] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 19:52 ` [Patch net v5 8/9] selftests/tc-testing: Add a test case for mq " Cong Wang
2025-11-26 19:52 ` [Patch net v5 9/9] selftests/tc-testing: Update test cases " Cong Wang
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
Integrate the test case from Jamal into tc-testing:
Test 94a7: Test PRIO with NETEM duplication
All test results:
1..1
ok 1 94a7 - Test PRIO with NETEM duplication
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index b5f367fcc150..1f342173e8fe 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -961,6 +961,35 @@
"matchJSON": [],
"teardown": [
"$TC qdisc del dev $DUMMY root"
+ ]
+ },
+ {
+ "id": "94a7",
+ "name": "Test PRIO with NETEM duplication",
+ "category": [
+ "qdisc",
+ "prio",
+ "netem"
+ ],
+ "plugins": {
+ "requires": [
+ "nsPlugin"
+ ]
+ },
+ "setup": [
+ "$IP link set dev $DUMMY up || true",
+ "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+ "$TC qdisc add dev $DUMMY root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
+ "$TC filter add dev $DUMMY parent 1:0 protocol ip matchall classid 1:1",
+ "$TC qdisc add dev $DUMMY parent 1:1 handle 10: netem limit 4 duplicate 100%"
+ ],
+ "cmdUnderTest": "ping -c1 -W0.01 -I $DUMMY 10.10.11.11",
+ "expExitCode": "1",
+ "verifyCmd": "$TC -s qdisc show dev $DUMMY | grep -A 5 'qdisc netem' | grep -E 'Sent [0-9]+ bytes [0-9]+ pkt'",
+ "matchPattern": "Sent \\d+ bytes (\\d+) pkt",
+ "matchCount": "1",
+ "teardown": [
+ "$TC qdisc del dev $DUMMY root handle 1: prio"
]
},
{
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 8/9] selftests/tc-testing: Add a test case for mq with netem duplicate
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (6 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 7/9] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
2025-11-26 19:52 ` [Patch net v5 9/9] selftests/tc-testing: Update test cases " Cong Wang
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
Given that multi-queue NICs are prevalent and the global spinlock issue with
single netem instances is a known performance limitation, the setup using
mq as a parent for netem is an excellent and highly reasonable pattern for
applying netem effects like 100% duplication efficiently on modern Linux
systems.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 31 +++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index 1f342173e8fe..ce8c9c14dabb 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -1035,5 +1035,36 @@
"teardown": [
"$TC qdisc del dev $DUMMY clsact"
]
+ },
+ {
+ "id": "94a8",
+ "name": "Test MQ with NETEM duplication",
+ "category": [
+ "qdisc",
+ "mq",
+ "netem"
+ ],
+ "plugins": {
+ "requires": ["nsPlugin", "scapyPlugin"]
+ },
+ "setup": [
+ "$IP link set dev $DEV1 up",
+ "$TC qdisc add dev $DEV1 root handle 1: mq",
+ "$TC qdisc add dev $DEV1 parent 1:1 handle 10: netem duplicate 100%",
+ "$TC qdisc add dev $DEV1 parent 1:2 handle 20: netem duplicate 100%"
+ ],
+ "scapy": {
+ "iface": "$DEV0",
+ "count": 5,
+ "packet": "Ether()/IP(dst='10.10.10.1', src='10.10.10.10')/ICMP()"
+ },
+ "cmdUnderTest": "$TC -s qdisc show dev $DEV1",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -s qdisc show dev $DEV1 | grep -A 5 'qdisc netem' | grep -E 'Sent [0-9]+ bytes [0-9]+ pkt'",
+ "matchPattern": "Sent \\d+ bytes (\\d+) pkt",
+ "matchCount": "2",
+ "teardown": [
+ "$TC qdisc del dev $DEV1 root handle 1: mq"
+ ]
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Patch net v5 9/9] selftests/tc-testing: Update test cases with netem duplicate
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (7 preceding siblings ...)
2025-11-26 19:52 ` [Patch net v5 8/9] selftests/tc-testing: Add a test case for mq " Cong Wang
@ 2025-11-26 19:52 ` Cong Wang
8 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-26 19:52 UTC (permalink / raw)
To: netdev; +Cc: stephen, kuba, Cong Wang
Now netem does no longer trigger reentrant behaviour of its upper
qdiscs, the whole architecture becomes more solid and less error prone.
Keep these test cases since one of them still sucessfully caught a bug
in QFQ qdisc, but update them to the new netem enqueue behavior.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 54 +++++++++----------
1 file changed, 25 insertions(+), 29 deletions(-)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index ce8c9c14dabb..c60ff51da810 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -579,7 +579,7 @@
},
{
"id": "90ec",
- "name": "Test DRR's enqueue reentrant behaviour with netem",
+ "name": "Test DRR with NETEM duplication",
"category": [
"qdisc",
"drr"
@@ -597,11 +597,11 @@
],
"cmdUnderTest": "ping -c 1 -I $DUMMY 10.10.10.1 > /dev/null || true",
"expExitCode": "0",
- "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 1:0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 2:0",
"matchJSON": [
{
- "kind": "drr",
- "handle": "1:",
+ "kind": "netem",
+ "handle": "2:",
"bytes": 196,
"packets": 2
}
@@ -614,7 +614,7 @@
},
{
"id": "1f1f",
- "name": "Test ETS's enqueue reentrant behaviour with netem",
+ "name": "Test ETS with NETEM duplication",
"category": [
"qdisc",
"ets"
@@ -632,15 +632,13 @@
],
"cmdUnderTest": "ping -c 1 -I $DUMMY 10.10.10.1 > /dev/null || true",
"expExitCode": "0",
- "verifyCmd": "$TC -j -s class show dev $DUMMY",
+ "verifyCmd": "$TC -j -s qdisc show dev $DUMMY handle 2:0",
"matchJSON": [
{
- "class": "ets",
- "handle": "1:1",
- "stats": {
- "bytes": 196,
- "packets": 2
- }
+ "kind": "netem",
+ "handle": "2:",
+ "bytes": 196,
+ "packets": 2
}
],
"matchCount": "1",
@@ -651,7 +649,7 @@
},
{
"id": "5e6d",
- "name": "Test QFQ's enqueue reentrant behaviour with netem",
+ "name": "Test QFQ with NETEM duplication",
"category": [
"qdisc",
"qfq"
@@ -669,11 +667,11 @@
],
"cmdUnderTest": "ping -c 1 -I $DUMMY 10.10.10.1 > /dev/null || true",
"expExitCode": "0",
- "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 1:0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 2:0",
"matchJSON": [
{
- "kind": "qfq",
- "handle": "1:",
+ "kind": "netem",
+ "handle": "2:",
"bytes": 196,
"packets": 2
}
@@ -686,7 +684,7 @@
},
{
"id": "bf1d",
- "name": "Test HFSC's enqueue reentrant behaviour with netem",
+ "name": "Test HFSC with NETEM duplication",
"category": [
"qdisc",
"hfsc"
@@ -710,13 +708,11 @@
],
"cmdUnderTest": "ping -c 1 10.10.10.2 -I$DUMMY > /dev/null || true",
"expExitCode": "0",
- "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 1:0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 3:0",
"matchJSON": [
{
- "kind": "hfsc",
- "handle": "1:",
- "bytes": 392,
- "packets": 4
+ "kind": "netem",
+ "handle": "3:"
}
],
"matchCount": "1",
@@ -727,7 +723,7 @@
},
{
"id": "7c3b",
- "name": "Test nested DRR's enqueue reentrant behaviour with netem",
+ "name": "Test nested DRR with NETEM duplication",
"category": [
"qdisc",
"drr"
@@ -748,13 +744,13 @@
],
"cmdUnderTest": "ping -c 1 -I $DUMMY 10.10.10.1 > /dev/null || true",
"expExitCode": "0",
- "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 1:0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY handle 3:0",
"matchJSON": [
{
- "kind": "drr",
- "handle": "1:",
- "bytes": 196,
- "packets": 2
+ "kind": "netem",
+ "handle": "3:",
+ "bytes": 98,
+ "packets": 1
}
],
"matchCount": "1",
@@ -827,7 +823,7 @@
},
{
"id": "309e",
- "name": "Test HFSC eltree double add with reentrant enqueue behaviour on netem",
+ "name": "Test complex HFSC with NETEM duplication",
"category": [
"qdisc",
"hfsc"
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 19:52 ` [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior Cong Wang
@ 2025-11-26 20:30 ` William Liu
2025-11-26 22:08 ` Cong Wang
2025-12-03 15:05 ` Stephen Hemminger
1 sibling, 1 reply; 24+ messages in thread
From: William Liu @ 2025-11-26 20:30 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
Jamal should be added to this - he was the one brainstorming fixes for this bug from the very beginning for a whole month [1].
On Wednesday, November 26th, 2025 at 7:53 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>
> In the old behavior, duplicated packets were sent back to the root qdisc,
> which could create dangerous infinite loops in hierarchical setups -
> imagine a scenario where each level of a multi-stage netem hierarchy kept
> feeding duplicates back to the top, potentially causing system instability
> or resource exhaustion.
>
> The new behavior elegantly solves this by enqueueing duplicates to the same
> qdisc that created them, ensuring that packet duplication occurs exactly
> once per netem stage in a controlled, predictable manner. This change
> enables users to safely construct complex network emulation scenarios using
> netem hierarchies (like the 4x multiplication demonstrated in testing)
> without worrying about runaway packet generation, while still preserving
> the intended duplication effects.
>
> Another advantage of this approach is that it eliminates the enqueue reentrant
> behaviour which triggered many vulnerabilities. See the last patch in this
> patchset which updates the test cases for such vulnerabilities.
>
> Now users can confidently chain multiple netem qdiscs together to achieve
> sophisticated network impairment combinations, knowing that each stage will
> apply its effects exactly once to the packet flow, making network testing
> scenarios more reliable and results more deterministic.
>
Cong, this approach has an issue we previously raised - please refer to [2]. I re-posted the summary of the issues with the various other approaches in [3] just 2 days ago in a thread with you on it. As both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics. You pointed out that this doesn't violate manpage semantics, but this is still changing long-term user behavior. It doesn't make sense imo to change one longtime user behavior for another.
Jamal suggested a really reasonable fix with tc_skb_ext - can we please take a look at its soundness and attempt that approach? No user behavior would be affected in that case.
> I tested netem packet duplication in two configurations:
> 1. Nest netem-to-netem hierarchy using parent/child attachment
> 2. Single netem using prio qdisc with netem leaf
>
> Setup commands and results:
>
> Single netem hierarchy (prio + netem):
> tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
> tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
>
> Result: 2x packet multiplication (1→2 packets)
> 2 echo requests + 4 echo replies = 6 total packets
>
> Expected behavior: Only one netem stage exists in this hierarchy, so
> 1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
> 2 echo replies, which also get duplicated to 4 replies, yielding the
> predictable total of 6 packets (2 requests + 4 replies).
>
> Nest netem hierarchy (netem + netem):
> tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
> tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
>
> Result: 4x packet multiplication (1→2→4 packets)
> 4 echo requests + 16 echo replies = 20 total packets
>
> Expected behavior: Root netem duplicates 1 ping to 2 packets, child netem
> receives 2 packets and duplicates each to create 4 total packets. Since
> ping operates bidirectionally, 4 echo requests generate 4 echo replies,
> which also get duplicated through the same hierarchy (4→8→16), resulting
> in the predictable total of 20 packets (4 requests + 16 replies).
>
> The new netem duplication behavior does not break the documented
> semantics of "creates a copy of the packet before queuing." The man page
> description remains true since duplication occurs before the queuing
> process, creating both original and duplicate packets that are then
> enqueued. The documentation does not specify which qdisc should receive
> the duplicates, only that copying happens before queuing. The implementation
> choice to enqueue duplicates to the same qdisc (rather than root) is an
> internal detail that maintains the documented behavior while preventing
> infinite loops in hierarchical configurations.
>
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Reported-by: William Liu will@willsroot.io
>
> Reported-by: Savino Dicanosa savy@syst3mfailure.io
>
> Signed-off-by: Cong Wang xiyou.wangcong@gmail.com
>
> ---
> net/sched/sch_netem.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index fdd79d3ccd8c..191f64bd68ff 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -165,6 +165,7 @@ struct netem_sched_data {
> */
> struct netem_skb_cb {
> u64 time_to_send;
> + u8 duplicate : 1;
> };
>
> static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
> @@ -460,8 +461,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> skb->prev = NULL;
>
>
> /* Random duplication */
> - if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
>
> - ++count;
> + if (q->duplicate) {
>
> + bool dup = true;
> +
> + if (netem_skb_cb(skb)->duplicate) {
>
> + netem_skb_cb(skb)->duplicate = 0;
>
> + dup = false;
> + }
> + if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
>
> + ++count;
> + }
>
> /* Drop packet? */
> if (loss_event(q)) {
> @@ -532,17 +541,12 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc sch,
> }
>
> /
> - * If doing duplication then re-insert at top of the
> - * qdisc tree, since parent queuer expects that only one
> - * skb will be queued.
> + * If doing duplication then re-insert at the same qdisc,
> + * as going back to the root would induce loops.
> */
> if (skb2) {
> - struct Qdisc rootq = qdisc_root_bh(sch);
> - u32 dupsave = q->duplicate; / prevent duplicating a dup... */
>
> -
> - q->duplicate = 0;
>
> - rootq->enqueue(skb2, rootq, to_free);
>
> - q->duplicate = dupsave;
>
> + netem_skb_cb(skb2)->duplicate = 1;
>
> + qdisc_enqueue(skb2, sch, to_free);
> skb2 = NULL;
> }
>
> --
> 2.34.1
[1] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/T/#u
[2] https://lore.kernel.org/netdev/CAM0EoMmTZon=nFmLsDPKhDEzHruw701iV9=mq92At9oKo0LGpA@mail.gmail.com/T/#u
[3] https://lore.kernel.org/netdev/PKMd5btHYmJcKSiIJdtxQvZBEfuS4RQkBnE4M-TZkjUq_Rdj6Wgm8wDmX-p6rIkSRGDJN8ufn0HcDI6-r2lgibdSk7cn1mHIdbZEohJFKMg=@willsroot.io/
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 20:30 ` William Liu
@ 2025-11-26 22:08 ` Cong Wang
2025-11-26 22:43 ` William Liu
0 siblings, 1 reply; 24+ messages in thread
From: Cong Wang @ 2025-11-26 22:08 UTC (permalink / raw)
To: William Liu; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
Hi William,
On Wed, Nov 26, 2025 at 08:30:21PM +0000, William Liu wrote:
> On Wednesday, November 26th, 2025 at 7:53 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> >
> >
> > In the old behavior, duplicated packets were sent back to the root qdisc,
> > which could create dangerous infinite loops in hierarchical setups -
> > imagine a scenario where each level of a multi-stage netem hierarchy kept
> > feeding duplicates back to the top, potentially causing system instability
> > or resource exhaustion.
> >
> > The new behavior elegantly solves this by enqueueing duplicates to the same
> > qdisc that created them, ensuring that packet duplication occurs exactly
> > once per netem stage in a controlled, predictable manner. This change
> > enables users to safely construct complex network emulation scenarios using
> > netem hierarchies (like the 4x multiplication demonstrated in testing)
> > without worrying about runaway packet generation, while still preserving
> > the intended duplication effects.
> >
> > Another advantage of this approach is that it eliminates the enqueue reentrant
> > behaviour which triggered many vulnerabilities. See the last patch in this
> > patchset which updates the test cases for such vulnerabilities.
> >
> > Now users can confidently chain multiple netem qdiscs together to achieve
> > sophisticated network impairment combinations, knowing that each stage will
> > apply its effects exactly once to the packet flow, making network testing
> > scenarios more reliable and results more deterministic.
> >
Thanks for your quick response.
>
> Cong, this approach has an issue we previously raised - please refer to [2]. I re-posted the summary of the issues with the various other approaches in [3] just 2 days ago in a thread with you on it. As both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics. You pointed out that this doesn't violate manpage semantics, but this is still changing long-term user behavior. It doesn't make sense imo to change one longtime user behavior for another.
If you have a better standard than man page, please kindly point it out.
I am happy to follow.
I think we both agree it should not be either my standard or anyone's
personal stardard, this is why I use man page as a neutral and reasonable
stardard.
If you disagree man page is reasonable, please offer a better one for me
to follow. I am very open, I just simply don't know anything better than
man page.
Sorry for my ignorance. Please help me out. :)
>
> Jamal suggested a really reasonable fix with tc_skb_ext - can we please take a look at its soundness and attempt that approach? No user behavior would be affected in that case.
As I already explained, tc_skb_ext is for cross-layer, in this specific
case, we don't cross layers, the skb is immediately queued to the _same_
layer before others.
Could you please kindly explain why you still believe tc_skb_ext is
better? I am very open to your thoughts, please enlighten me here.
Thanks,
Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 22:08 ` Cong Wang
@ 2025-11-26 22:43 ` William Liu
2025-11-26 23:13 ` Cong Wang
0 siblings, 1 reply; 24+ messages in thread
From: William Liu @ 2025-11-26 22:43 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
On Wednesday, November 26th, 2025 at 10:08 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>
> Hi William,
>
> On Wed, Nov 26, 2025 at 08:30:21PM +0000, William Liu wrote:
>
> > On Wednesday, November 26th, 2025 at 7:53 PM, Cong Wang xiyou.wangcong@gmail.com wrote:
> >
> > > In the old behavior, duplicated packets were sent back to the root qdisc,
> > > which could create dangerous infinite loops in hierarchical setups -
> > > imagine a scenario where each level of a multi-stage netem hierarchy kept
> > > feeding duplicates back to the top, potentially causing system instability
> > > or resource exhaustion.
> > >
> > > The new behavior elegantly solves this by enqueueing duplicates to the same
> > > qdisc that created them, ensuring that packet duplication occurs exactly
> > > once per netem stage in a controlled, predictable manner. This change
> > > enables users to safely construct complex network emulation scenarios using
> > > netem hierarchies (like the 4x multiplication demonstrated in testing)
> > > without worrying about runaway packet generation, while still preserving
> > > the intended duplication effects.
> > >
> > > Another advantage of this approach is that it eliminates the enqueue reentrant
> > > behaviour which triggered many vulnerabilities. See the last patch in this
> > > patchset which updates the test cases for such vulnerabilities.
> > >
> > > Now users can confidently chain multiple netem qdiscs together to achieve
> > > sophisticated network impairment combinations, knowing that each stage will
> > > apply its effects exactly once to the packet flow, making network testing
> > > scenarios more reliable and results more deterministic.
>
>
>
> Thanks for your quick response.
>
> > Cong, this approach has an issue we previously raised - please refer to [2]. I re-posted the summary of the issues with the various other approaches in [3] just 2 days ago in a thread with you on it. As both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics. You pointed out that this doesn't violate manpage semantics, but this is still changing long-term user behavior. It doesn't make sense imo to change one longtime user behavior for another.
>
>
> If you have a better standard than man page, please kindly point it out.
> I am happy to follow.
>
> I think we both agree it should not be either my standard or anyone's
> personal stardard, this is why I use man page as a neutral and reasonable
> stardard.
>
> If you disagree man page is reasonable, please offer a better one for me
> to follow. I am very open, I just simply don't know anything better than
> man page.
I agree that your change does not violate manpage semantics. This was the original fix I suggested from the beginning, though other maintainers pointed out the issue that I am relaying.
As I wrote in my previous email, "as both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics."
The previous netem fix changed user behavior that did not violate the manpage (to my knowledge). This one is the same - you are fixing one user behavior break with another. Both are cases of Hyrum's law.
>
> Sorry for my ignorance. Please help me out. :)
>
> > Jamal suggested a really reasonable fix with tc_skb_ext - can we please take a look at its soundness and attempt that approach? No user behavior would be affected in that case.
>
>
> As I already explained, tc_skb_ext is for cross-layer, in this specific
> case, we don't cross layers, the skb is immediately queued to the same
> layer before others.
>
> Could you please kindly explain why you still believe tc_skb_ext is
> better? I am very open to your thoughts, please enlighten me here.
>
Yes, if we re-enqueue the packet to the same netem qdisc, we don't need this, but that changes expected user behavior and may introduce additional correctness issues pointed out above.
If understood Jamal correctly, tc_skb_ext allows us to maintain both the re-entrant at root behavior AND prevent DOS.
I hope you can understand I am trying to relay problems other maintainers have pointed out repeatedly; I personally don't have a strong stake in this.
> Thanks,
> Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 22:43 ` William Liu
@ 2025-11-26 23:13 ` Cong Wang
2025-11-27 2:09 ` William Liu
0 siblings, 1 reply; 24+ messages in thread
From: Cong Wang @ 2025-11-26 23:13 UTC (permalink / raw)
To: William Liu; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
On Wed, Nov 26, 2025 at 10:43:07PM +0000, William Liu wrote:
> >
> > If you have a better standard than man page, please kindly point it out.
> > I am happy to follow.
> >
> > I think we both agree it should not be either my standard or anyone's
> > personal stardard, this is why I use man page as a neutral and reasonable
> > stardard.
> >
> > If you disagree man page is reasonable, please offer a better one for me
> > to follow. I am very open, I just simply don't know anything better than
> > man page.
>
> I agree that your change does not violate manpage semantics. This was the original fix I suggested from the beginning, though other maintainers pointed out the issue that I am relaying.
>
> As I wrote in my previous email, "as both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics."
>
> The previous netem fix changed user behavior that did not violate the manpage (to my knowledge). This one is the same - you are fixing one user behavior break with another. Both are cases of Hyrum's law.
They are two different things here:
1) The behavior of "duplicate" option of netem, which is already
documented in the man page. This is why I use man page as the standard
to follow.
2) There are infinite combinations of TC components, obviously, it is
impossible to document all the combinations. This is also why I don't
think Victor's patch could fix all of them, it is a simple known
unknown.
For 1), the documented behavior is not violated by my patch, as you
agreed.
For 2), there is no known valid combination broken by this patch. At
least not the well-known mq+netem combination.
I am open to be wrong, but no one could even provide any specific case so
far, people just keep talking with speculations, so unfortunately there is
no action I can take with pure speculations.
I hope this now makes better sense to you.
>
> >
> > Sorry for my ignorance. Please help me out. :)
> >
> > > Jamal suggested a really reasonable fix with tc_skb_ext - can we please take a look at its soundness and attempt that approach? No user behavior would be affected in that case.
> >
> >
> > As I already explained, tc_skb_ext is for cross-layer, in this specific
> > case, we don't cross layers, the skb is immediately queued to the same
> > layer before others.
> >
> > Could you please kindly explain why you still believe tc_skb_ext is
> > better? I am very open to your thoughts, please enlighten me here.
> >
>
> Yes, if we re-enqueue the packet to the same netem qdisc, we don't need this, but that changes expected user behavior and may introduce additional correctness issues pointed out above.
Again, it does not violate the man page. What standard are you referring
to when you say "expected user behavior"? Please kindly point me to the
standard you refer here, I am happy to look into it.
>
> If understood Jamal correctly, tc_skb_ext allows us to maintain both the re-entrant at root behavior AND prevent DOS.
No, the whole point of this patch is to change this problematic
behavior, _without_ violating man page.
>
> I hope you can understand I am trying to relay problems other maintainers have pointed out repeatedly; I personally don't have a strong stake in this.
Your independent thoughts are welcome, no one is absolutely right, there
is no one you need to follow or relay.
BTW, I already responed to them. Please let me know how I can be even more
clear.
Regards,
Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 23:13 ` Cong Wang
@ 2025-11-27 2:09 ` William Liu
2025-11-27 3:01 ` Cong Wang
0 siblings, 1 reply; 24+ messages in thread
From: William Liu @ 2025-11-27 2:09 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
On Wednesday, November 26th, 2025 at 11:13 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>
> On Wed, Nov 26, 2025 at 10:43:07PM +0000, William Liu wrote:
>
> > > If you have a better standard than man page, please kindly point it out.
> > > I am happy to follow.
> > >
> > > I think we both agree it should not be either my standard or anyone's
> > > personal stardard, this is why I use man page as a neutral and reasonable
> > > stardard.
> > >
> > > If you disagree man page is reasonable, please offer a better one for me
> > > to follow. I am very open, I just simply don't know anything better than
> > > man page.
> >
> > I agree that your change does not violate manpage semantics. This was the original fix I suggested from the beginning, though other maintainers pointed out the issue that I am relaying.
> >
> > As I wrote in my previous email, "as both Jamal and Stephen have pointed out, this breaks expected user behavior as well, and the enqueuing at root was done for the sake of proper accounting and rate limit semantics."
> >
> > The previous netem fix changed user behavior that did not violate the manpage (to my knowledge). This one is the same - you are fixing one user behavior break with another. Both are cases of Hyrum's law.
>
>
> They are two different things here:
>
> 1) The behavior of "duplicate" option of netem, which is already
> documented in the man page. This is why I use man page as the standard
> to follow.
>
> 2) There are infinite combinations of TC components, obviously, it is
> impossible to document all the combinations. This is also why I don't
> think Victor's patch could fix all of them, it is a simple known
> unknown.
>
> For 1), the documented behavior is not violated by my patch, as you
> agreed.
>
> For 2), there is no known valid combination broken by this patch. At
> least not the well-known mq+netem combination.
>
> I am open to be wrong, but no one could even provide any specific case so
> far, people just keep talking with speculations, so unfortunately there is
> no action I can take with pure speculations.
>
> I hope this now makes better sense to you.
>
> > > Sorry for my ignorance. Please help me out. :)
> > >
> > > > Jamal suggested a really reasonable fix with tc_skb_ext - can we please take a look at its soundness and attempt that approach? No user behavior would be affected in that case.
> > >
> > > As I already explained, tc_skb_ext is for cross-layer, in this specific
> > > case, we don't cross layers, the skb is immediately queued to the same
> > > layer before others.
> > >
> > > Could you please kindly explain why you still believe tc_skb_ext is
> > > better? I am very open to your thoughts, please enlighten me here.
> >
> > Yes, if we re-enqueue the packet to the same netem qdisc, we don't need this, but that changes expected user behavior and may introduce additional correctness issues pointed out above.
>
>
> Again, it does not violate the man page. What standard are you referring
> to when you say "expected user behavior"? Please kindly point me to the
> standard you refer here, I am happy to look into it.
I meant long-time existing user-observable behavior (since 2005).
>
> > If understood Jamal correctly, tc_skb_ext allows us to maintain both the re-entrant at root behavior AND prevent DOS.
>
>
> No, the whole point of this patch is to change this problematic
> behavior, without violating man page.
>
If that's the case, I will defer to other maintainers then.
FWIW, the commit (0afb51e72855) that introduced this mentioned that the current behavior helps "avoid problems with qlen accounting with nested qdisc."
If you were just trying to fix the bug, then a fix that prevents DOS and changes no existing observable behavior is better imo.
> > I hope you can understand I am trying to relay problems other maintainers have pointed out repeatedly; I personally don't have a strong stake in this.
>
>
> Your independent thoughts are welcome, no one is absolutely right, there
> is no one you need to follow or relay.
>
> BTW, I already responed to them. Please let me know how I can be even more
> clear.
>
> Regards,
> Cong
Best,
Will
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-27 2:09 ` William Liu
@ 2025-11-27 3:01 ` Cong Wang
0 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-11-27 3:01 UTC (permalink / raw)
To: William Liu; +Cc: netdev, stephen, kuba, Savino Dicanosa, Jamal Hadi Salim
On Thu, Nov 27, 2025 at 02:09:58AM +0000, William Liu wrote:
> On Wednesday, November 26th, 2025 at 11:13 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > Again, it does not violate the man page. What standard are you referring
> > to when you say "expected user behavior"? Please kindly point me to the
> > standard you refer here, I am happy to look into it.
>
> I meant long-time existing user-observable behavior (since 2005).
If you believe this does not violate man page, then it is safe.
Otherwise, please be specific on how it violates man page. There is only
one sentence in the man page: "creates a copy of the packet before queuing."
Let's reduce it down to two words: "before queuing", please kindly point out
which word my patch violates. I am happy to consider your opinion, but
only when you are willing to help.
Keep saying long-time or user-expected does not help anything here, man
page is the only "contract" we have with the users.
>
> If you were just trying to fix the bug, then a fix that prevents DOS and changes no existing observable behavior is better imo.
The problematic behavior of duplication is the root cause. So, we can't
fix the bug without fixing the root cause.
Let's put security aside, it is still problematic logically. There is no
way to define a logiclly correct behavior with queuing back to root.
Since you ignored my 3-page long changelog, let me copy-n-paste it for
your convenience:
Single netem hierarchy (prio + netem):
tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
Result: 2x packet multiplication (1→2 packets)
2 echo requests + 4 echo replies = 6 total packets
Expected behavior: Only one netem stage exists in this hierarchy, so
1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
2 echo replies, which also get duplicated to 4 replies, yielding the
predictable total of 6 packets (2 requests + 4 replies).
Nest netem hierarchy (netem + netem):
tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
Result: 4x packet multiplication (1→2→4 packets)
4 echo requests + 16 echo replies = 20 total packets
If a netem clone is reinjected at the root, then each duplicate enters the
entire qdisc hierarchy again, so the clone becomes
an input to netem again, producing: 1 → 2 → 4 → 8 → …
This is no longer a single probability distribution; it becomes a cascade
of netem stages, even if you only configured one.
The behavior becomes structural, not probabilistic. Which is not what users
expect when they set duplicate 100%.
They intuitively expect:
- One duplication,
- One netem pass,
- No recursion.
This matches same-qdisc enqueueing.
This is why the right behavior matters for users, regardless of security
concern. Hence it must be corrected.
If any part of it is not clear, please let me know. I am very happy to
explain to you. If you need, we can have a video meeting too, happy to
walk you through this step-by-step.
Thanks,
Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
@ 2025-12-02 0:25 ` Jakub Kicinski
2025-12-03 5:41 ` Cong Wang
2025-12-02 0:40 ` Stephen Hemminger
2025-12-02 9:16 ` Paolo Abeni
2 siblings, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2025-12-02 0:25 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, stephen, Cong Wang
On Wed, 26 Nov 2025 11:52:39 -0800 Cong Wang wrote:
> The netem qdisc has a known security issue with packet duplication
> that makes it unsafe to use in unprivileged contexts. While netem
> typically requires CAP_NET_ADMIN to load, users with "root" privileges
> inside a user namespace also have CAP_NET_ADMIN within that namespace,
> allowing them to potentially exploit this feature.
>
> To address this, we need to restrict the netem duplication to only the
> initial user namespace.
What gives us the confidence that this won't break existing setups?
Pretty sure we use user ns at Meta, tho not sure if any of our
workloads uses both those and netem dup.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
2025-12-02 0:25 ` Jakub Kicinski
@ 2025-12-02 0:40 ` Stephen Hemminger
2025-12-02 9:16 ` Paolo Abeni
2 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2025-12-02 0:40 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, kuba, Cong Wang
On Wed, 26 Nov 2025 11:52:39 -0800
Cong Wang <xiyou.wangcong@gmail.com> wrote:
> From: Cong Wang <cwang@multikernel.io>
>
> The netem qdisc has a known security issue with packet duplication
> that makes it unsafe to use in unprivileged contexts. While netem
> typically requires CAP_NET_ADMIN to load, users with "root" privileges
> inside a user namespace also have CAP_NET_ADMIN within that namespace,
> allowing them to potentially exploit this feature.
>
> To address this, we need to restrict the netem duplication to only the
> initial user namespace.
>
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Signed-off-by: Cong Wang <cwang@multikernel.io>
> ---
Duplication logic should be fixed rather than trying to build
a wall around it.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
2025-12-02 0:25 ` Jakub Kicinski
2025-12-02 0:40 ` Stephen Hemminger
@ 2025-12-02 9:16 ` Paolo Abeni
2 siblings, 0 replies; 24+ messages in thread
From: Paolo Abeni @ 2025-12-02 9:16 UTC (permalink / raw)
To: Cong Wang, netdev; +Cc: stephen, kuba, Cong Wang
On 11/26/25 8:52 PM, Cong Wang wrote:
> From: Cong Wang <cwang@multikernel.io>
>
> The netem qdisc has a known security issue with packet duplication
> that makes it unsafe to use in unprivileged contexts. While netem
> typically requires CAP_NET_ADMIN to load, users with "root" privileges
> inside a user namespace also have CAP_NET_ADMIN within that namespace,
> allowing them to potentially exploit this feature.
>
> To address this, we need to restrict the netem duplication to only the
> initial user namespace.
>
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Signed-off-by: Cong Wang <cwang@multikernel.io>
If I read correctly, this will prevent any kind of duplication inside
some/most container orchestration platforms.
I think that nowadays completely disabling a feature for containers is
not much different than removing that feature entirely.
/P
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg()
2025-11-26 19:52 ` [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg() Cong Wang
@ 2025-12-02 9:20 ` Paolo Abeni
2025-12-03 5:42 ` Cong Wang
2025-12-02 21:18 ` Xiang Mei
1 sibling, 1 reply; 24+ messages in thread
From: Paolo Abeni @ 2025-12-02 9:20 UTC (permalink / raw)
To: Cong Wang, netdev; +Cc: stephen, kuba, Xiang Mei
On 11/26/25 8:52 PM, Cong Wang wrote:
> qfq_choose_next_agg() could return NULL so its return value should be
> properly checked unless NULL is acceptable.
>
> There are two cases we need to deal with:
>
> 1) q->in_serv_agg, which is okay with NULL since it is either checked or
> just compared with other pointer without dereferencing. In fact, it
> is even intentionally set to NULL in one of the cases.
>
> 2) in_serv_agg, which is a temporary local variable, which is not okay
> with NULL, since it is dereferenced immediately, hence must be checked.
>
> This fix corrects one of the 2nd cases, and leaving the 1st case as they are.
>
> Although this bug is triggered with the netem duplicate change, the root
> cause is still within qfq qdisc.
Given the above, I think this patch should come first in the series WRT
"net_sched: Implement the right netem duplication behavior"
/P
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg()
2025-11-26 19:52 ` [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg() Cong Wang
2025-12-02 9:20 ` Paolo Abeni
@ 2025-12-02 21:18 ` Xiang Mei
1 sibling, 0 replies; 24+ messages in thread
From: Xiang Mei @ 2025-12-02 21:18 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, stephen, kuba
On Wed, Nov 26, 2025 at 11:52:40AM -0800, Cong Wang wrote:
> qfq_choose_next_agg() could return NULL so its return value should be
> properly checked unless NULL is acceptable.
>
> There are two cases we need to deal with:
>
> 1) q->in_serv_agg, which is okay with NULL since it is either checked or
> just compared with other pointer without dereferencing. In fact, it
> is even intentionally set to NULL in one of the cases.
>
> 2) in_serv_agg, which is a temporary local variable, which is not okay
> with NULL, since it is dereferenced immediately, hence must be checked.
>
> This fix corrects one of the 2nd cases, and leaving the 1st case as they are.
>
> Although this bug is triggered with the netem duplicate change, the root
> cause is still within qfq qdisc.
>
> Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
> Cc: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
> net/sched/sch_qfq.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
> index 2255355e51d3..a438ebaf53e0 100644
> --- a/net/sched/sch_qfq.c
> +++ b/net/sched/sch_qfq.c
> @@ -1145,6 +1145,8 @@ static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
> * choose the new aggregate to serve.
> */
> in_serv_agg = q->in_serv_agg = qfq_choose_next_agg(q);
> + if (!in_serv_agg)
> + return NULL;
> skb = qfq_peek_skb(in_serv_agg, &cl, &len);
> }
> if (!skb)
> --
> 2.34.1
>
The explanations make sense to me. The patch also handles the case
correctly.
Reviewed-by: Xiang Mei <xmei5@asu.edu>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace
2025-12-02 0:25 ` Jakub Kicinski
@ 2025-12-03 5:41 ` Cong Wang
0 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-12-03 5:41 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, stephen, Cong Wang
On Mon, Dec 01, 2025 at 04:25:24PM -0800, Jakub Kicinski wrote:
> On Wed, 26 Nov 2025 11:52:39 -0800 Cong Wang wrote:
> > The netem qdisc has a known security issue with packet duplication
> > that makes it unsafe to use in unprivileged contexts. While netem
> > typically requires CAP_NET_ADMIN to load, users with "root" privileges
> > inside a user namespace also have CAP_NET_ADMIN within that namespace,
> > allowing them to potentially exploit this feature.
> >
> > To address this, we need to restrict the netem duplication to only the
> > initial user namespace.
>
> What gives us the confidence that this won't break existing setups?
> Pretty sure we use user ns at Meta, tho not sure if any of our
> workloads uses both those and netem dup.
All the reports (https://bugzilla.kernel.org/show_bug.cgi?id=220774) we
had so far didn't mention user namespace. This is the only data point I
have.
I can drop this patch, but I am not sure if patch 3/9 is sufficient to
convince Will on user namespace security.
Regards,
Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg()
2025-12-02 9:20 ` Paolo Abeni
@ 2025-12-03 5:42 ` Cong Wang
0 siblings, 0 replies; 24+ messages in thread
From: Cong Wang @ 2025-12-03 5:42 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, stephen, kuba, Xiang Mei
On Tue, Dec 02, 2025 at 10:20:20AM +0100, Paolo Abeni wrote:
> On 11/26/25 8:52 PM, Cong Wang wrote:
> > qfq_choose_next_agg() could return NULL so its return value should be
> > properly checked unless NULL is acceptable.
> >
> > There are two cases we need to deal with:
> >
> > 1) q->in_serv_agg, which is okay with NULL since it is either checked or
> > just compared with other pointer without dereferencing. In fact, it
> > is even intentionally set to NULL in one of the cases.
> >
> > 2) in_serv_agg, which is a temporary local variable, which is not okay
> > with NULL, since it is dereferenced immediately, hence must be checked.
> >
> > This fix corrects one of the 2nd cases, and leaving the 1st case as they are.
> >
> > Although this bug is triggered with the netem duplicate change, the root
> > cause is still within qfq qdisc.
>
> Given the above, I think this patch should come first in the series WRT
> "net_sched: Implement the right netem duplication behavior"
Will do.
Regards,
Cong
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior
2025-11-26 19:52 ` [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior Cong Wang
2025-11-26 20:30 ` William Liu
@ 2025-12-03 15:05 ` Stephen Hemminger
1 sibling, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2025-12-03 15:05 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, kuba, William Liu, Savino Dicanosa
On Wed, 26 Nov 2025 11:52:38 -0800
Cong Wang <xiyou.wangcong@gmail.com> wrote:
> In the old behavior, duplicated packets were sent back to the root qdisc,
> which could create dangerous infinite loops in hierarchical setups -
> imagine a scenario where each level of a multi-stage netem hierarchy kept
> feeding duplicates back to the top, potentially causing system instability
> or resource exhaustion.
>
> The new behavior elegantly solves this by enqueueing duplicates to the same
> qdisc that created them, ensuring that packet duplication occurs exactly
> once per netem stage in a controlled, predictable manner. This change
> enables users to safely construct complex network emulation scenarios using
> netem hierarchies (like the 4x multiplication demonstrated in testing)
> without worrying about runaway packet generation, while still preserving
> the intended duplication effects.
>
> Another advantage of this approach is that it eliminates the enqueue reentrant
> behaviour which triggered many vulnerabilities. See the last patch in this
> patchset which updates the test cases for such vulnerabilities.
>
> Now users can confidently chain multiple netem qdiscs together to achieve
> sophisticated network impairment combinations, knowing that each stage will
> apply its effects exactly once to the packet flow, making network testing
> scenarios more reliable and results more deterministic.
>
> I tested netem packet duplication in two configurations:
> 1. Nest netem-to-netem hierarchy using parent/child attachment
> 2. Single netem using prio qdisc with netem leaf
>
> Setup commands and results:
>
> Single netem hierarchy (prio + netem):
> tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
> tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
>
> Result: 2x packet multiplication (1→2 packets)
> 2 echo requests + 4 echo replies = 6 total packets
>
> Expected behavior: Only one netem stage exists in this hierarchy, so
> 1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
> 2 echo replies, which also get duplicated to 4 replies, yielding the
> predictable total of 6 packets (2 requests + 4 replies).
>
> Nest netem hierarchy (netem + netem):
> tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
> tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
>
> Result: 4x packet multiplication (1→2→4 packets)
> 4 echo requests + 16 echo replies = 20 total packets
>
> Expected behavior: Root netem duplicates 1 ping to 2 packets, child netem
> receives 2 packets and duplicates each to create 4 total packets. Since
> ping operates bidirectionally, 4 echo requests generate 4 echo replies,
> which also get duplicated through the same hierarchy (4→8→16), resulting
> in the predictable total of 20 packets (4 requests + 16 replies).
>
> The new netem duplication behavior does not break the documented
> semantics of "creates a copy of the packet before queuing." The man page
> description remains true since duplication occurs before the queuing
> process, creating both original and duplicate packets that are then
> enqueued. The documentation does not specify which qdisc should receive
> the duplicates, only that copying happens before queuing. The implementation
> choice to enqueue duplicates to the same qdisc (rather than root) is an
> internal detail that maintains the documented behavior while preventing
> infinite loops in hierarchical configurations.
>
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Reported-by: William Liu <will@willsroot.io>
> Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
> net/sched/sch_netem.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index fdd79d3ccd8c..191f64bd68ff 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -165,6 +165,7 @@ struct netem_sched_data {
> */
> struct netem_skb_cb {
> u64 time_to_send;
> + u8 duplicate : 1;
> };
>
> static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
> @@ -460,8 +461,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> skb->prev = NULL;
>
> /* Random duplication */
> - if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> - ++count;
> + if (q->duplicate) {
> + bool dup = true;
> +
> + if (netem_skb_cb(skb)->duplicate) {
> + netem_skb_cb(skb)->duplicate = 0;
> + dup = false;
> + }
> + if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> + ++count;
> + }
>
> /* Drop packet? */
> if (loss_event(q)) {
> @@ -532,17 +541,12 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> }
>
> /*
> - * If doing duplication then re-insert at top of the
> - * qdisc tree, since parent queuer expects that only one
> - * skb will be queued.
> + * If doing duplication then re-insert at the same qdisc,
> + * as going back to the root would induce loops.
> */
> if (skb2) {
> - struct Qdisc *rootq = qdisc_root_bh(sch);
> - u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
> -
> - q->duplicate = 0;
> - rootq->enqueue(skb2, rootq, to_free);
> - q->duplicate = dupsave;
> + netem_skb_cb(skb2)->duplicate = 1;
> + qdisc_enqueue(skb2, sch, to_free);
> skb2 = NULL;
> }
>
I wonder if a lot of the issues would go away if netem used a workqueue
to do the duplication. It would avoid nested calls etc.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-12-03 15:05 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-26 19:52 [Patch net v5 0/9] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-11-26 19:52 ` [Patch net v5 1/9] Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree" Cong Wang
2025-11-26 19:52 ` [Patch net v5 2/9] Revert "selftests/tc-testing: Add tests for restrictions on netem duplication" Cong Wang
2025-11-26 19:52 ` [Patch net v5 3/9] net_sched: Implement the right netem duplication behavior Cong Wang
2025-11-26 20:30 ` William Liu
2025-11-26 22:08 ` Cong Wang
2025-11-26 22:43 ` William Liu
2025-11-26 23:13 ` Cong Wang
2025-11-27 2:09 ` William Liu
2025-11-27 3:01 ` Cong Wang
2025-12-03 15:05 ` Stephen Hemminger
2025-11-26 19:52 ` [Patch net v5 4/9] net_sched: Prevent using netem duplication in non-initial user namespace Cong Wang
2025-12-02 0:25 ` Jakub Kicinski
2025-12-03 5:41 ` Cong Wang
2025-12-02 0:40 ` Stephen Hemminger
2025-12-02 9:16 ` Paolo Abeni
2025-11-26 19:52 ` [Patch net v5 5/9] net_sched: Check the return value of qfq_choose_next_agg() Cong Wang
2025-12-02 9:20 ` Paolo Abeni
2025-12-03 5:42 ` Cong Wang
2025-12-02 21:18 ` Xiang Mei
2025-11-26 19:52 ` [Patch net v5 6/9] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
2025-11-26 19:52 ` [Patch net v5 7/9] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
2025-11-26 19:52 ` [Patch net v5 8/9] selftests/tc-testing: Add a test case for mq " Cong Wang
2025-11-26 19:52 ` [Patch net v5 9/9] selftests/tc-testing: Update test cases " Cong Wang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).