* [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops
@ 2025-07-13 21:47 Cong Wang
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 21:47 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Cong Wang
This patchset fixes the infinite loops due to duplication in netem, the
real root cause of this problem is enqueuing to the root qdisc, which is
now changed to enqueuing to the same qdisc. This is more reasonable,
more predictable from users' perspective, and more elegant.
Please see more details in patch 1/4 which contains two pages of
detailed explanation including why it is safe and better.
This replaces the patches from William, with much less code and without
any workaround. More importantly, this does not break any use case at
all.
---
v3: Fixed the root cause of enqueuing to root
Switched back to netem_skb_cb safely
Added two more test cases
v2: Fixed a typo
Improved tdc selftest to check sent bytes
Cong Wang (4):
net_sched: Implement the right netem duplication behavior
selftests/tc-testing: Add a nested netem duplicate test
selftests/tc-testing: Add a test case for piro with netem duplicate
selftests/tc-testing: Add a test case for mq with netem duplicate
net/sched/sch_netem.c | 26 ++++----
.../tc-testing/tc-tests/infra/qdiscs.json | 59 +++++++++++++++++++
.../tc-testing/tc-tests/qdiscs/netem.json | 25 ++++++++
3 files changed, 99 insertions(+), 11 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-13 21:47 [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
@ 2025-07-13 21:47 ` Cong Wang
2025-07-13 22:01 ` Cong Wang
` (2 more replies)
2025-07-13 21:47 ` [Patch v3 net 2/4] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
` (2 subsequent siblings)
3 siblings, 3 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 21:47 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Cong Wang, Savino Dicanosa
In the old behavior, duplicated packets were sent back to the root qdisc,
which could create dangerous infinite loops in hierarchical setups -
imagine a scenario where each level of a multi-stage netem hierarchy kept
feeding duplicates back to the top, potentially causing system instability
or resource exhaustion.
The new behavior elegantly solves this by enqueueing duplicates to the same
qdisc that created them, ensuring that packet duplication occurs exactly
once per netem stage in a controlled, predictable manner. This change
enables users to safely construct complex network emulation scenarios using
netem hierarchies (like the 4x multiplication demonstrated in testing)
without worrying about runaway packet generation, while still preserving
the intended duplication effects.
Users can now confidently chain multiple netem qdiscs together to achieve
sophisticated network impairment combinations, knowing that each stage will
apply its effects exactly once to the packet flow, making network testing
scenarios more reliable and results more deterministic.
I tested netem packet duplication in two configurations:
1. Nest netem-to-netem hierarchy using parent/child attachment
2. Single netem using prio qdisc with netem leaf
Setup commands and results:
Single netem hierarchy (prio + netem):
tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
Result: 2x packet multiplication (1→2 packets)
2 echo requests + 4 echo replies = 6 total packets
Expected behavior: Only one netem stage exists in this hierarchy, so
1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
2 echo replies, which also get duplicated to 4 replies, yielding the
predictable total of 6 packets (2 requests + 4 replies).
Nest netem hierarchy (netem + netem):
tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
Result: 4x packet multiplication (1→2→4 packets)
4 echo requests + 16 echo replies = 20 total packets
Expected behavior: Root netem duplicates 1 ping to 2 packets, child netem
receives 2 packets and duplicates each to create 4 total packets. Since
ping operates bidirectionally, 4 echo requests generate 4 echo replies,
which also get duplicated through the same hierarchy (4→8→16), resulting
in the predictable total of 20 packets (4 requests + 16 replies).
The new netem duplication behavior does not break the documented
semantics of "creates a copy of the packet before queuing." The man page
description remains true since duplication occurs before the queuing
process, creating both original and duplicate packets that are then
enqueued. The documentation does not specify which qdisc should receive
the duplicates, only that copying happens before queuing. The implementation
choice to enqueue duplicates to the same qdisc (rather than root) is an
internal detail that maintains the documented behavior while preventing
infinite loops in hierarchical configurations.
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
net/sched/sch_netem.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fdd79d3ccd8c..191f64bd68ff 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -165,6 +165,7 @@ struct netem_sched_data {
*/
struct netem_skb_cb {
u64 time_to_send;
+ u8 duplicate : 1;
};
static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -460,8 +461,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
skb->prev = NULL;
/* Random duplication */
- if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
- ++count;
+ if (q->duplicate) {
+ bool dup = true;
+
+ if (netem_skb_cb(skb)->duplicate) {
+ netem_skb_cb(skb)->duplicate = 0;
+ dup = false;
+ }
+ if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
+ ++count;
+ }
/* Drop packet? */
if (loss_event(q)) {
@@ -532,17 +541,12 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
}
/*
- * If doing duplication then re-insert at top of the
- * qdisc tree, since parent queuer expects that only one
- * skb will be queued.
+ * If doing duplication then re-insert at the same qdisc,
+ * as going back to the root would induce loops.
*/
if (skb2) {
- struct Qdisc *rootq = qdisc_root_bh(sch);
- u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
-
- q->duplicate = 0;
- rootq->enqueue(skb2, rootq, to_free);
- q->duplicate = dupsave;
+ netem_skb_cb(skb2)->duplicate = 1;
+ qdisc_enqueue(skb2, sch, to_free);
skb2 = NULL;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Patch v3 net 2/4] selftests/tc-testing: Add a nested netem duplicate test
2025-07-13 21:47 [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
@ 2025-07-13 21:47 ` Cong Wang
2025-07-13 21:47 ` [Patch v3 net 3/4] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
2025-07-13 21:47 ` [Patch v3 net 4/4] selftests/tc-testing: Add a test case for mq " Cong Wang
3 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 21:47 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Cong Wang
Integrate the reproducer from William into tc-testing and use scapy
to generate packets for testing:
# ./tdc.py -e 1676
-- ns/SubPlugin.__init__
-- scapy/SubPlugin.__init__
Test 1676: NETEM nested duplicate 100%
[ 1154.071135] v0p0id1676: entered promiscuous mode
[ 1154.101066] virtio_net virtio0 enp1s0: entered promiscuous mode
[ 1154.146301] virtio_net virtio0 enp1s0: left promiscuous mode
.
Sent 1 packets.
[ 1154.173695] v0p0id1676: left promiscuous mode
[ 1154.216159] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.238398] v0p0id1676: left promiscuous mode
[ 1154.260909] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.282708] v0p0id1676: left promiscuous mode
[ 1154.309399] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.337949] v0p0id1676: left promiscuous mode
[ 1154.360924] v0p0id1676: entered promiscuous mode
.
Sent 1 packets.
[ 1154.383522] v0p0id1676: left promiscuous mode
All test results:
1..1
ok 1 1676 - NETEM nested duplicate 100%
Reported-by: William Liu <will@willsroot.io>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/qdiscs/netem.json | 25 +++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
index 3c4444961488..03c4ceb22990 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
@@ -336,5 +336,30 @@
"teardown": [
"$TC qdisc del dev $DUMMY handle 1: root"
]
+ },
+ {
+ "id": "1676",
+ "name": "NETEM nested duplicate 100%",
+ "category": ["qdisc", "netem"],
+ "setup": [
+ "$TC qdisc add dev $DEV1 root handle 1: netem limit 1000 duplicate 100%",
+ "$TC qdisc add dev $DEV1 parent 1: handle 2: netem limit 1000 duplicate 100%"
+ ],
+ "teardown": [
+ "$TC qdisc del dev $DEV1 root"
+ ],
+ "plugins": {
+ "requires": ["nsPlugin", "scapyPlugin"]
+ },
+ "scapy": {
+ "iface": "$DEV0",
+ "count": 5,
+ "packet": "Ether()/IP(dst='10.10.10.1', src='10.10.10.10')/TCP(sport=12345, dport=80)"
+ },
+ "cmdUnderTest": "$TC -s qdisc show dev $DEV1",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -s qdisc show dev $DEV1",
+ "matchPattern": "Sent [0-9]+ bytes [0-9]+ pkt",
+ "matchCount": "2"
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Patch v3 net 3/4] selftests/tc-testing: Add a test case for piro with netem duplicate
2025-07-13 21:47 [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
2025-07-13 21:47 ` [Patch v3 net 2/4] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
@ 2025-07-13 21:47 ` Cong Wang
2025-07-13 21:47 ` [Patch v3 net 4/4] selftests/tc-testing: Add a test case for mq " Cong Wang
3 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 21:47 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Cong Wang
Integrate the test case from Jamal into tc-testing:
Test 94a7: Test PRIO with NETEM duplication
All test results:
1..1
ok 1 94a7 - Test PRIO with NETEM duplication
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index 5c6851e8d311..bfa6de751270 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -672,5 +672,34 @@
"teardown": [
"$TC qdisc del dev $DUMMY root handle 1: drr"
]
+ },
+ {
+ "id": "94a7",
+ "name": "Test PRIO with NETEM duplication",
+ "category": [
+ "qdisc",
+ "prio",
+ "netem"
+ ],
+ "plugins": {
+ "requires": [
+ "nsPlugin"
+ ]
+ },
+ "setup": [
+ "$IP link set dev $DUMMY up || true",
+ "$IP addr add 10.10.11.10/24 dev $DUMMY || true",
+ "$TC qdisc add dev $DUMMY root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
+ "$TC filter add dev $DUMMY parent 1:0 protocol ip matchall classid 1:1",
+ "$TC qdisc add dev $DUMMY parent 1:1 handle 10: netem limit 4 duplicate 100%"
+ ],
+ "cmdUnderTest": "ping -c1 -W0.01 -I $DUMMY 10.10.11.11",
+ "expExitCode": "1",
+ "verifyCmd": "$TC -s qdisc show dev $DUMMY | grep -A 5 'qdisc netem' | grep -E 'Sent [0-9]+ bytes [0-9]+ pkt'",
+ "matchPattern": "Sent \\d+ bytes (\\d+) pkt",
+ "matchCount": "1",
+ "teardown": [
+ "$TC qdisc del dev $DUMMY root handle 1: prio"
+ ]
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Patch v3 net 4/4] selftests/tc-testing: Add a test case for mq with netem duplicate
2025-07-13 21:47 [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
` (2 preceding siblings ...)
2025-07-13 21:47 ` [Patch v3 net 3/4] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
@ 2025-07-13 21:47 ` Cong Wang
3 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 21:47 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Cong Wang
This is a very reasonable use case for multiqueue NIC, add it to
tc-testing to ensure no one should break it.
Test 94a8: Test MQ with NETEM duplication
[ 753.877611] v0p0id94a8: entered promiscuous mode
[ 753.909783] virtio_net virtio0 enp1s0: entered promiscuous mode
[ 753.936686] virtio_net virtio0 enp1s0: left promiscuous mode
.
Sent 1 packets.
[ 753.984974] v0p0id94a8: left promiscuous mode
[ 754.010725] v0p0id94a8: entered promiscuous mode
.
Sent 1 packets.
[ 754.030879] v0p0id94a8: left promiscuous mode
[ 754.067966] v0p0id94a8: entered promiscuous mode
.
Sent 1 packets.
[ 754.096516] v0p0id94a8: left promiscuous mode
[ 754.129166] v0p0id94a8: entered promiscuous mode
.
Sent 1 packets.
[ 754.156371] v0p0id94a8: left promiscuous mode
[ 754.187278] v0p0id94a8: entered promiscuous mode
.
Sent 1 packets.
[ 754.212102] v0p0id94a8: left promiscuous mode
All test results:
1..1
ok 1 94a8 - Test MQ with NETEM duplication
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
.../tc-testing/tc-tests/infra/qdiscs.json | 30 +++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
index bfa6de751270..8e206260fa79 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
@@ -701,5 +701,35 @@
"teardown": [
"$TC qdisc del dev $DUMMY root handle 1: prio"
]
+ },
+ {
+ "id": "94a8",
+ "name": "Test MQ with NETEM duplication",
+ "category": [
+ "qdisc",
+ "mq",
+ "netem"
+ ],
+ "plugins": {
+ "requires": ["nsPlugin", "scapyPlugin"]
+ },
+ "setup": [
+ "$IP link set dev $DEV1 up",
+ "$TC qdisc add dev $DEV1 root handle 1: mq",
+ "$TC qdisc add dev $DEV1 parent 1:1 handle 10: netem duplicate 100%"
+ ],
+ "scapy": {
+ "iface": "$DEV0",
+ "count": 5,
+ "packet": "Ether()/IP(dst='10.10.10.1', src='10.10.10.10')/ICMP()"
+ },
+ "cmdUnderTest": "$TC -s qdisc show dev $DEV1",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -s qdisc show dev $DEV1 | grep -A 5 'qdisc netem' | grep -E 'Sent [0-9]+ bytes [0-9]+ pkt'",
+ "matchPattern": "Sent \\d+ bytes (\\d+) pkt",
+ "matchCount": "1",
+ "teardown": [
+ "$TC qdisc del dev $DEV1 root handle 1: mq"
+ ]
}
]
--
2.34.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
@ 2025-07-13 22:01 ` Cong Wang
2025-07-13 22:12 ` Stephen Hemminger
2025-07-14 2:30 ` William Liu
2 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-13 22:01 UTC (permalink / raw)
To: netdev; +Cc: jhs, will, stephen, Savino Dicanosa
On Sun, Jul 13, 2025 at 02:47:45PM -0700, Cong Wang wrote:
> I tested netem packet duplication in two configurations:
> 1. Nest netem-to-netem hierarchy using parent/child attachment
> 2. Single netem using prio qdisc with netem leaf
>
Below is the complete tcpdump output I had for each of the above setup,
just provide more evidence here since a lot of people don't trust me.
(Don't get me wrong, it is obviously my fault of not being able to gain
trust).
[root@localhost ~]# tcpdump -i lo -nn -c 20 icmp
[ 589.079074] lo: entered promiscuous mode
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:59:23.485638 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 5, seq 1, length 64
12:59:23.485844 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 5, seq 1, length 64
12:59:23.486714 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 5, seq 1, length 64
12:59:23.486996 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 5, seq 1, length 64
12:59:23.487867 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 5, seq 1, length 64
12:59:23.488163 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 5, seq 1, length 64
[root@localhost ~]# tcpdump -i lo -nn -c 20 icmp
[ 361.831773] lo: entered promiscuous mode
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:55:37.074400 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 3, seq 1, length 64
12:55:37.074606 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 3, seq 1, length 64
12:55:37.074806 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 3, seq 1, length 64
12:55:37.075012 IP 127.0.0.1 > 127.0.0.1: ICMP echo request, id 3, seq 1, length 64
12:55:37.076508 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.076789 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.077069 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.077404 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.078825 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.079109 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.079404 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.079927 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.081125 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.081477 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.081763 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.082044 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.083253 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.083534 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.083816 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
12:55:37.084101 IP 127.0.0.1 > 127.0.0.1: ICMP echo reply, id 3, seq 1, length 64
20 packets captured
40 packets received by filter
I didn't include them because the patch description is already very
long. I am happy to add them to the patch description on request.
Please let me know if I miss anything here. I am very very open to
continue iterating this patch.
Thanks!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
2025-07-13 22:01 ` Cong Wang
@ 2025-07-13 22:12 ` Stephen Hemminger
2025-07-15 17:48 ` Cong Wang
2025-07-14 2:30 ` William Liu
2 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2025-07-13 22:12 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, jhs, will, Savino Dicanosa
On Sun, 13 Jul 2025 14:47:45 -0700
Cong Wang <xiyou.wangcong@gmail.com> wrote:
> + if (q->duplicate) {
> + bool dup = true;
> +
> + if (netem_skb_cb(skb)->duplicate) {
> + netem_skb_cb(skb)->duplicate = 0;
> + dup = false;
> + }
> + if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> + ++count;
> + }
Doesn't look ideal.
Why do yo need the temporary variable here?
And you risk having bug where first duplicate sets the flag then second clears it
and a third layer would do duplicate and reset it.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
2025-07-13 22:01 ` Cong Wang
2025-07-13 22:12 ` Stephen Hemminger
@ 2025-07-14 2:30 ` William Liu
2025-07-15 18:03 ` Cong Wang
2 siblings, 1 reply; 11+ messages in thread
From: William Liu @ 2025-07-14 2:30 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, jhs, stephen, Savino Dicanosa
On Sunday, July 13th, 2025 at 9:48 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>
> In the old behavior, duplicated packets were sent back to the root qdisc,
> which could create dangerous infinite loops in hierarchical setups -
> imagine a scenario where each level of a multi-stage netem hierarchy kept
> feeding duplicates back to the top, potentially causing system instability
> or resource exhaustion.
>
> The new behavior elegantly solves this by enqueueing duplicates to the same
> qdisc that created them, ensuring that packet duplication occurs exactly
> once per netem stage in a controlled, predictable manner. This change
> enables users to safely construct complex network emulation scenarios using
> netem hierarchies (like the 4x multiplication demonstrated in testing)
> without worrying about runaway packet generation, while still preserving
> the intended duplication effects.
>
> Users can now confidently chain multiple netem qdiscs together to achieve
> sophisticated network impairment combinations, knowing that each stage will
> apply its effects exactly once to the packet flow, making network testing
> scenarios more reliable and results more deterministic.
>
> I tested netem packet duplication in two configurations:
> 1. Nest netem-to-netem hierarchy using parent/child attachment
> 2. Single netem using prio qdisc with netem leaf
>
> Setup commands and results:
>
> Single netem hierarchy (prio + netem):
> tc qdisc add dev lo root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> tc filter add dev lo parent 1:0 protocol ip matchall classid 1:1
> tc qdisc add dev lo parent 1:1 handle 10: netem limit 4 duplicate 100%
>
> Result: 2x packet multiplication (1→2 packets)
> 2 echo requests + 4 echo replies = 6 total packets
>
> Expected behavior: Only one netem stage exists in this hierarchy, so
> 1 ping becomes 2 packets (100% duplication). The 2 echo requests generate
> 2 echo replies, which also get duplicated to 4 replies, yielding the
> predictable total of 6 packets (2 requests + 4 replies).
>
> Nest netem hierarchy (netem + netem):
> tc qdisc add dev lo root handle 1: netem limit 1000 duplicate 100%
> tc qdisc add dev lo parent 1: handle 2: netem limit 1000 duplicate 100%
>
> Result: 4x packet multiplication (1→2→4 packets)
> 4 echo requests + 16 echo replies = 20 total packets
>
> Expected behavior: Root netem duplicates 1 ping to 2 packets, child netem
> receives 2 packets and duplicates each to create 4 total packets. Since
> ping operates bidirectionally, 4 echo requests generate 4 echo replies,
> which also get duplicated through the same hierarchy (4→8→16), resulting
> in the predictable total of 20 packets (4 requests + 16 replies).
>
> The new netem duplication behavior does not break the documented
> semantics of "creates a copy of the packet before queuing." The man page
> description remains true since duplication occurs before the queuing
> process, creating both original and duplicate packets that are then
> enqueued. The documentation does not specify which qdisc should receive
> the duplicates, only that copying happens before queuing. The implementation
> choice to enqueue duplicates to the same qdisc (rather than root) is an
> internal detail that maintains the documented behavior while preventing
> infinite loops in hierarchical configurations.
>
> Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
> Reported-by: William Liu will@willsroot.io
>
> Reported-by: Savino Dicanosa savy@syst3mfailure.io
>
> Signed-off-by: Cong Wang xiyou.wangcong@gmail.com
>
> ---
> net/sched/sch_netem.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index fdd79d3ccd8c..191f64bd68ff 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -165,6 +165,7 @@ struct netem_sched_data {
> */
> struct netem_skb_cb {
> u64 time_to_send;
> + u8 duplicate : 1;
> };
>
> static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
> @@ -460,8 +461,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> skb->prev = NULL;
>
>
> /* Random duplication */
> - if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
>
> - ++count;
> + if (q->duplicate) {
>
> + bool dup = true;
> +
> + if (netem_skb_cb(skb)->duplicate) {
>
> + netem_skb_cb(skb)->duplicate = 0;
>
> + dup = false;
> + }
> + if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
>
> + ++count;
> + }
>
> /* Drop packet? */
> if (loss_event(q)) {
> @@ -532,17 +541,12 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc sch,
> }
>
> /
> - * If doing duplication then re-insert at top of the
> - * qdisc tree, since parent queuer expects that only one
> - * skb will be queued.
> + * If doing duplication then re-insert at the same qdisc,
> + * as going back to the root would induce loops.
> */
> if (skb2) {
> - struct Qdisc rootq = qdisc_root_bh(sch);
> - u32 dupsave = q->duplicate; / prevent duplicating a dup... */
>
> -
> - q->duplicate = 0;
>
> - rootq->enqueue(skb2, rootq, to_free);
>
> - q->duplicate = dupsave;
>
> + netem_skb_cb(skb2)->duplicate = 1;
>
> + qdisc_enqueue(skb2, sch, to_free);
> skb2 = NULL;
> }
>
> --
> 2.34.1
FWIW, I suggested changing this behavior to not enqueue from the root a while ago too on the security mailing list for the HFSC rsc bug (as the re-entrancy violated assumptions in other qdiscs), but was told some users might be expecting that behavior and we would break their setups.
If we really want to preserve the ability to have multiple duplicating netems in a tree, I think Jamal had a good suggestion here to rely on tc_skb_ext extensions [1].
However, I noted that there are implementation issues that we would have to deal with. Copying what I said there [2]:
"The tc_skb_ext approach has a problem... the config option that enables it is NET_TC_SKB_EXT. I assumed this is a generic name for skb extensions in the tc subsystem, but unfortunately this is hardcoded for NET_CLS_ACT recirculation support.
So what this means is we have the following choices:
1. Make SCH_NETEM depend on NET_CLS_ACT and NET_TC_SKB_EXT
2. Add "|| IS_ENABLED(CONFIG_SCH_NETEM)" next to "IS_ENABLED(CONFIG_NET_TC_SKB_EXT)"
3. Separate NET_TC_SKB_EXT and the idea of recirculation support. But I'm not sure how people feel about renaming config options. And this would require a small change to the Mellanox driver subsystem.
None of these sound too nice to do, and I'm not sure which approach to take. In an ideal world, 3 would be best, but I'm not sure how others would feel about all that just to account for a netem edge case."
Of course, we can add an extra extension enum for netem but that will just make this even messier imo.
[1] https://lore.kernel.org/netdev/CAM0EoMmBdZBzfUAms5-0hH5qF5ODvxWfgqrbHaGT6p3-uOD6vg@mail.gmail.com/
[2] https://lore.kernel.org/netdev/lhR3z8brE3wSKO4PDITIAGXGGW8vnrt1zIPo7C10g2rH0zdQ1lA8zFOuUBklLOTAgMcw4Z6N5YnqRXRzWnkHO-unr5g62msCAUHow-NmY7k=@willsroot.io/
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-13 22:12 ` Stephen Hemminger
@ 2025-07-15 17:48 ` Cong Wang
0 siblings, 0 replies; 11+ messages in thread
From: Cong Wang @ 2025-07-15 17:48 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, jhs, will, Savino Dicanosa
On Sun, Jul 13, 2025 at 03:12:20PM -0700, Stephen Hemminger wrote:
> On Sun, 13 Jul 2025 14:47:45 -0700
> Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > + if (q->duplicate) {
> > + bool dup = true;
> > +
> > + if (netem_skb_cb(skb)->duplicate) {
> > + netem_skb_cb(skb)->duplicate = 0;
> > + dup = false;
> > + }
> > + if (dup && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> > + ++count;
> > + }
>
> Doesn't look ideal.
>
> Why do yo need the temporary variable here?
It is all because we need to clear the duplicate bit.
> And you risk having bug where first duplicate sets the flag then second clears it
> and a third layer would do duplicate and reset it.
I am not sure I follow you here. After this patch, we only enqueue the
duplicate skb to the same qdisc and this skb's duplicate bit gets
immediately cleared here. They have no chance to traverse to other qdisc
before clearing this bit. This is actually why it is safe to use
netem_skb_cb() now (instead of skb ext or tc_skb_cb). Or am I missing anything?
If you have a specific setup you suspect this may break, please share it
with me and I am happy to test and integrate into TDC.
Thanks for your review!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-14 2:30 ` William Liu
@ 2025-07-15 18:03 ` Cong Wang
2025-07-15 18:41 ` William Liu
0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2025-07-15 18:03 UTC (permalink / raw)
To: William Liu; +Cc: netdev, jhs, stephen, Savino Dicanosa
On Mon, Jul 14, 2025 at 02:30:26AM +0000, William Liu wrote:
> FWIW, I suggested changing this behavior to not enqueue from the root a while ago too on the security mailing list for the HFSC rsc bug (as the re-entrancy violated assumptions in other qdiscs), but was told some users might be expecting that behavior and we would break their setups.
Thanks for your valuable input.
Instead of arguing on what users expect, I think it is fair to use the
man page as our argreement with user. Please let me know if you have
more reasonable argreement or more reasonable use case for us to justify
updates to the man page.
I have an open mind.
>
> If we really want to preserve the ability to have multiple duplicating netems in a tree, I think Jamal had a good suggestion here to rely on tc_skb_ext extensions [1].
Do you mind to be more specific here? I don't think I am following you
on why tc_skb_ext is better here.
The reason why I changed back to netem_skb_cb is exactly because of the
enqueue beahvior change, which now only allows the skb to be queued to
the same qdisc.
If you have a specific reasonable use case you suspect my patch might
break, please share it with me. It would help me to understand you
better and more importantly to test this patch more comprehensively,
I'd love to add as many selftests as I can.
>
> However, I noted that there are implementation issues that we would have to deal with. Copying what I said there [2]:
>
> "The tc_skb_ext approach has a problem... the config option that enables it is NET_TC_SKB_EXT. I assumed this is a generic name for skb extensions in the tc subsystem, but unfortunately this is hardcoded for NET_CLS_ACT recirculation support.
IMHO, Kconfig is not a problem here, we just need to deal with the
necessary dependency if we really need to use it.
Like I said above, I don't see the problem of using netem_skb_cb after
enqueuing to the same qdisc, this is the only reason why I don't see the
need to changing it to either tc_skb_cb or skb_ext.
Thanks for your review!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior
2025-07-15 18:03 ` Cong Wang
@ 2025-07-15 18:41 ` William Liu
0 siblings, 0 replies; 11+ messages in thread
From: William Liu @ 2025-07-15 18:41 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, jhs, stephen, Savino Dicanosa
On Tuesday, July 15th, 2025 at 6:03 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>
> On Mon, Jul 14, 2025 at 02:30:26AM +0000, William Liu wrote:
>
> > FWIW, I suggested changing this behavior to not enqueue from the root a while ago too on the security mailing list for the HFSC rsc bug (as the re-entrancy violated assumptions in other qdiscs), but was told some users might be expecting that behavior and we would break their setups.
>
>
> Thanks for your valuable input.
>
> Instead of arguing on what users expect, I think it is fair to use the
> man page as our argreement with user. Please let me know if you have
> more reasonable argreement or more reasonable use case for us to justify
> updates to the man page.
>
> I have an open mind.
I don't really have too strong of an opinion here and personally think it makes more sense to enqueue from the current qdisc under duplication. However, if the code has been performing the enqueue at the root from so many years ago..
I will defer to what others say on this.
> > If we really want to preserve the ability to have multiple duplicating netems in a tree, I think Jamal had a good suggestion here to rely on tc_skb_ext extensions [1].
>
>
> Do you mind to be more specific here? I don't think I am following you
> on why tc_skb_ext is better here.
>
> The reason why I changed back to netem_skb_cb is exactly because of the
> enqueue beahvior change, which now only allows the skb to be queued to
> the same qdisc.
>
> If you have a specific reasonable use case you suspect my patch might
> break, please share it with me. It would help me to understand you
> better and more importantly to test this patch more comprehensively,
> I'd love to add as many selftests as I can.
>
> > However, I noted that there are implementation issues that we would have to deal with. Copying what I said there [2]:
> >
> > "The tc_skb_ext approach has a problem... the config option that enables it is NET_TC_SKB_EXT. I assumed this is a generic name for skb extensions in the tc subsystem, but unfortunately this is hardcoded for NET_CLS_ACT recirculation support.
>
>
> IMHO, Kconfig is not a problem here, we just need to deal with the
> necessary dependency if we really need to use it.
>
> Like I said above, I don't see the problem of using netem_skb_cb after
> enqueuing to the same qdisc, this is the only reason why I don't see the
> need to changing it to either tc_skb_cb or skb_ext.
>
> Thanks for your review!
If tc_skb_ext is used, then the original behavior of enqueue from root can be preserved and multiple duplicating netems can now exist in the tree in any hierarchy. No potential user setups or expectations will be broken.
I am not sure how others would feel about propagating this amount of change though just for this one specific use case though (but I am of the opinion that not hardcoding NET_TC_SKB_EXT for one specific use case is a good thing regardless).
Best,
William
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-07-15 18:41 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-13 21:47 [Patch v3 net 0/4] netem: Fix skb duplication logic and prevent infinite loops Cong Wang
2025-07-13 21:47 ` [Patch v3 net 1/4] net_sched: Implement the right netem duplication behavior Cong Wang
2025-07-13 22:01 ` Cong Wang
2025-07-13 22:12 ` Stephen Hemminger
2025-07-15 17:48 ` Cong Wang
2025-07-14 2:30 ` William Liu
2025-07-15 18:03 ` Cong Wang
2025-07-15 18:41 ` William Liu
2025-07-13 21:47 ` [Patch v3 net 2/4] selftests/tc-testing: Add a nested netem duplicate test Cong Wang
2025-07-13 21:47 ` [Patch v3 net 3/4] selftests/tc-testing: Add a test case for piro with netem duplicate Cong Wang
2025-07-13 21:47 ` [Patch v3 net 4/4] selftests/tc-testing: Add a test case for mq " Cong Wang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).