* [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency
@ 2025-10-06 19:30 Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:30 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
In this series, I replace the busylock spinlock we have in
__dev_queue_xmit() with a lockless list (llist) to reduce
spinlock contention to a minimum.
The idea is that at most one cpu spins on the qdisc spinlock,
while the others simply add their skb to the llist.
After this series, we get a 300 % (4x) efficiency improvement on heavy
TX workloads, sending twice the number of packets per second for half
the cpu cycles.
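A minimal user-space sketch of the handoff idea follows; C11 atomics and
a pthread mutex stand in for llist and the qdisc spinlock. It is an
illustration only, the actual kernel changes are in patch 5/5.
/*
 * Sketch of the scheme: every producer pushes its item onto a lock-free
 * list; only the producer that found the list empty takes the heavy lock
 * and drains everything that accumulated in the meantime.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
struct item {
	struct item *next;		/* plays the role of skb->ll_node */
};
struct queue {
	_Atomic(struct item *) head;	/* plays the role of q->defer_list */
	pthread_mutex_t lock;		/* plays the role of the qdisc spinlock */
};
static void enqueue_and_maybe_drain(struct queue *q, struct item *it,
				    void (*process)(struct item *batch))
{
	struct item *first = atomic_load_explicit(&q->head, memory_order_relaxed);
	/* Lock-free push, equivalent to llist_add(). */
	do {
		it->next = first;
	} while (!atomic_compare_exchange_weak_explicit(&q->head, &first, it,
							memory_order_release,
							memory_order_relaxed));
	/* If the list was not empty, the cpu that queued the first item
	 * will drain it for us: no lock is taken at all.
	 */
	if (first)
		return;
	pthread_mutex_lock(&q->lock);
	/* Take ownership of everything queued so far (llist_del_all()). */
	first = atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
	process(first);			/* items come back in reverse push order */
	pthread_mutex_unlock(&q->lock);
}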
Eric Dumazet (5):
net: add add indirect call wrapper in skb_release_head_state()
net/sched: act_mirred: add loop detection
Revert "net/sched: Fix mirred deadlock on device recursion"
net: sched: claim one cache line in Qdisc
net: dev_queue_xmit() llist adoption
include/linux/netdevice_xmit.h | 9 +++-
include/net/sch_generic.h | 23 ++++-----
net/core/dev.c | 91 ++++++++++++++++++----------------
net/core/skbuff.c | 4 +-
net/sched/act_mirred.c | 62 +++++++++--------------
net/sched/sch_generic.c | 7 ---
6 files changed, 93 insertions(+), 103 deletions(-)
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
@ 2025-10-06 19:30 ` Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
` (4 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:30 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
While stress testing UDP senders on a host with expensive indirect
calls, I found cpus processing TX completions were showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.
Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
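For reference, the INDIRECT_CALL_3() use below is roughly equivalent to
the following open-coded sequence: the expected destructors are compared
and called directly, and only an unknown destructor pays for the
retpolined indirect call. This is a simplified illustration, not the
exact macro from include/linux/indirect_call_wrapper.h.
if (skb->destructor == tcp_wfree)
	tcp_wfree(skb);
else if (skb->destructor == __sock_wfree)
	__sock_wfree(skb);
else if (skb->destructor == sock_wfree)
	sock_wfree(skb);
else
	skb->destructor(skb);	/* indirect call, paying the retpoline cost */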
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
skb_dst_drop(skb);
if (skb->destructor) {
DEBUG_NET_WARN_ON_ONCE(in_hardirq());
- skb->destructor(skb);
+ INDIRECT_CALL_3(skb->destructor,
+ tcp_wfree, __sock_wfree, sock_wfree,
+ skb);
}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
nf_conntrack_put(skb_nfct(skb));
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
` (3 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
on device recursion") because it adds code in the fast path, even when
act_mirred is not used.
Use an additional array of device pointers in struct netdev_xmit
and implement loop detection in tcf_mirred_act().
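The essence of the check added to tcf_mirred_act() below, as a
standalone sketch (the helper name is for illustration only):
/* Illustration only: each cpu (or task on PREEMPT_RT) records the chain
 * of target devices it is currently redirecting through. Seeing the same
 * device twice means a loop; recording MIRRED_NEST_LIMIT devices means
 * the chain is too deep. The real patch handles the two cases with
 * slightly different drop/overlimit accounting.
 */
static bool tcf_mirred_would_loop(const struct netdev_xmit *xmit,
				  const struct net_device *dev)
{
	int i;

	if (xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)
		return true;
	for (i = 0; i < xmit->sched_mirred_nest; i++)
		if (xmit->sched_mirred_dev[i] == dev)
			return true;
	return false;
}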
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice_xmit.h | 9 ++++-
net/sched/act_mirred.c | 62 +++++++++++++---------------------
2 files changed, 31 insertions(+), 40 deletions(-)
diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
index 813a19122ebbb2c6a04176330b1055b7c2b9c902..cc232508e695eefe95ea6e55a21978be11d5da83 100644
--- a/include/linux/netdevice_xmit.h
+++ b/include/linux/netdevice_xmit.h
@@ -2,6 +2,12 @@
#ifndef _LINUX_NETDEVICE_XMIT_H
#define _LINUX_NETDEVICE_XMIT_H
+#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
+#define MIRRED_NEST_LIMIT 4
+#endif
+
+struct net_device;
+
struct netdev_xmit {
u16 recursion;
u8 more;
@@ -9,7 +15,8 @@ struct netdev_xmit {
u8 skip_txqueue;
#endif
#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
- u8 sched_mirred_nest;
+ u8 sched_mirred_nest;
+ struct net_device *sched_mirred_dev[MIRRED_NEST_LIMIT];
#endif
#if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
u8 nf_dup_skb_recursion;
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5f01f567c934d3669d9a3058cff861a8fe5f88b6..f27b583def78e4afecc7112854b93d59c2520201 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -29,31 +29,6 @@
static LIST_HEAD(mirred_list);
static DEFINE_SPINLOCK(mirred_list_lock);
-#define MIRRED_NEST_LIMIT 4
-
-#ifndef CONFIG_PREEMPT_RT
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- __this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
-}
-
-#else
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return current->net_xmit.sched_mirred_nest++;
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- current->net_xmit.sched_mirred_nest--;
-}
-#endif
-
static bool tcf_mirred_is_act_redirect(int action)
{
return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
@@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
{
struct tcf_mirred *m = to_mirred(a);
int retval = READ_ONCE(m->tcf_action);
- unsigned int nest_level;
+ struct netdev_xmit *xmit;
bool m_mac_header_xmit;
struct net_device *dev;
- int m_eaction;
+ int i, m_eaction;
u32 blockid;
- nest_level = tcf_mirred_nest_level_inc_return();
- if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
+#ifdef CONFIG_PREEMPT_RT
+ xmit = &current->net_xmit;
+#else
+ xmit = this_cpu_ptr(&softnet_data.xmit);
+#endif
+ if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
netdev_name(skb->dev));
- retval = TC_ACT_SHOT;
- goto dec_nest_level;
+ return TC_ACT_SHOT;
}
tcf_lastuse_update(&m->tcf_tm);
tcf_action_update_bstats(&m->common, skb);
blockid = READ_ONCE(m->tcfm_blockid);
- if (blockid) {
- retval = tcf_blockcast(skb, m, blockid, res, retval);
- goto dec_nest_level;
- }
+ if (blockid)
+ return tcf_blockcast(skb, m, blockid, res, retval);
dev = rcu_dereference_bh(m->tcfm_dev);
if (unlikely(!dev)) {
pr_notice_once("tc mirred: target device is gone\n");
tcf_action_inc_overlimit_qstats(&m->common);
- goto dec_nest_level;
+ return retval;
}
+ for (i = 0; i < xmit->sched_mirred_nest; i++) {
+ if (xmit->sched_mirred_dev[i] != dev)
+ continue;
+ pr_notice_once("tc mirred: loop on device %s\n",
+ netdev_name(dev));
+ tcf_action_inc_overlimit_qstats(&m->common);
+ return retval;
+ }
+
+ xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
m_eaction = READ_ONCE(m->tcfm_eaction);
retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
retval);
-
-dec_nest_level:
- tcf_mirred_nest_level_dec();
+ xmit->sched_mirred_nest--;
return retval;
}
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
A prior patch in this series implemented loop detection
in act_mirred, so we can remove q->owner to save some cycles
in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 1 -
net/core/dev.c | 6 ------
net/sched/sch_generic.c | 2 --
3 files changed, 9 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 738cd5b13c62fb8501619625880b87991f3dc17c..32e9961570b467b6066f1bb2c00ff1a270e342bc 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,7 +117,6 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
- int owner;
unsigned long state;
unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e98ee87776e6f8d3ca3d98f0711b3..0ff178399b2d28ca2754b3f06d69a97f5d6dcf71 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4167,10 +4167,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
- kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
- return NET_XMIT_DROP;
- }
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
@@ -4210,9 +4206,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
- WRITE_ONCE(q->owner, smp_processor_id());
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- WRITE_ONCE(q->owner, -1);
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1e008a228ebdf846d4ef7f83d655ac1142ec3596..dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -679,7 +679,6 @@ struct Qdisc noop_qdisc = {
.qlen = 0,
.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
},
- .owner = -1,
};
EXPORT_SYMBOL(noop_qdisc);
@@ -985,7 +984,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->enqueue = ops->enqueue;
sch->dequeue = ops->dequeue;
sch->dev_queue = dev_queue;
- sch->owner = -1;
netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
refcount_set(&sch->refcnt, 1);
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (2 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
5 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Replace the state2 field with a boolean.
Move it to a hole between qstats and state so that
we shrink struct Qdisc by a full cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 32e9961570b467b6066f1bb2c00ff1a270e342bc..31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,13 +41,6 @@ enum qdisc_state_t {
__QDISC_STATE_DRAINING,
};
-enum qdisc_state2_t {
- /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
- * Use qdisc_run_begin/end() or qdisc_is_running() instead.
- */
- __QDISC_STATE2_RUNNING,
-};
-
#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
@@ -117,8 +110,8 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
+ bool running; /* must be written under qdisc spinlock */
unsigned long state;
- unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
@@ -167,7 +160,7 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK)
return spin_is_locked(&qdisc->seqlock);
- return test_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ return READ_ONCE(qdisc->running);
}
static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
@@ -210,7 +203,10 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
*/
return spin_trylock(&qdisc->seqlock);
}
- return !__test_and_set_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ if (READ_ONCE(qdisc->running))
+ return false;
+ WRITE_ONCE(qdisc->running, true);
+ return true;
}
static inline void qdisc_run_end(struct Qdisc *qdisc)
@@ -228,7 +224,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
&qdisc->state)))
__netif_schedule(qdisc);
} else {
- __clear_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ WRITE_ONCE(qdisc->running, false);
}
}
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (3 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
` (2 more replies)
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
5 siblings, 3 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Remove the busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to a minimum.
The idea is that at most one cpu spins on the qdisc spinlock,
while the others simply add their skb to the llist.
After this patch, we get a 300 % efficiency improvement on heavy TX workloads:
- Sending twice the number of packets per second.
- While consuming 50 % fewer cycles.
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 4 +-
net/core/dev.c | 85 ++++++++++++++++++++++-----------------
net/sched/sch_generic.c | 5 ---
3 files changed, 52 insertions(+), 42 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -115,7 +115,9 @@ struct Qdisc {
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
- spinlock_t busylock ____cacheline_aligned_in_smp;
+ atomic_long_t defer_count ____cacheline_aligned_in_smp;
+ struct llist_head defer_list;
+
spinlock_t seqlock;
struct rcu_head rcu;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
+ struct sk_buff *next, *to_free = NULL;
spinlock_t *root_lock = qdisc_lock(q);
- struct sk_buff *to_free = NULL;
- bool contended;
+ struct llist_node *ll_list, *first_n;
+ unsigned long defer_count = 0;
int rc;
qdisc_calculate_pkt_len(skb, q);
@@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- /*
- * Heuristic to force contended enqueues to serialize on a
- * separate lock before trying to get qdisc main lock.
- * This permits qdisc->running owner to get the lock more
- * often and dequeue packets faster.
- * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
- * and then other tasks will only enqueue packets. The packets will be
- * sent after the qdisc owner is scheduled again. To prevent this
- * scenario the task always serialize on the lock.
+ /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+ * In the try_cmpxchg() loop, we want to increment q->defer_count
+ * at most once to limit the number of skbs in defer_list.
+ * We perform the defer_count increment only if the list is not empty,
+ * because some arches have slow atomic_long_inc_return().
+ */
+ first_n = READ_ONCE(q->defer_list.first);
+ do {
+ if (first_n && !defer_count) {
+ defer_count = atomic_long_inc_return(&q->defer_count);
+ if (unlikely(defer_count > q->limit)) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+ return NET_XMIT_DROP;
+ }
+ }
+ skb->ll_node.next = first_n;
+ } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
+
+ /* If defer_list was not empty, we know the cpu which queued
+ * the first skb will process the whole list for us.
*/
- contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
- if (unlikely(contended))
- spin_lock(&q->busylock);
+ if (first_n)
+ return NET_XMIT_SUCCESS;
spin_lock(root_lock);
+
+ ll_list = llist_del_all(&q->defer_list);
+ /* There is a small race because we clear defer_count not atomically
+ * with the prior llist_del_all(). This means defer_list could grow
+ * over q->limit.
+ */
+ atomic_long_set(&q->defer_count, 0);
+
+ ll_list = llist_reverse_order(ll_list);
+
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
- __qdisc_drop(skb, &to_free);
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+ __qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
- } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
- qdisc_run_begin(q)) {
+ goto unlock;
+ }
+ rc = NET_XMIT_SUCCESS;
+ if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+ !llist_next(ll_list) && qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
* xmit the skb directly.
*/
+ skb = llist_entry(ll_list, struct sk_buff, ll_node);
qdisc_bstats_update(q, skb);
-
- if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
+ if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
__qdisc_run(q);
- }
-
qdisc_run_end(q);
- rc = NET_XMIT_SUCCESS;
} else {
- rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- if (qdisc_run_begin(q)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
- __qdisc_run(q);
- qdisc_run_end(q);
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+ prefetch(next);
+ dev_qdisc_enqueue(skb, q, &to_free, txq);
}
+ qdisc_run(q);
}
+unlock:
spin_unlock(root_lock);
if (unlikely(to_free))
kfree_skb_list_reason(to_free,
tcf_get_drop_reason(to_free));
- if (unlikely(contended))
- spin_unlock(&q->busylock);
return rc;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89..d9a98d02a55fc361a223f3201e37b6a2b698bb5e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -666,7 +666,6 @@ struct Qdisc noop_qdisc = {
.ops = &noop_qdisc_ops,
.q.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
.dev_queue = &noop_netdev_queue,
- .busylock = __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
.gso_skb = {
.next = (struct sk_buff *)&noop_qdisc.gso_skb,
.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
@@ -970,10 +969,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
}
}
- spin_lock_init(&sch->busylock);
- lockdep_set_class(&sch->busylock,
- dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
-
/* seqlock has the same scope of busylock, for NOLOCK qdisc */
spin_lock_init(&sch->seqlock);
lockdep_set_class(&sch->seqlock,
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:38 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions were showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> skb_dst_drop(skb);
> if (skb->destructor) {
> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> - skb->destructor(skb);
> + INDIRECT_CALL_3(skb->destructor,
> + tcp_wfree, __sock_wfree, sock_wfree,
> + skb);
This will probably not compile well with "# CONFIG_INET is not set".
I will fix this when submitting the non-RFC version.
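One possible way to handle it, shown only as an illustration and not
necessarily the final fix, is to keep the direct calls for CONFIG_INET
builds and fall back to the plain indirect call otherwise:
#if IS_ENABLED(CONFIG_INET)
		/* tcp_wfree() only exists with CONFIG_INET */
		INDIRECT_CALL_3(skb->destructor,
				tcp_wfree, __sock_wfree, sock_wfree,
				skb);
#else
		skb->destructor(skb);
#endif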
^ permalink raw reply [flat|nested] 23+ messages in thread
* [syzbot ci] Re: net: optimize TX throughput and efficiency
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (4 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-07 5:23 ` syzbot ci
2025-10-07 7:41 ` Eric Dumazet
5 siblings, 1 reply; 23+ messages in thread
From: syzbot ci @ 2025-10-07 5:23 UTC (permalink / raw)
To: davem, edumazet, eric.dumazet, horms, jhs, jiri, kuba, kuniyu,
netdev, pabeni, willemb, xiyou.wangcong
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] net: optimize TX throughput and efficiency
https://lore.kernel.org/all/20251006193103.2684156-1-edumazet@google.com
* [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
* [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
* [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
* [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
* [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
and found the following issue:
KASAN: slab-use-after-free Read in netem_dequeue
Full report is available here:
https://ci.syzbot.org/series/e8660f67-35a0-406e-96ee-a401d3f30ff9
***
KASAN: slab-use-after-free Read in netem_dequeue
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: f1455695d2d99894b65db233877acac9a0e120b9
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/2db7ec92-610b-4887-bf33-d0b4c04760c8/config
syz repro: https://ci.syzbot.org/findings/3ca47f46-1b94-48b6-bab9-5996b7162c30/syz_repro
==================================================================
BUG: KASAN: slab-use-after-free in netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
Read of size 8 at addr ffff888020b65b30 by task ksoftirqd/1/23
CPU: 1 UID: 0 PID: 23 Comm: ksoftirqd/1 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
dequeue_skb net/sched/sch_generic.c:294 [inline]
qdisc_restart net/sched/sch_generic.c:399 [inline]
__qdisc_run+0x23c/0x15f0 net/sched/sch_generic.c:417
qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
net_tx_action+0x7c9/0x980 net/core/dev.c:5731
handle_softirqs+0x286/0x870 kernel/softirq.c:579
run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 5913:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:330 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:356
kasan_slab_alloc include/linux/kasan.h:250 [inline]
slab_post_alloc_hook mm/slub.c:4191 [inline]
slab_alloc_node mm/slub.c:4240 [inline]
kmem_cache_alloc_node_noprof+0x1bb/0x3c0 mm/slub.c:4292
__alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
alloc_skb include/linux/skbuff.h:1383 [inline]
mld_newpack+0x13c/0xc40 net/ipv6/mcast.c:1775
add_grhead+0x5a/0x2a0 net/ipv6/mcast.c:1886
add_grec+0x1452/0x1740 net/ipv6/mcast.c:2025
mld_send_cr net/ipv6/mcast.c:2148 [inline]
mld_ifc_work+0x6ed/0xd60 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3236 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3319
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3400
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Freed by task 23:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:243 [inline]
__kasan_slab_free+0x5b/0x80 mm/kasan/common.c:275
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2422 [inline]
slab_free mm/slub.c:4695 [inline]
kmem_cache_free+0x18f/0x400 mm/slub.c:4797
br_dev_xmit+0x11b3/0x1840 net/bridge/br_device.c:108
__netdev_start_xmit include/linux/netdevice.h:5248 [inline]
netdev_start_xmit include/linux/netdevice.h:5257 [inline]
xmit_one net/core/dev.c:3845 [inline]
dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3861
sch_direct_xmit+0x241/0x4b0 net/sched/sch_generic.c:344
qdisc_restart net/sched/sch_generic.c:409 [inline]
__qdisc_run+0xb16/0x15f0 net/sched/sch_generic.c:417
qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
net_tx_action+0x7c9/0x980 net/core/dev.c:5731
handle_softirqs+0x286/0x870 kernel/softirq.c:579
run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
The buggy address belongs to the object at ffff888020b65b00
which belongs to the cache skbuff_head_cache of size 240
The buggy address is located 48 bytes inside of
freed 240-byte region [ffff888020b65b00, ffff888020b65bf0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20b64
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
head: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
head: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
head: 00fff00000000001 ffffea000082d901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 1, migratetype Unmovable, gfp_mask 0x72820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_MEMALLOC|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 96149950019, free_ts 42914869228
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
prep_new_page mm/page_alloc.c:1859 [inline]
get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
alloc_slab_page mm/slub.c:2492 [inline]
allocate_slab+0x8a/0x370 mm/slub.c:2660
new_slab mm/slub.c:2714 [inline]
___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
__slab_alloc mm/slub.c:3992 [inline]
__slab_alloc_node mm/slub.c:4067 [inline]
slab_alloc_node mm/slub.c:4228 [inline]
kmem_cache_alloc_node_noprof+0x280/0x3c0 mm/slub.c:4292
__alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
__netdev_alloc_skb+0x108/0x970 net/core/skbuff.c:734
netdev_alloc_skb include/linux/skbuff.h:3484 [inline]
dev_alloc_skb include/linux/skbuff.h:3497 [inline]
__ieee80211_beacon_get+0xc06/0x1880 net/mac80211/tx.c:5652
ieee80211_beacon_get_tim+0xb4/0x2b0 net/mac80211/tx.c:5774
ieee80211_beacon_get include/net/mac80211.h:5667 [inline]
mac80211_hwsim_beacon_tx+0x3ce/0x860 drivers/net/wireless/virtual/mac80211_hwsim.c:2355
__iterate_interfaces+0x2ab/0x590 net/mac80211/util.c:761
ieee80211_iterate_active_interfaces_atomic+0xdb/0x180 net/mac80211/util.c:797
mac80211_hwsim_beacon+0xbb/0x1c0 drivers/net/wireless/virtual/mac80211_hwsim.c:2389
__run_hrtimer kernel/time/hrtimer.c:1761 [inline]
__hrtimer_run_queues+0x52c/0xc60 kernel/time/hrtimer.c:1825
page last free pid 0 tgid 0 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1395 [inline]
__free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895
rcu_do_batch kernel/rcu/tree.c:2605 [inline]
rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
handle_softirqs+0x286/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
Memory state around the buggy address:
ffff888020b65a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff888020b65a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888020b65b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888020b65b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff888020b65c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net: optimize TX throughput and efficiency
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
@ 2025-10-07 7:41 ` Eric Dumazet
0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-07 7:41 UTC (permalink / raw)
To: syzbot ci
Cc: davem, eric.dumazet, horms, jhs, jiri, kuba, kuniyu, netdev,
pabeni, willemb, xiyou.wangcong, syzbot, syzkaller-bugs
On Mon, Oct 6, 2025 at 10:24 PM syzbot ci
<syzbot+ciad44046e74230deb@syzkaller.appspotmail.com> wrote:
>
> syzbot ci has tested the following series
>
> [v1] net: optimize TX throughput and efficiency
> https://lore.kernel.org/all/20251006193103.2684156-1-edumazet@google.com
> * [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
> * [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
> * [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
> * [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
> * [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
>
> and found the following issue:
> KASAN: slab-use-after-free Read in netem_dequeue
>
> Full report is available here:
> https://ci.syzbot.org/series/e8660f67-35a0-406e-96ee-a401d3f30ff9
>
I was unsure if I needed to clear skb->next or not before calling
dev_qdisc_enqueue().
We could either do this in netem, or generically in dev_qdisc_enqueue():
diff --git a/net/core/dev.c b/net/core/dev.c
index 6094768bf3c028f0ad1e52b9b12b7258fa0ecff6..547efbfb63adb4a093ce4b4ea0934256c15e263b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4115,6 +4115,7 @@ static int dev_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *q,
{
int rc;
+ skb_mark_not_on_list(skb);
rc = q->enqueue(skb, q, to_free) & NET_XMIT_MASK;
if (rc == NET_XMIT_SUCCESS)
trace_qdisc_enqueue(q, txq, skb);
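For context: skb->ll_node lives in the same union as skb->next, so after
the defer_list walk skb->next can still hold a stale llist pointer that a
qdisc like netem must not interpret. The helper used above just clears
it; in current kernels it is simply:
/* include/linux/skbuff.h */
static inline void skb_mark_not_on_list(struct sk_buff *skb)
{
	skb->next = NULL;
}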
> ***
>
> KASAN: slab-use-after-free Read in netem_dequeue
>
> tree: net-next
> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
> base: f1455695d2d99894b65db233877acac9a0e120b9
> arch: amd64
> compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
> config: https://ci.syzbot.org/builds/2db7ec92-610b-4887-bf33-d0b4c04760c8/config
> syz repro: https://ci.syzbot.org/findings/3ca47f46-1b94-48b6-bab9-5996b7162c30/syz_repro
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
> Read of size 8 at addr ffff888020b65b30 by task ksoftirqd/1/23
>
> CPU: 1 UID: 0 PID: 23 Comm: ksoftirqd/1 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
> print_address_description mm/kasan/report.c:378 [inline]
> print_report+0xca/0x240 mm/kasan/report.c:482
> kasan_report+0x118/0x150 mm/kasan/report.c:595
> netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
> dequeue_skb net/sched/sch_generic.c:294 [inline]
> qdisc_restart net/sched/sch_generic.c:399 [inline]
> __qdisc_run+0x23c/0x15f0 net/sched/sch_generic.c:417
> qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
> net_tx_action+0x7c9/0x980 net/core/dev.c:5731
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
> smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
>
> Allocated by task 5913:
> kasan_save_stack mm/kasan/common.c:47 [inline]
> kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
> unpoison_slab_object mm/kasan/common.c:330 [inline]
> __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:356
> kasan_slab_alloc include/linux/kasan.h:250 [inline]
> slab_post_alloc_hook mm/slub.c:4191 [inline]
> slab_alloc_node mm/slub.c:4240 [inline]
> kmem_cache_alloc_node_noprof+0x1bb/0x3c0 mm/slub.c:4292
> __alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
> alloc_skb include/linux/skbuff.h:1383 [inline]
> mld_newpack+0x13c/0xc40 net/ipv6/mcast.c:1775
> add_grhead+0x5a/0x2a0 net/ipv6/mcast.c:1886
> add_grec+0x1452/0x1740 net/ipv6/mcast.c:2025
> mld_send_cr net/ipv6/mcast.c:2148 [inline]
> mld_ifc_work+0x6ed/0xd60 net/ipv6/mcast.c:2693
> process_one_work kernel/workqueue.c:3236 [inline]
> process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3319
> worker_thread+0x8a0/0xda0 kernel/workqueue.c:3400
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>
> Freed by task 23:
> kasan_save_stack mm/kasan/common.c:47 [inline]
> kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
> kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
> poison_slab_object mm/kasan/common.c:243 [inline]
> __kasan_slab_free+0x5b/0x80 mm/kasan/common.c:275
> kasan_slab_free include/linux/kasan.h:233 [inline]
> slab_free_hook mm/slub.c:2422 [inline]
> slab_free mm/slub.c:4695 [inline]
> kmem_cache_free+0x18f/0x400 mm/slub.c:4797
> br_dev_xmit+0x11b3/0x1840 net/bridge/br_device.c:108
> __netdev_start_xmit include/linux/netdevice.h:5248 [inline]
> netdev_start_xmit include/linux/netdevice.h:5257 [inline]
> xmit_one net/core/dev.c:3845 [inline]
> dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3861
> sch_direct_xmit+0x241/0x4b0 net/sched/sch_generic.c:344
> qdisc_restart net/sched/sch_generic.c:409 [inline]
> __qdisc_run+0xb16/0x15f0 net/sched/sch_generic.c:417
> qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
> net_tx_action+0x7c9/0x980 net/core/dev.c:5731
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
> smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>
> The buggy address belongs to the object at ffff888020b65b00
> which belongs to the cache skbuff_head_cache of size 240
> The buggy address is located 48 bytes inside of
> freed 240-byte region [ffff888020b65b00, ffff888020b65bf0)
>
> The buggy address belongs to the physical page:
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20b64
> head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
> page_type: f5(slab)
> raw: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
> head: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
> head: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
> head: 00fff00000000001 ffffea000082d901 00000000ffffffff 00000000ffffffff
> head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 1, migratetype Unmovable, gfp_mask 0x72820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_MEMALLOC|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 96149950019, free_ts 42914869228
> set_page_owner include/linux/page_owner.h:32 [inline]
> post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
> prep_new_page mm/page_alloc.c:1859 [inline]
> get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
> __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
> alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
> alloc_slab_page mm/slub.c:2492 [inline]
> allocate_slab+0x8a/0x370 mm/slub.c:2660
> new_slab mm/slub.c:2714 [inline]
> ___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
> __slab_alloc mm/slub.c:3992 [inline]
> __slab_alloc_node mm/slub.c:4067 [inline]
> slab_alloc_node mm/slub.c:4228 [inline]
> kmem_cache_alloc_node_noprof+0x280/0x3c0 mm/slub.c:4292
> __alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
> __netdev_alloc_skb+0x108/0x970 net/core/skbuff.c:734
> netdev_alloc_skb include/linux/skbuff.h:3484 [inline]
> dev_alloc_skb include/linux/skbuff.h:3497 [inline]
> __ieee80211_beacon_get+0xc06/0x1880 net/mac80211/tx.c:5652
> ieee80211_beacon_get_tim+0xb4/0x2b0 net/mac80211/tx.c:5774
> ieee80211_beacon_get include/net/mac80211.h:5667 [inline]
> mac80211_hwsim_beacon_tx+0x3ce/0x860 drivers/net/wireless/virtual/mac80211_hwsim.c:2355
> __iterate_interfaces+0x2ab/0x590 net/mac80211/util.c:761
> ieee80211_iterate_active_interfaces_atomic+0xdb/0x180 net/mac80211/util.c:797
> mac80211_hwsim_beacon+0xbb/0x1c0 drivers/net/wireless/virtual/mac80211_hwsim.c:2389
> __run_hrtimer kernel/time/hrtimer.c:1761 [inline]
> __hrtimer_run_queues+0x52c/0xc60 kernel/time/hrtimer.c:1825
> page last free pid 0 tgid 0 stack trace:
> reset_page_owner include/linux/page_owner.h:25 [inline]
> free_pages_prepare mm/page_alloc.c:1395 [inline]
> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895
> rcu_do_batch kernel/rcu/tree.c:2605 [inline]
> rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> __do_softirq kernel/softirq.c:613 [inline]
> invoke_softirq kernel/softirq.c:453 [inline]
> __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
> irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
> instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
> sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
> asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
>
> Memory state around the buggy address:
> ffff888020b65a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
> ffff888020b65a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >ffff888020b65b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff888020b65b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
> ffff888020b65c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
>
> ***
>
> If these findings have caused you to resend the series or submit a
> separate fix, please add the following tag to your commit message:
> Tested-by: syzbot@syzkaller.appspotmail.com
>
> ---
> This report is generated by a bot. It may contain errors.
> syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-07 11:37 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to a minimum.
>
> The idea is that at most one cpu spins on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this patch, we get a 300 % efficiency improvement on heavy TX workloads:
> - Sending twice the number of packets per second.
> - While consuming 50 % fewer cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
>
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> qdisc_bstats_update(q, skb);
> -
> - if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> + if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
> - }
> -
> qdisc_run_end(q);
> - rc = NET_XMIT_SUCCESS;
> } else {
> - rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> - if (qdisc_run_begin(q)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> - __qdisc_run(q);
> - qdisc_run_end(q);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> + prefetch(next);
> + dev_qdisc_enqueue(skb, q, &to_free, txq);
Now would be a good time to add batch support to some qdisc->enqueue()
handlers where this would help.
For instance, fq_enqueue() could take a single ktime_get() sample for the
whole batch.
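Purely as an illustration of that idea (the hook and the helper below do
not exist today and are made up for the sketch), a batched fq enqueue
could hoist the clock read like this:
/* Hypothetical sketch only: receiving the whole defer_list batch at once
 * would allow a single ktime_get() for all skbs instead of one per skb.
 */
static int fq_enqueue_batch(struct llist_node *batch, struct Qdisc *sch,
			    struct sk_buff **to_free)
{
	ktime_t now = ktime_get();	/* sampled once for the whole batch */
	struct sk_buff *skb, *next;
	int rc = NET_XMIT_SUCCESS;

	llist_for_each_entry_safe(skb, next, batch, ll_node)
		rc = fq_enqueue_one(skb, sch, to_free, now); /* made-up helper */

	return rc;
}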
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
@ 2025-10-07 15:26 ` Alexander Lobakin
2025-10-07 19:41 ` Maciej Fijalkowski
1 sibling, 1 reply; 23+ messages in thread
From: Alexander Lobakin @ 2025-10-07 15:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Maciej Fijalkowski, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 6 Oct 2025 19:30:59 +0000
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions were showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> skb_dst_drop(skb);
> if (skb->destructor) {
> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> - skb->destructor(skb);
> + INDIRECT_CALL_3(skb->destructor,
> + tcp_wfree, __sock_wfree, sock_wfree,
> + skb);
Not sure, but maybe we could add generic XSk skb destructor here as
well? Or it's not that important as generic XSk is not the best way to
use XDP sockets?
Maciej, what do you think?
> }
> #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> nf_conntrack_put(skb_nfct(skb));
Thanks,
Olek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-07 15:26 ` Alexander Lobakin
@ 2025-10-07 19:41 ` Maciej Fijalkowski
2025-10-09 8:37 ` Jason Xing
0 siblings, 1 reply; 23+ messages in thread
From: Maciej Fijalkowski @ 2025-10-07 19:41 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Tue, Oct 07, 2025 at 05:26:46PM +0200, Alexander Lobakin wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 6 Oct 2025 19:30:59 +0000
>
> > While stress testing UDP senders on a host with expensive indirect
> > calls, I found cpus processing TX completions were showing
> > a very high cost (20%) in sock_wfree() due to
> > CONFIG_MITIGATION_RETPOLINE=y.
> >
> > Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> > skb_dst_drop(skb);
> > if (skb->destructor) {
> > DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > - skb->destructor(skb);
> > + INDIRECT_CALL_3(skb->destructor,
> > + tcp_wfree, __sock_wfree, sock_wfree,
> > + skb);
>
> Not sure, but maybe we could add generic XSk skb destructor here as
> well? Or it's not that important as generic XSk is not the best way to
> use XDP sockets?
>
> Maciej, what do you think?
I would appreciate it, as there have been various attempts to optimize the xsk
generic xmit path.
>
> > }
> > #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> > nf_conntrack_put(skb_nfct(skb));
>
> Thanks,
> Olek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
@ 2025-10-08 6:37 ` Paolo Abeni
2025-10-08 7:32 ` Eric Dumazet
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-08 6:37 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
Hi,
On 10/6/25 9:31 PM, Eric Dumazet wrote:
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -115,7 +115,9 @@ struct Qdisc {
> struct Qdisc *next_sched;
> struct sk_buff_head skb_bad_txq;
>
> - spinlock_t busylock ____cacheline_aligned_in_smp;
> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> + struct llist_head defer_list;
> +
Dumb question: I guess the above brings back Qdisc to the original size
? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
commit message.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 6:37 ` Paolo Abeni
@ 2025-10-08 7:32 ` Eric Dumazet
2025-10-08 7:44 ` Paolo Abeni
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 7:32 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Tue, Oct 7, 2025 at 11:37 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On 10/6/25 9:31 PM, Eric Dumazet wrote:
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
>
> Dumb question: I guess the above brings back Qdisc to the original size
> ? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
> commit message.
Hmm, I do not think this changes the size.
The cache line that was starting at busylock had holes.
You can see that even adding the long and the pointer, we still have room in it.
/* --- cacheline 3 boundary (192 bytes) --- */
struct gnet_stats_queue qstats; /* 0xc0 0x14 */
bool running; /* 0xd4 0x1 */
/* XXX 3 bytes hole, try to pack */
unsigned long state; /* 0xd8 0x8 */
struct Qdisc * next_sched; /* 0xe0 0x8 */
struct sk_buff_head skb_bad_txq; /* 0xe8 0x18 */
/* --- cacheline 4 boundary (256 bytes) --- */
atomic_long_t defer_count __attribute__((__aligned__(64))); /* 0x100 0x8 */
struct llist_head defer_list; /* 0x108 0x8 */
spinlock_t seqlock; /* 0x110 0x4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head rcu __attribute__((__aligned__(8))); /* 0x118 0x10 */
netdevice_tracker dev_tracker; /* 0x128 0x8 */
struct lock_class_key root_lock_key; /* 0x130 0 */
/* XXX 16 bytes hole, try to pack */
/* --- cacheline 5 boundary (320 bytes) --- */
long privdata[] __attribute__((__aligned__(64))); /* 0x140 0 */
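(For reference, a layout dump like the one above can be regenerated with
"pahole -C Qdisc vmlinux" on a kernel built with debug info; exact offsets
depend on the config.)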
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 7:32 ` Eric Dumazet
@ 2025-10-08 7:44 ` Paolo Abeni
0 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-08 7:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On 10/8/25 9:32 AM, Eric Dumazet wrote:
> On Tue, Oct 7, 2025 at 11:37 PM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> Hi,
>>
>> On 10/6/25 9:31 PM, Eric Dumazet wrote:
>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>>> --- a/include/net/sch_generic.h
>>> +++ b/include/net/sch_generic.h
>>> @@ -115,7 +115,9 @@ struct Qdisc {
>>> struct Qdisc *next_sched;
>>> struct sk_buff_head skb_bad_txq;
>>>
>>> - spinlock_t busylock ____cacheline_aligned_in_smp;
>>> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>>> + struct llist_head defer_list;
>>> +
>>
>> Dumb question: I guess the above brings back Qdisc to the original size
>> ? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
>> commit message.
>
>
> Hmm, I do not think this changes the size.
> The cache line that was starting at busylock had holes.
> You can see that even adding the long and the pointer, we still have room in it.
Ah, nice! I did not take into account the ____cacheline_aligned_in_smp on
a later field (privdata).
Thanks for the clarifying pahole output!
Paolo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
@ 2025-10-08 8:48 ` Toke Høiland-Jørgensen
2025-10-08 9:32 ` Eric Dumazet
2 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 8:48 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Remove busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> After this patch, we get a 300 % improvement on heavy TX workloads.
> - Sending twice the number of packets per second.
> - While consuming 50 % less cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512) 21 | |
> [512, 1K) 631 | |
> [1K, 2K) 27328 |@ |
> [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 19055 |@ |
> [64K, 128K) 17240 |@ |
> [128K, 256K) 25633 |@ |
> [256K, 512K) 4 | |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
> 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K) 483936 |@@@@@@@@@@ |
> [1K, 2K) 265345 |@@@@@@ |
> [2K, 4K) 145463 |@@@ |
> [4K, 8K) 54571 |@ |
> [8K, 16K) 10270 | |
> [16K, 32K) 9385 | |
> [32K, 64K) 7749 | |
> [64K, 128K) 26799 | |
> [128K, 256K) 2665 | |
> [256K, 512K) 665 | |
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
This is very cool! One question below, just to make sure I understand
this correctly:
> ---
> include/net/sch_generic.h | 4 +-
> net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> net/sched/sch_generic.c | 5 ---
> 3 files changed, 52 insertions(+), 42 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -115,7 +115,9 @@ struct Qdisc {
> struct Qdisc *next_sched;
> struct sk_buff_head skb_bad_txq;
>
> - spinlock_t busylock ____cacheline_aligned_in_smp;
> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> + struct llist_head defer_list;
> +
> spinlock_t seqlock;
>
> struct rcu_head rcu;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> struct net_device *dev,
> struct netdev_queue *txq)
> {
> + struct sk_buff *next, *to_free = NULL;
> spinlock_t *root_lock = qdisc_lock(q);
> - struct sk_buff *to_free = NULL;
> - bool contended;
> + struct llist_node *ll_list, *first_n;
> + unsigned long defer_count = 0;
> int rc;
>
> qdisc_calculate_pkt_len(skb, q);
> @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> return rc;
> }
>
> - /*
> - * Heuristic to force contended enqueues to serialize on a
> - * separate lock before trying to get qdisc main lock.
> - * This permits qdisc->running owner to get the lock more
> - * often and dequeue packets faster.
> - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> - * and then other tasks will only enqueue packets. The packets will be
> - * sent after the qdisc owner is scheduled again. To prevent this
> - * scenario the task always serialize on the lock.
> + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> + * In the try_cmpxchg() loop, we want to increment q->defer_count
> + * at most once to limit the number of skbs in defer_list.
> + * We perform the defer_count increment only if the list is not empty,
> + * because some arches have slow atomic_long_inc_return().
> + */
> + first_n = READ_ONCE(q->defer_list.first);
> + do {
> + if (first_n && !defer_count) {
> + defer_count = atomic_long_inc_return(&q->defer_count);
> + if (unlikely(defer_count > q->limit)) {
> + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> + return NET_XMIT_DROP;
> + }
> + }
> + skb->ll_node.next = first_n;
> + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> +
> + /* If defer_list was not empty, we know the cpu which queued
> + * the first skb will process the whole list for us.
> */
> - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> - if (unlikely(contended))
> - spin_lock(&q->busylock);
> + if (first_n)
> + return NET_XMIT_SUCCESS;
>
> spin_lock(root_lock);
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
I was puzzled by this^. But AFAICT, the idea is that in the
non-contended path we're in here (no other CPU enqueueing packets), we
will still have added the skb to the llist before taking the root_lock,
and here we pull it back off the list, right?
Even though this is technically not needed (we'll always get the same
skb back, I think?), so this is mostly for consistency(?)
Assuming I got this right:
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 8:48 ` Toke Høiland-Jørgensen
@ 2025-10-08 9:32 ` Eric Dumazet
2025-10-08 10:05 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 9:32 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > Remove busylock spinlock and use a lockless list (llist)
> > to reduce spinlock contention to the minimum.
> >
> > Idea is that only one cpu might spin on the qdisc spinlock,
> > while others simply add their skb in the llist.
> >
> > After this patch, we get a 300 % improvement on heavy TX workloads.
> > - Sending twice the number of packets per second.
> > - While consuming 50 % less cycles.
> >
> > Tested:
> >
> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >
> > Before:
> >
> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >
> > Lock contention (1 second sample taken on 8 cores)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >
> > If netperf threads are pinned, spinlock stress is very high.
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >
> > @__dev_queue_xmit_ns:
> > [256, 512) 21 | |
> > [512, 1K) 631 | |
> > [1K, 2K) 27328 |@ |
> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32K, 64K) 19055 |@ |
> > [64K, 128K) 17240 |@ |
> > [128K, 256K) 25633 |@ |
> > [256K, 512K) 4 | |
> >
> > After:
> >
> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >
> > If netperf threads are pinned (~54 Mpps)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >
> > @__dev_queue_xmit_ns:
> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [512, 1K) 483936 |@@@@@@@@@@ |
> > [1K, 2K) 265345 |@@@@@@ |
> > [2K, 4K) 145463 |@@@ |
> > [4K, 8K) 54571 |@ |
> > [8K, 16K) 10270 | |
> > [16K, 32K) 9385 | |
> > [32K, 64K) 7749 | |
> > [64K, 128K) 26799 | |
> > [128K, 256K) 2665 | |
> > [256K, 512K) 665 | |
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> This is very cool! One question below, just to make sure I understand
> this correctly:
>
> > ---
> > include/net/sch_generic.h | 4 +-
> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> > net/sched/sch_generic.c | 5 ---
> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
> > spinlock_t seqlock;
> >
> > struct rcu_head rcu;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > struct net_device *dev,
> > struct netdev_queue *txq)
> > {
> > + struct sk_buff *next, *to_free = NULL;
> > spinlock_t *root_lock = qdisc_lock(q);
> > - struct sk_buff *to_free = NULL;
> > - bool contended;
> > + struct llist_node *ll_list, *first_n;
> > + unsigned long defer_count = 0;
> > int rc;
> >
> > qdisc_calculate_pkt_len(skb, q);
> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > return rc;
> > }
> >
> > - /*
> > - * Heuristic to force contended enqueues to serialize on a
> > - * separate lock before trying to get qdisc main lock.
> > - * This permits qdisc->running owner to get the lock more
> > - * often and dequeue packets faster.
> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> > - * and then other tasks will only enqueue packets. The packets will be
> > - * sent after the qdisc owner is scheduled again. To prevent this
> > - * scenario the task always serialize on the lock.
> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> > + * at most once to limit the number of skbs in defer_list.
> > + * We perform the defer_count increment only if the list is not empty,
> > + * because some arches have slow atomic_long_inc_return().
> > + */
> > + first_n = READ_ONCE(q->defer_list.first);
> > + do {
> > + if (first_n && !defer_count) {
> > + defer_count = atomic_long_inc_return(&q->defer_count);
> > + if (unlikely(defer_count > q->limit)) {
> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> > + return NET_XMIT_DROP;
> > + }
> > + }
> > + skb->ll_node.next = first_n;
> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> > +
> > + /* If defer_list was not empty, we know the cpu which queued
> > + * the first skb will process the whole list for us.
> > */
> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > - if (unlikely(contended))
> > - spin_lock(&q->busylock);
> > + if (first_n)
> > + return NET_XMIT_SUCCESS;
> >
> > spin_lock(root_lock);
> > +
> > + ll_list = llist_del_all(&q->defer_list);
> > + /* There is a small race because we clear defer_count not atomically
> > + * with the prior llist_del_all(). This means defer_list could grow
> > + * over q->limit.
> > + */
> > + atomic_long_set(&q->defer_count, 0);
> > +
> > + ll_list = llist_reverse_order(ll_list);
> > +
> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > - __qdisc_drop(skb, &to_free);
> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> > + __qdisc_drop(skb, &to_free);
> > rc = NET_XMIT_DROP;
> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > - qdisc_run_begin(q)) {
> > + goto unlock;
> > + }
> > + rc = NET_XMIT_SUCCESS;
> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> > /*
> > * This is a work-conserving queue; there are no old skbs
> > * waiting to be sent out; and the qdisc is not running -
> > * xmit the skb directly.
> > */
> >
> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>
> I was puzzled by this^. But AFAICT, the idea is that in the
> non-contended path we're in here (no other CPU enqueueing packets), we
> will still have added the skb to the llist before taking the root_lock,
> and here we pull it back off the list, right?
>
> Even though this is technically not needed (we'll always get the same
> skb back, I think?), so this is mostly for consistency(?)
Exactly. I guess we could instead add an assert like
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
		 * xmit the skb directly.
		 */

-		skb = llist_entry(ll_list, struct sk_buff, ll_node);
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
		qdisc_bstats_update(q, skb);
		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
			__qdisc_run(q);
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 9:32 ` Eric Dumazet
@ 2025-10-08 10:05 ` Toke Høiland-Jørgensen
2025-10-08 10:46 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 10:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > Remove busylock spinlock and use a lockless list (llist)
>> > to reduce spinlock contention to the minimum.
>> >
>> > Idea is that only one cpu might spin on the qdisc spinlock,
>> > while others simply add their skb in the llist.
>> >
>> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> > - Sending twice the number of packets per second.
>> > - While consuming 50 % less cycles.
>> >
>> > Tested:
>> >
>> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >
>> > Before:
>> >
>> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>> >
>> > Lock contention (1 second sample taken on 8 cores)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
>> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
>> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
>> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>> >
>> > If netperf threads are pinned, spinlock stress is very high.
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
>> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
>> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>> >
>> > @__dev_queue_xmit_ns:
>> > [256, 512) 21 | |
>> > [512, 1K) 631 | |
>> > [1K, 2K) 27328 |@ |
>> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
>> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [32K, 64K) 19055 |@ |
>> > [64K, 128K) 17240 |@ |
>> > [128K, 256K) 25633 |@ |
>> > [256K, 512K) 4 | |
>> >
>> > After:
>> >
>> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
>> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
>> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
>> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
>> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>> >
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
>> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
>> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
>> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
>> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
>> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>> >
>> > If netperf threads are pinned (~54 Mpps)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
>> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
>> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
>> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
>> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
>> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
>> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
>> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>> >
>> > @__dev_queue_xmit_ns:
>> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [512, 1K) 483936 |@@@@@@@@@@ |
>> > [1K, 2K) 265345 |@@@@@@ |
>> > [2K, 4K) 145463 |@@@ |
>> > [4K, 8K) 54571 |@ |
>> > [8K, 16K) 10270 | |
>> > [16K, 32K) 9385 | |
>> > [32K, 64K) 7749 | |
>> > [64K, 128K) 26799 | |
>> > [128K, 256K) 2665 | |
>> > [256K, 512K) 665 | |
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> This is very cool! One question below, just to make sure I understand
>> this correctly:
>>
>> > ---
>> > include/net/sch_generic.h | 4 +-
>> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
>> > net/sched/sch_generic.c | 5 ---
>> > 3 files changed, 52 insertions(+), 42 deletions(-)
>> >
>> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> > --- a/include/net/sch_generic.h
>> > +++ b/include/net/sch_generic.h
>> > @@ -115,7 +115,9 @@ struct Qdisc {
>> > struct Qdisc *next_sched;
>> > struct sk_buff_head skb_bad_txq;
>> >
>> > - spinlock_t busylock ____cacheline_aligned_in_smp;
>> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>> > + struct llist_head defer_list;
>> > +
>> > spinlock_t seqlock;
>> >
>> > struct rcu_head rcu;
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > struct net_device *dev,
>> > struct netdev_queue *txq)
>> > {
>> > + struct sk_buff *next, *to_free = NULL;
>> > spinlock_t *root_lock = qdisc_lock(q);
>> > - struct sk_buff *to_free = NULL;
>> > - bool contended;
>> > + struct llist_node *ll_list, *first_n;
>> > + unsigned long defer_count = 0;
>> > int rc;
>> >
>> > qdisc_calculate_pkt_len(skb, q);
>> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > return rc;
>> > }
>> >
>> > - /*
>> > - * Heuristic to force contended enqueues to serialize on a
>> > - * separate lock before trying to get qdisc main lock.
>> > - * This permits qdisc->running owner to get the lock more
>> > - * often and dequeue packets faster.
>> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> > - * and then other tasks will only enqueue packets. The packets will be
>> > - * sent after the qdisc owner is scheduled again. To prevent this
>> > - * scenario the task always serialize on the lock.
>> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
>> > + * at most once to limit the number of skbs in defer_list.
>> > + * We perform the defer_count increment only if the list is not empty,
>> > + * because some arches have slow atomic_long_inc_return().
>> > + */
>> > + first_n = READ_ONCE(q->defer_list.first);
>> > + do {
>> > + if (first_n && !defer_count) {
>> > + defer_count = atomic_long_inc_return(&q->defer_count);
>> > + if (unlikely(defer_count > q->limit)) {
>> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> > + return NET_XMIT_DROP;
>> > + }
>> > + }
>> > + skb->ll_node.next = first_n;
>> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> > +
>> > + /* If defer_list was not empty, we know the cpu which queued
>> > + * the first skb will process the whole list for us.
>> > */
>> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> > - if (unlikely(contended))
>> > - spin_lock(&q->busylock);
>> > + if (first_n)
>> > + return NET_XMIT_SUCCESS;
>> >
>> > spin_lock(root_lock);
>> > +
>> > + ll_list = llist_del_all(&q->defer_list);
>> > + /* There is a small race because we clear defer_count not atomically
>> > + * with the prior llist_del_all(). This means defer_list could grow
>> > + * over q->limit.
>> > + */
>> > + atomic_long_set(&q->defer_count, 0);
>> > +
>> > + ll_list = llist_reverse_order(ll_list);
>> > +
>> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> > - __qdisc_drop(skb, &to_free);
>> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> > + __qdisc_drop(skb, &to_free);
>> > rc = NET_XMIT_DROP;
>> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > - qdisc_run_begin(q)) {
>> > + goto unlock;
>> > + }
>> > + rc = NET_XMIT_SUCCESS;
>> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
>> > /*
>> > * This is a work-conserving queue; there are no old skbs
>> > * waiting to be sent out; and the qdisc is not running -
>> > * xmit the skb directly.
>> > */
>> >
>> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>>
>> I was puzzled by this^. But AFAICT, the idea is that in the
>> non-contended path we're in here (no other CPU enqueueing packets), we
>> will still have added the skb to the llist before taking the root_lock,
>> and here we pull it back off the list, right?
>>
>> Even though this is technically not needed (we'll always get the same
>> skb back, I think?), so this is mostly for consistency(?)
>
> Exactly. I guess we could instead add an assert like
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
> * xmit the skb directly.
> */
>
> - skb = llist_entry(ll_list, struct sk_buff, ll_node);
> + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
> struct sk_buff, ll_node));
> qdisc_bstats_update(q, skb);
> if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
OK, so thinking about this some more, isn't there a race between the
if (first_n)
return NET_XMIT_SUCCESS;
and taking the lock? I.e., two different CPUs can pass that check in
which case one of them will end up spinning on the lock, and by the time
it acquires it, there is no longer any guarantee that the skb on the
llist will be the same one that we started with? Or indeed that there
will be any skbs on the list at all?
In which case, shouldn't there be a:
if (unlikely(llist_empty(ll_list)))
goto unlock;
after the llist_del_all() assignment inside the lock? (and we should
retain the skb = llist_entry()) assignment I was confused about).
Or am I missing some reason this can't happen?
-Toke
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 10:05 ` Toke Høiland-Jørgensen
@ 2025-10-08 10:46 ` Eric Dumazet
2025-10-08 12:11 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Eric Dumazet <edumazet@google.com> writes:
> >>
> >> > Remove busylock spinlock and use a lockless list (llist)
> >> > to reduce spinlock contention to the minimum.
> >> >
> >> > Idea is that only one cpu might spin on the qdisc spinlock,
> >> > while others simply add their skb in the llist.
> >> >
> >> > After this patch, we get a 300 % improvement on heavy TX workloads.
> >> > - Sending twice the number of packets per second.
> >> > - While consuming 50 % less cycles.
> >> >
> >> > Tested:
> >> >
> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >> >
> >> > Before:
> >> >
> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >> >
> >> > vmstat 2 5
> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> >> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >> >
> >> > Lock contention (1 second sample taken on 8 cores)
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> >> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> >> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> >> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >> >
> >> > If netperf threads are pinned, spinlock stress is very high.
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> >> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> >> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >> >
> >> > @__dev_queue_xmit_ns:
> >> > [256, 512) 21 | |
> >> > [512, 1K) 631 | |
> >> > [1K, 2K) 27328 |@ |
> >> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> >> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [32K, 64K) 19055 |@ |
> >> > [64K, 128K) 17240 |@ |
> >> > [128K, 256K) 25633 |@ |
> >> > [256K, 512K) 4 | |
> >> >
> >> > After:
> >> >
> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >> >
> >> > vmstat 2 5
> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> >> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> >> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> >> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> >> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> >> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >> >
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> >> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> >> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> >> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> >> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> >> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >> >
> >> > If netperf threads are pinned (~54 Mpps)
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> >> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> >> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> >> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> >> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> >> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> >> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> >> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >> >
> >> > @__dev_queue_xmit_ns:
> >> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> > [512, 1K) 483936 |@@@@@@@@@@ |
> >> > [1K, 2K) 265345 |@@@@@@ |
> >> > [2K, 4K) 145463 |@@@ |
> >> > [4K, 8K) 54571 |@ |
> >> > [8K, 16K) 10270 | |
> >> > [16K, 32K) 9385 | |
> >> > [32K, 64K) 7749 | |
> >> > [64K, 128K) 26799 | |
> >> > [128K, 256K) 2665 | |
> >> > [256K, 512K) 665 | |
> >> >
> >> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >>
> >> This is very cool! One question below, just to make sure I understand
> >> this correctly:
> >>
> >> > ---
> >> > include/net/sch_generic.h | 4 +-
> >> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> >> > net/sched/sch_generic.c | 5 ---
> >> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >> >
> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> >> > --- a/include/net/sch_generic.h
> >> > +++ b/include/net/sch_generic.h
> >> > @@ -115,7 +115,9 @@ struct Qdisc {
> >> > struct Qdisc *next_sched;
> >> > struct sk_buff_head skb_bad_txq;
> >> >
> >> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> >> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> >> > + struct llist_head defer_list;
> >> > +
> >> > spinlock_t seqlock;
> >> >
> >> > struct rcu_head rcu;
> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> >> > --- a/net/core/dev.c
> >> > +++ b/net/core/dev.c
> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >> > struct net_device *dev,
> >> > struct netdev_queue *txq)
> >> > {
> >> > + struct sk_buff *next, *to_free = NULL;
> >> > spinlock_t *root_lock = qdisc_lock(q);
> >> > - struct sk_buff *to_free = NULL;
> >> > - bool contended;
> >> > + struct llist_node *ll_list, *first_n;
> >> > + unsigned long defer_count = 0;
> >> > int rc;
> >> >
> >> > qdisc_calculate_pkt_len(skb, q);
> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >> > return rc;
> >> > }
> >> >
> >> > - /*
> >> > - * Heuristic to force contended enqueues to serialize on a
> >> > - * separate lock before trying to get qdisc main lock.
> >> > - * This permits qdisc->running owner to get the lock more
> >> > - * often and dequeue packets faster.
> >> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> >> > - * and then other tasks will only enqueue packets. The packets will be
> >> > - * sent after the qdisc owner is scheduled again. To prevent this
> >> > - * scenario the task always serialize on the lock.
> >> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> >> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> >> > + * at most once to limit the number of skbs in defer_list.
> >> > + * We perform the defer_count increment only if the list is not empty,
> >> > + * because some arches have slow atomic_long_inc_return().
> >> > + */
> >> > + first_n = READ_ONCE(q->defer_list.first);
> >> > + do {
> >> > + if (first_n && !defer_count) {
> >> > + defer_count = atomic_long_inc_return(&q->defer_count);
> >> > + if (unlikely(defer_count > q->limit)) {
> >> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> >> > + return NET_XMIT_DROP;
> >> > + }
> >> > + }
> >> > + skb->ll_node.next = first_n;
> >> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> >> > +
> >> > + /* If defer_list was not empty, we know the cpu which queued
> >> > + * the first skb will process the whole list for us.
> >> > */
> >> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> >> > - if (unlikely(contended))
> >> > - spin_lock(&q->busylock);
> >> > + if (first_n)
> >> > + return NET_XMIT_SUCCESS;
> >> >
> >> > spin_lock(root_lock);
> >> > +
> >> > + ll_list = llist_del_all(&q->defer_list);
> >> > + /* There is a small race because we clear defer_count not atomically
> >> > + * with the prior llist_del_all(). This means defer_list could grow
> >> > + * over q->limit.
> >> > + */
> >> > + atomic_long_set(&q->defer_count, 0);
> >> > +
> >> > + ll_list = llist_reverse_order(ll_list);
> >> > +
> >> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> >> > - __qdisc_drop(skb, &to_free);
> >> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> >> > + __qdisc_drop(skb, &to_free);
> >> > rc = NET_XMIT_DROP;
> >> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> >> > - qdisc_run_begin(q)) {
> >> > + goto unlock;
> >> > + }
> >> > + rc = NET_XMIT_SUCCESS;
> >> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> >> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> >> > /*
> >> > * This is a work-conserving queue; there are no old skbs
> >> > * waiting to be sent out; and the qdisc is not running -
> >> > * xmit the skb directly.
> >> > */
> >> >
> >> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> >>
> >> I was puzzled by this^. But AFAICT, the idea is that in the
> >> non-contended path we're in here (no other CPU enqueueing packets), we
> >> will still have added the skb to the llist before taking the root_lock,
> >> and here we pull it back off the list, right?
> >>
> >> Even though this is technically not needed (we'll always get the same
> >> skb back, I think?), so this is mostly for consistency(?)
> >
> > Exactly. I guess we could instead add an assert like
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> > * xmit the skb directly.
> > */
> >
> > - skb = llist_entry(ll_list, struct sk_buff, ll_node);
> > + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
> > struct sk_buff, ll_node));
> > qdisc_bstats_update(q, skb);
> > if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> > __qdisc_run(q);
>
> OK, so thinking about this some more, isn't there a race between the
>
> if (first_n)
> return NET_XMIT_SUCCESS;
>
> and taking the lock? I.e., two different CPUs can pass that check in
> which case one of them will end up spinning on the lock, and by the time
> it acquires it, there is no longer any guarantee that the skb on the
> llist will be the same one that we started with? Or indeed that there
> will be any skbs on the list at all?
Only one cpu can observe the list was empty.
This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().
So after this single cpu passes the spin_lock() barrier, the list has at
least one skb in it.
If there is a single skb in the list after the xchg() got it
atomically, then it must be its own skb.
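Put differently, stripped of the defer_count limit that the patch open-codes,
the ownership rule is the usual llist pattern (a sketch, not the exact patch
code):

	/* llist_add() returns true only for the cpu that found the
	 * list empty; exactly that cpu takes the qdisc lock and
	 * flushes everything queued so far, its own skb included.
	 */
	if (!llist_add(&skb->ll_node, &q->defer_list))
		return NET_XMIT_SUCCESS;	/* the list owner will flush this skb */

	spin_lock(root_lock);
	ll_list = llist_del_all(&q->defer_list);	/* never empty here */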
>
> In which case, shouldn't there be a:
>
> if (unlikely(llist_empty(ll_list)))
> goto unlock;
>
> after the llist_del_all() assignment inside the lock? (and we should
> retain the skb = llist_entry()) assignment I was confused about).
>
> Or am I missing some reason this can't happen?
>
> -Toke
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 10:46 ` Eric Dumazet
@ 2025-10-08 12:11 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 12:11 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Eric Dumazet <edumazet@google.com> writes:
>> >>
>> >> > Remove busylock spinlock and use a lockless list (llist)
>> >> > to reduce spinlock contention to the minimum.
>> >> >
>> >> > Idea is that only one cpu might spin on the qdisc spinlock,
>> >> > while others simply add their skb in the llist.
>> >> >
>> >> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> >> > - Sending twice the number of packets per second.
>> >> > - While consuming 50 % less cycles.
>> >> >
>> >> > Tested:
>> >> >
>> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >> >
>> >> > Before:
>> >> >
>> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>> >> >
>> >> > Lock contention (1 second sample taken on 8 cores)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
>> >> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
>> >> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
>> >> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>> >> >
>> >> > If netperf threads are pinned, spinlock stress is very high.
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
>> >> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
>> >> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [256, 512) 21 | |
>> >> > [512, 1K) 631 | |
>> >> > [1K, 2K) 27328 |@ |
>> >> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
>> >> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [32K, 64K) 19055 |@ |
>> >> > [64K, 128K) 17240 |@ |
>> >> > [128K, 256K) 25633 |@ |
>> >> > [256K, 512K) 4 | |
>> >> >
>> >> > After:
>> >> >
>> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
>> >> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
>> >> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
>> >> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
>> >> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>> >> >
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
>> >> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
>> >> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
>> >> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
>> >> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
>> >> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>> >> >
>> >> > If netperf threads are pinned (~54 Mpps)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
>> >> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
>> >> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
>> >> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
>> >> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
>> >> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
>> >> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
>> >> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [512, 1K) 483936 |@@@@@@@@@@ |
>> >> > [1K, 2K) 265345 |@@@@@@ |
>> >> > [2K, 4K) 145463 |@@@ |
>> >> > [4K, 8K) 54571 |@ |
>> >> > [8K, 16K) 10270 | |
>> >> > [16K, 32K) 9385 | |
>> >> > [32K, 64K) 7749 | |
>> >> > [64K, 128K) 26799 | |
>> >> > [128K, 256K) 2665 | |
>> >> > [256K, 512K) 665 | |
>> >> >
>> >> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >>
>> >> This is very cool! One question below, just to make sure I understand
>> >> this correctly:
>> >>
>> >> > ---
>> >> > include/net/sch_generic.h | 4 +-
>> >> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
>> >> > net/sched/sch_generic.c | 5 ---
>> >> > 3 files changed, 52 insertions(+), 42 deletions(-)
>> >> >
>> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> >> > --- a/include/net/sch_generic.h
>> >> > +++ b/include/net/sch_generic.h
>> >> > @@ -115,7 +115,9 @@ struct Qdisc {
>> >> > struct Qdisc *next_sched;
>> >> > struct sk_buff_head skb_bad_txq;
>> >> >
>> >> > - spinlock_t busylock ____cacheline_aligned_in_smp;
>> >> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>> >> > + struct llist_head defer_list;
>> >> > +
>> >> > spinlock_t seqlock;
>> >> >
>> >> > struct rcu_head rcu;
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > struct net_device *dev,
>> >> > struct netdev_queue *txq)
>> >> > {
>> >> > + struct sk_buff *next, *to_free = NULL;
>> >> > spinlock_t *root_lock = qdisc_lock(q);
>> >> > - struct sk_buff *to_free = NULL;
>> >> > - bool contended;
>> >> > + struct llist_node *ll_list, *first_n;
>> >> > + unsigned long defer_count = 0;
>> >> > int rc;
>> >> >
>> >> > qdisc_calculate_pkt_len(skb, q);
>> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > return rc;
>> >> > }
>> >> >
>> >> > - /*
>> >> > - * Heuristic to force contended enqueues to serialize on a
>> >> > - * separate lock before trying to get qdisc main lock.
>> >> > - * This permits qdisc->running owner to get the lock more
>> >> > - * often and dequeue packets faster.
>> >> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> >> > - * and then other tasks will only enqueue packets. The packets will be
>> >> > - * sent after the qdisc owner is scheduled again. To prevent this
>> >> > - * scenario the task always serialize on the lock.
>> >> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> >> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
>> >> > + * at most once to limit the number of skbs in defer_list.
>> >> > + * We perform the defer_count increment only if the list is not empty,
>> >> > + * because some arches have slow atomic_long_inc_return().
>> >> > + */
>> >> > + first_n = READ_ONCE(q->defer_list.first);
>> >> > + do {
>> >> > + if (first_n && !defer_count) {
>> >> > + defer_count = atomic_long_inc_return(&q->defer_count);
>> >> > + if (unlikely(defer_count > q->limit)) {
>> >> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> >> > + return NET_XMIT_DROP;
>> >> > + }
>> >> > + }
>> >> > + skb->ll_node.next = first_n;
>> >> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> >> > +
>> >> > + /* If defer_list was not empty, we know the cpu which queued
>> >> > + * the first skb will process the whole list for us.
>> >> > */
>> >> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> >> > - if (unlikely(contended))
>> >> > - spin_lock(&q->busylock);
>> >> > + if (first_n)
>> >> > + return NET_XMIT_SUCCESS;
>> >> >
>> >> > spin_lock(root_lock);
>> >> > +
>> >> > + ll_list = llist_del_all(&q->defer_list);
>> >> > + /* There is a small race because we clear defer_count not atomically
>> >> > + * with the prior llist_del_all(). This means defer_list could grow
>> >> > + * over q->limit.
>> >> > + */
>> >> > + atomic_long_set(&q->defer_count, 0);
>> >> > +
>> >> > + ll_list = llist_reverse_order(ll_list);
>> >> > +
>> >> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> >> > - __qdisc_drop(skb, &to_free);
>> >> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> >> > + __qdisc_drop(skb, &to_free);
>> >> > rc = NET_XMIT_DROP;
>> >> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > - qdisc_run_begin(q)) {
>> >> > + goto unlock;
>> >> > + }
>> >> > + rc = NET_XMIT_SUCCESS;
>> >> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
>> >> > /*
>> >> > * This is a work-conserving queue; there are no old skbs
>> >> > * waiting to be sent out; and the qdisc is not running -
>> >> > * xmit the skb directly.
>> >> > */
>> >> >
>> >> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> >>
>> >> I was puzzled by this^. But AFAICT, the idea is that in the
>> >> non-contended path we're in here (no other CPU enqueueing packets), we
>> >> will still have added the skb to the llist before taking the root_lock,
>> >> and here we pull it back off the list, right?
>> >>
>> >> Even though this is technically not needed (we'll always get the same
>> >> skb back, I think?), so this is mostly for consistency(?)
>> >
>> > Exactly. I guess we could instead add an assert like
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > * xmit the skb directly.
>> > */
>> >
>> > - skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> > + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
>> > qdisc_bstats_update(q, skb);
>> > if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>> > __qdisc_run(q);
>>
>> OK, so thinking about this some more, isn't there a race between the
>>
>> if (first_n)
>> return NET_XMIT_SUCCESS;
>>
>> and taking the lock? I.e., two different CPUs can pass that check in
>> which case one of them will end up spinning on the lock, and by the time
>> it acquires it, there is no longer any guarantee that the skb on the
>> llist will be the same one that we started with? Or indeed that there
>> will be any skbs on the list at all?
>
> Only one cpu can observe the list was empty.
>
> This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().
Ah, right, missed the bit where try_cmpxchg would update first_n if it
doesn't succeed...
> So after this only cpu passes the spinlock() barrier, the list has at
> least one skb in it.
>
> If there is a single skb in the list after the xchg() got it
> atomically, then it must be its own skb.
Gotcha! In that case I agree the DEBUG_NET_WARN_ON_ONCE() is clearer
than reassigning the skb pointer.
-Toke
^ permalink raw reply [flat|nested] 23+ messages in thread
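The enqueue/drain contract described in the exchange above can be sketched with plain C11 atomics. This is only an illustration of the idea, not the kernel llist or Qdisc API: the names deferq, deferq_add and deferq_del_all are invented for the example. The point it shows is that exactly one enqueuer can observe the list empty, and that enqueuer is the one obliged to take the lock and drain everybody's entries; all the others just push and return.

/* Minimal user-space sketch of the pattern discussed above.
 * Illustrative only; not the kernel llist/Qdisc code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next;
	int val;
};

struct deferq {
	_Atomic(struct node *) first;
	atomic_long count;	/* approximate occupancy, like defer_count */
	long limit;
};

/* Returns 1 if the caller saw the list empty (it must drain),
 * 0 if someone else will drain, -1 if over the limit (caller drops).
 */
static int deferq_add(struct deferq *q, struct node *n)
{
	struct node *first = atomic_load(&q->first);
	bool counted = false;

	do {
		/* Count at most once, and only when the list is not empty,
		 * mirroring the defer_count heuristic in the patch.
		 */
		if (first && !counted) {
			counted = true;
			if (atomic_fetch_add(&q->count, 1) + 1 > q->limit)
				return -1;
		}
		n->next = first;
	} while (!atomic_compare_exchange_weak(&q->first, &first, n));

	/* Only the enqueuer whose cmpxchg installed the first element can
	 * see a NULL old head; that caller becomes the drainer.
	 */
	return first == NULL;
}

static struct node *deferq_del_all(struct deferq *q)
{
	struct node *list = atomic_exchange(&q->first, NULL);

	/* Cleared after the exchange, not atomically with it: the same
	 * small race the patch comment mentions.
	 */
	atomic_store(&q->count, 0);
	return list;	/* newest-first, hence llist_reverse_order() upstream */
}

int main(void)
{
	struct deferq q = { .limit = 4 };
	struct node a = { .val = 1 }, b = { .val = 2 };

	printf("a must drain: %d\n", deferq_add(&q, &a));	/* 1 */
	printf("b must drain: %d\n", deferq_add(&q, &b));	/* 0 */

	for (struct node *n = deferq_del_all(&q); n; n = n->next)
		printf("drained %d\n", n->val);
	return 0;
}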
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-07 19:41 ` Maciej Fijalkowski
@ 2025-10-09 8:37 ` Jason Xing
0 siblings, 0 replies; 23+ messages in thread
From: Jason Xing @ 2025-10-09 8:37 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Alexander Lobakin, Eric Dumazet, David S . Miller, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jamal Hadi Salim, Cong Wang,
Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet
On Wed, Oct 8, 2025 at 3:42 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Oct 07, 2025 at 05:26:46PM +0200, Alexander Lobakin wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 6 Oct 2025 19:30:59 +0000
> >
> > > While stress testing UDP senders on a host with expensive indirect
> > > calls, I found cpus processing TX completions were showing
> > > a very high cost (20%) in sock_wfree() due to
> > > CONFIG_MITIGATION_RETPOLINE=y.
> > >
> > > Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > > net/core/skbuff.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> > > skb_dst_drop(skb);
> > > if (skb->destructor) {
> > > DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > > - skb->destructor(skb);
> > > + INDIRECT_CALL_3(skb->destructor,
> > > + tcp_wfree, __sock_wfree, sock_wfree,
> > > + skb);
> >
> > Not sure, but maybe we could add generic XSk skb destructor here as
> > well?
I added the following snippet[1] and only saw a stable ~1% improvement
when sending 64-byte packets with xdpsock.
I'm not so sure it deserves a follow-up patch to Eric's series. Better
than nothing? Any ideas on this one?
[1]
INDIRECT_CALL_4(skb->destructor, tcp_wfree, __sock_wfree, sock_wfree,
xsk_destruct_skb, skb);
> > Or it's not that important as generic XSk is not the best way to
> > use XDP sockets?
Yes, it surely matters. At least, virtio_net and veth need this copy
mode. And I've been working on batch xmit to ramp up the generic path.
> >
> > Maciej, what do you think?
>
> I would appreciate it as there have been various attempts to optimize xsk
> generic xmit path.
So do I!
Thanks,
Jason
>
> >
> > > }
> > > #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> > > nf_conntrack_put(skb_nfct(skb));
> >
> > Thanks,
> > Olek
>
^ permalink raw reply [flat|nested] 23+ messages in thread
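For context on the INDIRECT_CALL_*() helpers being extended in this sub-thread: the idea is to compare the function pointer against a short list of likely targets so that the common case becomes a direct, branch-predictable call instead of a retpoline-protected indirect one. Below is a stand-alone sketch of that idea; CALL_2 and the two destructor stand-ins are invented names for the example, not the kernel macros or functions.

#include <stdio.h>

static void tcp_like_wfree(int *cost)  { *cost += 1; }
static void sock_like_wfree(int *cost) { *cost += 2; }

/* Devirtualize: try the expected targets first, fall back to the
 * indirect call only if neither matches.
 */
#define CALL_2(f, f1, f2, ...)				\
	((f) == (f1) ? (f1)(__VA_ARGS__) :		\
	 (f) == (f2) ? (f2)(__VA_ARGS__) :		\
		       (f)(__VA_ARGS__))

int main(void)
{
	void (*destructor)(int *) = sock_like_wfree;
	int cost = 0;

	/* Matches the second candidate, so a direct call is taken. */
	CALL_2(destructor, tcp_like_wfree, sock_like_wfree, &cost);
	printf("cost = %d\n", cost);	/* prints 2 */
	return 0;
}

Adding more candidates, as with the xsk_destruct_skb idea above, trades a few extra pointer compares for avoiding the indirect branch in more cases.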
* Re: [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-12 18:34 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Jamal Hadi Salim @ 2025-10-12 15:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Mon, Oct 6, 2025 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
> on device recursion") because it adds code in the fast path, even when
> act_mirred is not used.
>
> Use an additional device pointers array in struct netdev_xmit
> and implement loop detection in tcf_mirred_is_act_redirect().
Patch series looks good!
This has the potential of (later on) fixing issues that are currently
broken after the TTL bits were taken away.
Small suggestion: the commit message was a bit confusing to me. How about:
Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device
recursion") adds code in the fast path, even when act_mirred is not
used. We revert it in the next patch.
Prepare by adding an additional device pointers array in struct
netdev_xmit and implement loop detection in
tcf_mirred_is_act_redirect().
Please give us time to run tests on this set!
cheers,
jamal
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/netdevice_xmit.h | 9 ++++-
> net/sched/act_mirred.c | 62 +++++++++++++---------------------
> 2 files changed, 31 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
> index 813a19122ebbb2c6a04176330b1055b7c2b9c902..cc232508e695eefe95ea6e55a21978be11d5da83 100644
> --- a/include/linux/netdevice_xmit.h
> +++ b/include/linux/netdevice_xmit.h
> @@ -2,6 +2,12 @@
> #ifndef _LINUX_NETDEVICE_XMIT_H
> #define _LINUX_NETDEVICE_XMIT_H
>
> +#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
> +#define MIRRED_NEST_LIMIT 4
> +#endif
> +
> +struct net_device;
> +
> struct netdev_xmit {
> u16 recursion;
> u8 more;
> @@ -9,7 +15,8 @@ struct netdev_xmit {
> u8 skip_txqueue;
> #endif
> #if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
> - u8 sched_mirred_nest;
> + u8 sched_mirred_nest;
> + struct net_device *sched_mirred_dev[MIRRED_NEST_LIMIT];
> #endif
> #if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
> u8 nf_dup_skb_recursion;
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 5f01f567c934d3669d9a3058cff861a8fe5f88b6..f27b583def78e4afecc7112854b93d59c2520201 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -29,31 +29,6 @@
> static LIST_HEAD(mirred_list);
> static DEFINE_SPINLOCK(mirred_list_lock);
>
> -#define MIRRED_NEST_LIMIT 4
> -
> -#ifndef CONFIG_PREEMPT_RT
> -static u8 tcf_mirred_nest_level_inc_return(void)
> -{
> - return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
> -}
> -
> -static void tcf_mirred_nest_level_dec(void)
> -{
> - __this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
> -}
> -
> -#else
> -static u8 tcf_mirred_nest_level_inc_return(void)
> -{
> - return current->net_xmit.sched_mirred_nest++;
> -}
> -
> -static void tcf_mirred_nest_level_dec(void)
> -{
> - current->net_xmit.sched_mirred_nest--;
> -}
> -#endif
> -
> static bool tcf_mirred_is_act_redirect(int action)
> {
> return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
> @@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
> {
> struct tcf_mirred *m = to_mirred(a);
> int retval = READ_ONCE(m->tcf_action);
> - unsigned int nest_level;
> + struct netdev_xmit *xmit;
> bool m_mac_header_xmit;
> struct net_device *dev;
> - int m_eaction;
> + int i, m_eaction;
> u32 blockid;
>
> - nest_level = tcf_mirred_nest_level_inc_return();
> - if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
> +#ifdef CONFIG_PREEMPT_RT
> + xmit = ¤t->net_xmit;
> +#else
> + xmit = this_cpu_ptr(&softnet_data.xmit);
> +#endif
> + if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
> net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
> netdev_name(skb->dev));
> - retval = TC_ACT_SHOT;
> - goto dec_nest_level;
> + return TC_ACT_SHOT;
> }
>
> tcf_lastuse_update(&m->tcf_tm);
> tcf_action_update_bstats(&m->common, skb);
>
> blockid = READ_ONCE(m->tcfm_blockid);
> - if (blockid) {
> - retval = tcf_blockcast(skb, m, blockid, res, retval);
> - goto dec_nest_level;
> - }
> + if (blockid)
> + return tcf_blockcast(skb, m, blockid, res, retval);
>
> dev = rcu_dereference_bh(m->tcfm_dev);
> if (unlikely(!dev)) {
> pr_notice_once("tc mirred: target device is gone\n");
> tcf_action_inc_overlimit_qstats(&m->common);
> - goto dec_nest_level;
> + return retval;
> }
> + for (i = 0; i < xmit->sched_mirred_nest; i++) {
> + if (xmit->sched_mirred_dev[i] != dev)
> + continue;
> + pr_notice_once("tc mirred: loop on device %s\n",
> + netdev_name(dev));
> + tcf_action_inc_overlimit_qstats(&m->common);
> + return retval;
> + }
> +
> + xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
>
> m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
> m_eaction = READ_ONCE(m->tcfm_eaction);
>
> retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
> retval);
> -
> -dec_nest_level:
> - tcf_mirred_nest_level_dec();
> + xmit->sched_mirred_nest--;
>
> return retval;
> }
> --
> 2.51.0.618.g983fd99d29-goog
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-12 15:22 ` Jamal Hadi Salim
@ 2025-10-12 18:34 ` Eric Dumazet
0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-12 18:34 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Sun, Oct 12, 2025 at 8:23 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Mon, Oct 6, 2025 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
> > on device recursion") because it adds code in the fast path, even when
> > act_mirred is not used.
> >
> > Use an additional device pointers array in struct netdev_xmit
> > and implement loop detection in tcf_mirred_is_act_redirect().
>
> Patch series looks good!
> This has the potential of (later on) fixing issue that are currently
> broken after the TTL bits were taken away.
> Small suggestion, the commit message was a bit confusing to me. How about:
>
> Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device
> recursion") it adds code in the fast path, even when act_mirred is not
> used. We revert in the next patch.
>
> Prepare by adding an additional device pointers array in struct
> netdev_xmit and implement loop detection in
> tcf_mirred_is_act_redirect().
>
> Please give us time to run tests on this set!
SGTM, I will send when net-next reopens, with an amended changelog
net/sched: act_mirred: add loop detection
Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
added code in the fast path, even when act_mirred is not used.
Prepare its revert by implementing loop detection in act_mirred.
Adds an array of device pointers in struct netdev_xmit.
tcf_mirred_is_act_redirect() can detect if the array
already contains the target device.
^ permalink raw reply [flat|nested] 23+ messages in thread
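A rough stand-alone illustration of the loop-detection scheme agreed on in this sub-thread: remember which devices the packet has already been redirected to, up to a small nesting limit, and refuse a redirect that would revisit one of them. The names xmit_ctx and redirect_allowed are invented for the sketch; they are not the kernel structures or helpers.

#include <stdbool.h>
#include <stdio.h>

#define NEST_LIMIT 4

struct xmit_ctx {
	unsigned int nest;
	const void *seen_dev[NEST_LIMIT];
};

static bool redirect_allowed(struct xmit_ctx *ctx, const void *dev)
{
	if (ctx->nest >= NEST_LIMIT)
		return false;			/* nesting too deep */
	for (unsigned int i = 0; i < ctx->nest; i++)
		if (ctx->seen_dev[i] == dev)
			return false;		/* loop: device already visited */
	ctx->seen_dev[ctx->nest++] = dev;
	return true;
}

int main(void)
{
	struct xmit_ctx ctx = { 0 };
	int eth0, eth1;				/* stand-ins for net_device */

	printf("%d\n", redirect_allowed(&ctx, &eth0));	/* 1 */
	printf("%d\n", redirect_allowed(&ctx, &eth1));	/* 1 */
	printf("%d\n", redirect_allowed(&ctx, &eth0));	/* 0: loop detected */
	return 0;
}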
end of thread
Thread overview: 23+ messages
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
2025-10-07 19:41 ` Maciej Fijalkowski
2025-10-09 8:37 ` Jason Xing
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-12 18:34 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
2025-10-08 7:32 ` Eric Dumazet
2025-10-08 7:44 ` Paolo Abeni
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2025-10-08 9:32 ` Eric Dumazet
2025-10-08 10:05 ` Toke Høiland-Jørgensen
2025-10-08 10:46 ` Eric Dumazet
2025-10-08 12:11 ` Toke Høiland-Jørgensen
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
2025-10-07 7:41 ` Eric Dumazet