* [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency
@ 2025-10-06 19:30 Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:30 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
In this series, I replace the busylock spinlock we have in
__dev_queue_xmit() with a lockless list (llist) to reduce
spinlock contention to a minimum.
The idea is that at most one cpu spins on the qdisc spinlock,
while the others simply add their skb to the llist.
After this series, we get a 300 % (4x) efficiency improvement on heavy
TX workloads, sending twice the number of packets per second for half
the cpu cycles.
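A minimal user-space sketch of the handoff idea follows; C11 atomics and
a pthread mutex stand in for llist and the qdisc spinlock. It is an
illustration only, the actual kernel changes are in patch 5/5.
/*
 * Sketch of the scheme: every producer pushes its item onto a lock-free
 * list; only the producer that found the list empty takes the heavy lock
 * and drains everything that accumulated in the meantime.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
struct item {
	struct item *next;		/* plays the role of skb->ll_node */
};
struct queue {
	_Atomic(struct item *) head;	/* plays the role of q->defer_list */
	pthread_mutex_t lock;		/* plays the role of the qdisc spinlock */
};
static void enqueue_and_maybe_drain(struct queue *q, struct item *it,
				    void (*process)(struct item *batch))
{
	struct item *first = atomic_load_explicit(&q->head, memory_order_relaxed);
	/* Lock-free push, equivalent to llist_add(). */
	do {
		it->next = first;
	} while (!atomic_compare_exchange_weak_explicit(&q->head, &first, it,
							memory_order_release,
							memory_order_relaxed));
	/* If the list was not empty, the cpu that queued the first item
	 * will drain it for us: no lock is taken at all.
	 */
	if (first)
		return;
	pthread_mutex_lock(&q->lock);
	/* Take ownership of everything queued so far (llist_del_all()). */
	first = atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
	process(first);			/* items come back in reverse push order */
	pthread_mutex_unlock(&q->lock);
}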
Eric Dumazet (5):
net: add add indirect call wrapper in skb_release_head_state()
net/sched: act_mirred: add loop detection
Revert "net/sched: Fix mirred deadlock on device recursion"
net: sched: claim one cache line in Qdisc
net: dev_queue_xmit() llist adoption
include/linux/netdevice_xmit.h | 9 +++-
include/net/sch_generic.h | 23 ++++-----
net/core/dev.c | 91 ++++++++++++++++++----------------
net/core/skbuff.c | 4 +-
net/sched/act_mirred.c | 62 +++++++++--------------
net/sched/sch_generic.c | 7 ---
6 files changed, 93 insertions(+), 103 deletions(-)
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
@ 2025-10-06 19:30 ` Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
` (4 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:30 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
While stress testing UDP senders on a host with expensive indirect
calls, I found cpus processing TX completions were showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.
Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
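For reference, the INDIRECT_CALL_3() use below is roughly equivalent to
the following open-coded sequence: the expected destructors are compared
and called directly, and only an unknown destructor pays for the
retpolined indirect call. This is a simplified illustration, not the
exact macro from include/linux/indirect_call_wrapper.h.
if (skb->destructor == tcp_wfree)
	tcp_wfree(skb);
else if (skb->destructor == __sock_wfree)
	__sock_wfree(skb);
else if (skb->destructor == sock_wfree)
	sock_wfree(skb);
else
	skb->destructor(skb);	/* indirect call, paying the retpoline cost */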
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
skb_dst_drop(skb);
if (skb->destructor) {
DEBUG_NET_WARN_ON_ONCE(in_hardirq());
- skb->destructor(skb);
+ INDIRECT_CALL_3(skb->destructor,
+ tcp_wfree, __sock_wfree, sock_wfree,
+ skb);
}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
nf_conntrack_put(skb_nfct(skb));
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
` (3 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
on device recursion") because it adds code in the fast path, even when
act_mirred is not used.
Use an additional array of device pointers in struct netdev_xmit
and implement loop detection in tcf_mirred_act().
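The essence of the check added to tcf_mirred_act() below, as a
standalone sketch (the helper name is for illustration only):
/* Illustration only: each cpu (or task on PREEMPT_RT) records the chain
 * of target devices it is currently redirecting through. Seeing the same
 * device twice means a loop; recording MIRRED_NEST_LIMIT devices means
 * the chain is too deep. The real patch handles the two cases with
 * slightly different drop/overlimit accounting.
 */
static bool tcf_mirred_would_loop(const struct netdev_xmit *xmit,
				  const struct net_device *dev)
{
	int i;

	if (xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)
		return true;
	for (i = 0; i < xmit->sched_mirred_nest; i++)
		if (xmit->sched_mirred_dev[i] == dev)
			return true;
	return false;
}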
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice_xmit.h | 9 ++++-
net/sched/act_mirred.c | 62 +++++++++++++---------------------
2 files changed, 31 insertions(+), 40 deletions(-)
diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
index 813a19122ebbb2c6a04176330b1055b7c2b9c902..cc232508e695eefe95ea6e55a21978be11d5da83 100644
--- a/include/linux/netdevice_xmit.h
+++ b/include/linux/netdevice_xmit.h
@@ -2,6 +2,12 @@
#ifndef _LINUX_NETDEVICE_XMIT_H
#define _LINUX_NETDEVICE_XMIT_H
+#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
+#define MIRRED_NEST_LIMIT 4
+#endif
+
+struct net_device;
+
struct netdev_xmit {
u16 recursion;
u8 more;
@@ -9,7 +15,8 @@ struct netdev_xmit {
u8 skip_txqueue;
#endif
#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
- u8 sched_mirred_nest;
+ u8 sched_mirred_nest;
+ struct net_device *sched_mirred_dev[MIRRED_NEST_LIMIT];
#endif
#if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
u8 nf_dup_skb_recursion;
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5f01f567c934d3669d9a3058cff861a8fe5f88b6..f27b583def78e4afecc7112854b93d59c2520201 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -29,31 +29,6 @@
static LIST_HEAD(mirred_list);
static DEFINE_SPINLOCK(mirred_list_lock);
-#define MIRRED_NEST_LIMIT 4
-
-#ifndef CONFIG_PREEMPT_RT
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- __this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
-}
-
-#else
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return current->net_xmit.sched_mirred_nest++;
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- current->net_xmit.sched_mirred_nest--;
-}
-#endif
-
static bool tcf_mirred_is_act_redirect(int action)
{
return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
@@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
{
struct tcf_mirred *m = to_mirred(a);
int retval = READ_ONCE(m->tcf_action);
- unsigned int nest_level;
+ struct netdev_xmit *xmit;
bool m_mac_header_xmit;
struct net_device *dev;
- int m_eaction;
+ int i, m_eaction;
u32 blockid;
- nest_level = tcf_mirred_nest_level_inc_return();
- if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
+#ifdef CONFIG_PREEMPT_RT
+ xmit = &current->net_xmit;
+#else
+ xmit = this_cpu_ptr(&softnet_data.xmit);
+#endif
+ if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
netdev_name(skb->dev));
- retval = TC_ACT_SHOT;
- goto dec_nest_level;
+ return TC_ACT_SHOT;
}
tcf_lastuse_update(&m->tcf_tm);
tcf_action_update_bstats(&m->common, skb);
blockid = READ_ONCE(m->tcfm_blockid);
- if (blockid) {
- retval = tcf_blockcast(skb, m, blockid, res, retval);
- goto dec_nest_level;
- }
+ if (blockid)
+ return tcf_blockcast(skb, m, blockid, res, retval);
dev = rcu_dereference_bh(m->tcfm_dev);
if (unlikely(!dev)) {
pr_notice_once("tc mirred: target device is gone\n");
tcf_action_inc_overlimit_qstats(&m->common);
- goto dec_nest_level;
+ return retval;
}
+ for (i = 0; i < xmit->sched_mirred_nest; i++) {
+ if (xmit->sched_mirred_dev[i] != dev)
+ continue;
+ pr_notice_once("tc mirred: loop on device %s\n",
+ netdev_name(dev));
+ tcf_action_inc_overlimit_qstats(&m->common);
+ return retval;
+ }
+
+ xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
m_eaction = READ_ONCE(m->tcfm_eaction);
retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
retval);
-
-dec_nest_level:
- tcf_mirred_nest_level_dec();
+ xmit->sched_mirred_nest--;
return retval;
}
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
A prior patch in this series implemented loop detection
in act_mirred, so we can remove q->owner to save some cycles
in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 1 -
net/core/dev.c | 6 ------
net/sched/sch_generic.c | 2 --
3 files changed, 9 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 738cd5b13c62fb8501619625880b87991f3dc17c..32e9961570b467b6066f1bb2c00ff1a270e342bc 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,7 +117,6 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
- int owner;
unsigned long state;
unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e98ee87776e6f8d3ca3d98f0711b3..0ff178399b2d28ca2754b3f06d69a97f5d6dcf71 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4167,10 +4167,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
- kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
- return NET_XMIT_DROP;
- }
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
@@ -4210,9 +4206,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
- WRITE_ONCE(q->owner, smp_processor_id());
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- WRITE_ONCE(q->owner, -1);
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1e008a228ebdf846d4ef7f83d655ac1142ec3596..dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -679,7 +679,6 @@ struct Qdisc noop_qdisc = {
.qlen = 0,
.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
},
- .owner = -1,
};
EXPORT_SYMBOL(noop_qdisc);
@@ -985,7 +984,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->enqueue = ops->enqueue;
sch->dequeue = ops->dequeue;
sch->dev_queue = dev_queue;
- sch->owner = -1;
netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
refcount_set(&sch->refcnt, 1);
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (2 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
5 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Replace the state2 field with a boolean.
Move it to a hole between qstats and state so that
we shrink struct Qdisc by a full cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 32e9961570b467b6066f1bb2c00ff1a270e342bc..31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,13 +41,6 @@ enum qdisc_state_t {
__QDISC_STATE_DRAINING,
};
-enum qdisc_state2_t {
- /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
- * Use qdisc_run_begin/end() or qdisc_is_running() instead.
- */
- __QDISC_STATE2_RUNNING,
-};
-
#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
@@ -117,8 +110,8 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
+ bool running; /* must be written under qdisc spinlock */
unsigned long state;
- unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
@@ -167,7 +160,7 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK)
return spin_is_locked(&qdisc->seqlock);
- return test_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ return READ_ONCE(qdisc->running);
}
static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
@@ -210,7 +203,10 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
*/
return spin_trylock(&qdisc->seqlock);
}
- return !__test_and_set_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ if (READ_ONCE(qdisc->running))
+ return false;
+ WRITE_ONCE(qdisc->running, true);
+ return true;
}
static inline void qdisc_run_end(struct Qdisc *qdisc)
@@ -228,7 +224,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
&qdisc->state)))
__netif_schedule(qdisc);
} else {
- __clear_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ WRITE_ONCE(qdisc->running, false);
}
}
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (3 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
@ 2025-10-06 19:31 ` Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
` (2 more replies)
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
5 siblings, 3 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:31 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Remove the busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to a minimum.
The idea is that at most one cpu spins on the qdisc spinlock,
while the others simply add their skb to the llist.
After this patch, we get a 300 % efficiency improvement on heavy TX workloads:
- Sending twice the number of packets per second.
- While consuming 50 % fewer cycles.
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 4 +-
net/core/dev.c | 85 ++++++++++++++++++++++-----------------
net/sched/sch_generic.c | 5 ---
3 files changed, 52 insertions(+), 42 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -115,7 +115,9 @@ struct Qdisc {
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
- spinlock_t busylock ____cacheline_aligned_in_smp;
+ atomic_long_t defer_count ____cacheline_aligned_in_smp;
+ struct llist_head defer_list;
+
spinlock_t seqlock;
struct rcu_head rcu;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
+ struct sk_buff *next, *to_free = NULL;
spinlock_t *root_lock = qdisc_lock(q);
- struct sk_buff *to_free = NULL;
- bool contended;
+ struct llist_node *ll_list, *first_n;
+ unsigned long defer_count = 0;
int rc;
qdisc_calculate_pkt_len(skb, q);
@@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- /*
- * Heuristic to force contended enqueues to serialize on a
- * separate lock before trying to get qdisc main lock.
- * This permits qdisc->running owner to get the lock more
- * often and dequeue packets faster.
- * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
- * and then other tasks will only enqueue packets. The packets will be
- * sent after the qdisc owner is scheduled again. To prevent this
- * scenario the task always serialize on the lock.
+ /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+ * In the try_cmpxchg() loop, we want to increment q->defer_count
+ * at most once to limit the number of skbs in defer_list.
+ * We perform the defer_count increment only if the list is not empty,
+ * because some arches have slow atomic_long_inc_return().
+ */
+ first_n = READ_ONCE(q->defer_list.first);
+ do {
+ if (first_n && !defer_count) {
+ defer_count = atomic_long_inc_return(&q->defer_count);
+ if (unlikely(defer_count > q->limit)) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+ return NET_XMIT_DROP;
+ }
+ }
+ skb->ll_node.next = first_n;
+ } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
+
+ /* If defer_list was not empty, we know the cpu which queued
+ * the first skb will process the whole list for us.
*/
- contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
- if (unlikely(contended))
- spin_lock(&q->busylock);
+ if (first_n)
+ return NET_XMIT_SUCCESS;
spin_lock(root_lock);
+
+ ll_list = llist_del_all(&q->defer_list);
+ /* There is a small race because we clear defer_count not atomically
+ * with the prior llist_del_all(). This means defer_list could grow
+ * over q->limit.
+ */
+ atomic_long_set(&q->defer_count, 0);
+
+ ll_list = llist_reverse_order(ll_list);
+
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
- __qdisc_drop(skb, &to_free);
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+ __qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
- } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
- qdisc_run_begin(q)) {
+ goto unlock;
+ }
+ rc = NET_XMIT_SUCCESS;
+ if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+ !llist_next(ll_list) && qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
* xmit the skb directly.
*/
+ skb = llist_entry(ll_list, struct sk_buff, ll_node);
qdisc_bstats_update(q, skb);
-
- if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
+ if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
__qdisc_run(q);
- }
-
qdisc_run_end(q);
- rc = NET_XMIT_SUCCESS;
} else {
- rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- if (qdisc_run_begin(q)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
- __qdisc_run(q);
- qdisc_run_end(q);
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+ prefetch(next);
+ dev_qdisc_enqueue(skb, q, &to_free, txq);
}
+ qdisc_run(q);
}
+unlock:
spin_unlock(root_lock);
if (unlikely(to_free))
kfree_skb_list_reason(to_free,
tcf_get_drop_reason(to_free));
- if (unlikely(contended))
- spin_unlock(&q->busylock);
return rc;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89..d9a98d02a55fc361a223f3201e37b6a2b698bb5e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -666,7 +666,6 @@ struct Qdisc noop_qdisc = {
.ops = &noop_qdisc_ops,
.q.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
.dev_queue = &noop_netdev_queue,
- .busylock = __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
.gso_skb = {
.next = (struct sk_buff *)&noop_qdisc.gso_skb,
.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
@@ -970,10 +969,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
}
}
- spin_lock_init(&sch->busylock);
- lockdep_set_class(&sch->busylock,
- dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
-
/* seqlock has the same scope of busylock, for NOLOCK qdisc */
spin_lock_init(&sch->seqlock);
lockdep_set_class(&sch->seqlock,
--
2.51.0.618.g983fd99d29-goog
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-06 19:38 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions were showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> skb_dst_drop(skb);
> if (skb->destructor) {
> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> - skb->destructor(skb);
> + INDIRECT_CALL_3(skb->destructor,
> + tcp_wfree, __sock_wfree, sock_wfree,
> + skb);
This will probably not compile well with "# CONFIG_INET is not set".
I will fix this when submitting the non-RFC version.
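One possible way to handle it, shown only as an illustration and not
necessarily the final fix, is to keep the direct calls for CONFIG_INET
builds and fall back to the plain indirect call otherwise:
#if IS_ENABLED(CONFIG_INET)
		/* tcp_wfree() only exists with CONFIG_INET */
		INDIRECT_CALL_3(skb->destructor,
				tcp_wfree, __sock_wfree, sock_wfree,
				skb);
#else
		skb->destructor(skb);
#endif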
^ permalink raw reply [flat|nested] 23+ messages in thread
* [syzbot ci] Re: net: optimize TX throughput and efficiency
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
` (4 preceding siblings ...)
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-07 5:23 ` syzbot ci
2025-10-07 7:41 ` Eric Dumazet
5 siblings, 1 reply; 23+ messages in thread
From: syzbot ci @ 2025-10-07 5:23 UTC (permalink / raw)
To: davem, edumazet, eric.dumazet, horms, jhs, jiri, kuba, kuniyu,
netdev, pabeni, willemb, xiyou.wangcong
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] net: optimize TX throughput and efficiency
https://lore.kernel.org/all/20251006193103.2684156-1-edumazet@google.com
* [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
* [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
* [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
* [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
* [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
and found the following issue:
KASAN: slab-use-after-free Read in netem_dequeue
Full report is available here:
https://ci.syzbot.org/series/e8660f67-35a0-406e-96ee-a401d3f30ff9
***
KASAN: slab-use-after-free Read in netem_dequeue
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: f1455695d2d99894b65db233877acac9a0e120b9
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/2db7ec92-610b-4887-bf33-d0b4c04760c8/config
syz repro: https://ci.syzbot.org/findings/3ca47f46-1b94-48b6-bab9-5996b7162c30/syz_repro
==================================================================
BUG: KASAN: slab-use-after-free in netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
Read of size 8 at addr ffff888020b65b30 by task ksoftirqd/1/23
CPU: 1 UID: 0 PID: 23 Comm: ksoftirqd/1 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
dequeue_skb net/sched/sch_generic.c:294 [inline]
qdisc_restart net/sched/sch_generic.c:399 [inline]
__qdisc_run+0x23c/0x15f0 net/sched/sch_generic.c:417
qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
net_tx_action+0x7c9/0x980 net/core/dev.c:5731
handle_softirqs+0x286/0x870 kernel/softirq.c:579
run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 5913:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:330 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:356
kasan_slab_alloc include/linux/kasan.h:250 [inline]
slab_post_alloc_hook mm/slub.c:4191 [inline]
slab_alloc_node mm/slub.c:4240 [inline]
kmem_cache_alloc_node_noprof+0x1bb/0x3c0 mm/slub.c:4292
__alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
alloc_skb include/linux/skbuff.h:1383 [inline]
mld_newpack+0x13c/0xc40 net/ipv6/mcast.c:1775
add_grhead+0x5a/0x2a0 net/ipv6/mcast.c:1886
add_grec+0x1452/0x1740 net/ipv6/mcast.c:2025
mld_send_cr net/ipv6/mcast.c:2148 [inline]
mld_ifc_work+0x6ed/0xd60 net/ipv6/mcast.c:2693
process_one_work kernel/workqueue.c:3236 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3319
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3400
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Freed by task 23:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:243 [inline]
__kasan_slab_free+0x5b/0x80 mm/kasan/common.c:275
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2422 [inline]
slab_free mm/slub.c:4695 [inline]
kmem_cache_free+0x18f/0x400 mm/slub.c:4797
br_dev_xmit+0x11b3/0x1840 net/bridge/br_device.c:108
__netdev_start_xmit include/linux/netdevice.h:5248 [inline]
netdev_start_xmit include/linux/netdevice.h:5257 [inline]
xmit_one net/core/dev.c:3845 [inline]
dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3861
sch_direct_xmit+0x241/0x4b0 net/sched/sch_generic.c:344
qdisc_restart net/sched/sch_generic.c:409 [inline]
__qdisc_run+0xb16/0x15f0 net/sched/sch_generic.c:417
qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
net_tx_action+0x7c9/0x980 net/core/dev.c:5731
handle_softirqs+0x286/0x870 kernel/softirq.c:579
run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
The buggy address belongs to the object at ffff888020b65b00
which belongs to the cache skbuff_head_cache of size 240
The buggy address is located 48 bytes inside of
freed 240-byte region [ffff888020b65b00, ffff888020b65bf0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20b64
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
head: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
head: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
head: 00fff00000000001 ffffea000082d901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 1, migratetype Unmovable, gfp_mask 0x72820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_MEMALLOC|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 96149950019, free_ts 42914869228
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
prep_new_page mm/page_alloc.c:1859 [inline]
get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
alloc_slab_page mm/slub.c:2492 [inline]
allocate_slab+0x8a/0x370 mm/slub.c:2660
new_slab mm/slub.c:2714 [inline]
___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
__slab_alloc mm/slub.c:3992 [inline]
__slab_alloc_node mm/slub.c:4067 [inline]
slab_alloc_node mm/slub.c:4228 [inline]
kmem_cache_alloc_node_noprof+0x280/0x3c0 mm/slub.c:4292
__alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
__netdev_alloc_skb+0x108/0x970 net/core/skbuff.c:734
netdev_alloc_skb include/linux/skbuff.h:3484 [inline]
dev_alloc_skb include/linux/skbuff.h:3497 [inline]
__ieee80211_beacon_get+0xc06/0x1880 net/mac80211/tx.c:5652
ieee80211_beacon_get_tim+0xb4/0x2b0 net/mac80211/tx.c:5774
ieee80211_beacon_get include/net/mac80211.h:5667 [inline]
mac80211_hwsim_beacon_tx+0x3ce/0x860 drivers/net/wireless/virtual/mac80211_hwsim.c:2355
__iterate_interfaces+0x2ab/0x590 net/mac80211/util.c:761
ieee80211_iterate_active_interfaces_atomic+0xdb/0x180 net/mac80211/util.c:797
mac80211_hwsim_beacon+0xbb/0x1c0 drivers/net/wireless/virtual/mac80211_hwsim.c:2389
__run_hrtimer kernel/time/hrtimer.c:1761 [inline]
__hrtimer_run_queues+0x52c/0xc60 kernel/time/hrtimer.c:1825
page last free pid 0 tgid 0 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1395 [inline]
__free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895
rcu_do_batch kernel/rcu/tree.c:2605 [inline]
rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
handle_softirqs+0x286/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
Memory state around the buggy address:
ffff888020b65a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff888020b65a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888020b65b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888020b65b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff888020b65c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [syzbot ci] Re: net: optimize TX throughput and efficiency
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
@ 2025-10-07 7:41 ` Eric Dumazet
0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-07 7:41 UTC (permalink / raw)
To: syzbot ci
Cc: davem, eric.dumazet, horms, jhs, jiri, kuba, kuniyu, netdev,
pabeni, willemb, xiyou.wangcong, syzbot, syzkaller-bugs
On Mon, Oct 6, 2025 at 10:24 PM syzbot ci
<syzbot+ciad44046e74230deb@syzkaller.appspotmail.com> wrote:
>
> syzbot ci has tested the following series
>
> [v1] net: optimize TX throughput and efficiency
> https://lore.kernel.org/all/20251006193103.2684156-1-edumazet@google.com
> * [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
> * [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
> * [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
> * [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc
> * [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
>
> and found the following issue:
> KASAN: slab-use-after-free Read in netem_dequeue
>
> Full report is available here:
> https://ci.syzbot.org/series/e8660f67-35a0-406e-96ee-a401d3f30ff9
>
I was unsure if I needed to clear skb->next or not before calling
dev_qdisc_enqueue().
We could either do this in netem, or generically in dev_qdisc_enqueue():
diff --git a/net/core/dev.c b/net/core/dev.c
index 6094768bf3c028f0ad1e52b9b12b7258fa0ecff6..547efbfb63adb4a093ce4b4ea0934256c15e263b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4115,6 +4115,7 @@ static int dev_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *q,
{
int rc;
+ skb_mark_not_on_list(skb);
rc = q->enqueue(skb, q, to_free) & NET_XMIT_MASK;
if (rc == NET_XMIT_SUCCESS)
trace_qdisc_enqueue(q, txq, skb);
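For context: skb->ll_node lives in the same union as skb->next, so after
the defer_list walk skb->next can still hold a stale llist pointer that a
qdisc like netem must not interpret. The helper used above just clears
it; in current kernels it is simply:
/* include/linux/skbuff.h */
static inline void skb_mark_not_on_list(struct sk_buff *skb)
{
	skb->next = NULL;
}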
> ***
>
> KASAN: slab-use-after-free Read in netem_dequeue
>
> tree: net-next
> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
> base: f1455695d2d99894b65db233877acac9a0e120b9
> arch: amd64
> compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
> config: https://ci.syzbot.org/builds/2db7ec92-610b-4887-bf33-d0b4c04760c8/config
> syz repro: https://ci.syzbot.org/findings/3ca47f46-1b94-48b6-bab9-5996b7162c30/syz_repro
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
> Read of size 8 at addr ffff888020b65b30 by task ksoftirqd/1/23
>
> CPU: 1 UID: 0 PID: 23 Comm: ksoftirqd/1 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
> print_address_description mm/kasan/report.c:378 [inline]
> print_report+0xca/0x240 mm/kasan/report.c:482
> kasan_report+0x118/0x150 mm/kasan/report.c:595
> netem_dequeue+0x4e7/0x1430 net/sched/sch_netem.c:720
> dequeue_skb net/sched/sch_generic.c:294 [inline]
> qdisc_restart net/sched/sch_generic.c:399 [inline]
> __qdisc_run+0x23c/0x15f0 net/sched/sch_generic.c:417
> qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
> net_tx_action+0x7c9/0x980 net/core/dev.c:5731
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
> smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
>
> Allocated by task 5913:
> kasan_save_stack mm/kasan/common.c:47 [inline]
> kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
> unpoison_slab_object mm/kasan/common.c:330 [inline]
> __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:356
> kasan_slab_alloc include/linux/kasan.h:250 [inline]
> slab_post_alloc_hook mm/slub.c:4191 [inline]
> slab_alloc_node mm/slub.c:4240 [inline]
> kmem_cache_alloc_node_noprof+0x1bb/0x3c0 mm/slub.c:4292
> __alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
> alloc_skb include/linux/skbuff.h:1383 [inline]
> mld_newpack+0x13c/0xc40 net/ipv6/mcast.c:1775
> add_grhead+0x5a/0x2a0 net/ipv6/mcast.c:1886
> add_grec+0x1452/0x1740 net/ipv6/mcast.c:2025
> mld_send_cr net/ipv6/mcast.c:2148 [inline]
> mld_ifc_work+0x6ed/0xd60 net/ipv6/mcast.c:2693
> process_one_work kernel/workqueue.c:3236 [inline]
> process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3319
> worker_thread+0x8a0/0xda0 kernel/workqueue.c:3400
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>
> Freed by task 23:
> kasan_save_stack mm/kasan/common.c:47 [inline]
> kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
> kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
> poison_slab_object mm/kasan/common.c:243 [inline]
> __kasan_slab_free+0x5b/0x80 mm/kasan/common.c:275
> kasan_slab_free include/linux/kasan.h:233 [inline]
> slab_free_hook mm/slub.c:2422 [inline]
> slab_free mm/slub.c:4695 [inline]
> kmem_cache_free+0x18f/0x400 mm/slub.c:4797
> br_dev_xmit+0x11b3/0x1840 net/bridge/br_device.c:108
> __netdev_start_xmit include/linux/netdevice.h:5248 [inline]
> netdev_start_xmit include/linux/netdevice.h:5257 [inline]
> xmit_one net/core/dev.c:3845 [inline]
> dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3861
> sch_direct_xmit+0x241/0x4b0 net/sched/sch_generic.c:344
> qdisc_restart net/sched/sch_generic.c:409 [inline]
> __qdisc_run+0xb16/0x15f0 net/sched/sch_generic.c:417
> qdisc_run+0xc5/0x290 include/net/pkt_sched.h:126
> net_tx_action+0x7c9/0x980 net/core/dev.c:5731
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> run_ksoftirqd+0x9b/0x100 kernel/softirq.c:968
> smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
> kthread+0x711/0x8a0 kernel/kthread.c:463
> ret_from_fork+0x439/0x7d0 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>
> The buggy address belongs to the object at ffff888020b65b00
> which belongs to the cache skbuff_head_cache of size 240
> The buggy address is located 48 bytes inside of
> freed 240-byte region [ffff888020b65b00, ffff888020b65bf0)
>
> The buggy address belongs to the physical page:
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20b64
> head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
> page_type: f5(slab)
> raw: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
> head: 00fff00000000040 ffff88801cedf8c0 dead000000000122 0000000000000000
> head: 0000000000000000 0000000000150015 00000000f5000000 0000000000000000
> head: 00fff00000000001 ffffea000082d901 00000000ffffffff 00000000ffffffff
> head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 1, migratetype Unmovable, gfp_mask 0x72820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_MEMALLOC|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 96149950019, free_ts 42914869228
> set_page_owner include/linux/page_owner.h:32 [inline]
> post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
> prep_new_page mm/page_alloc.c:1859 [inline]
> get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
> __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148
> alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
> alloc_slab_page mm/slub.c:2492 [inline]
> allocate_slab+0x8a/0x370 mm/slub.c:2660
> new_slab mm/slub.c:2714 [inline]
> ___slab_alloc+0xbeb/0x1420 mm/slub.c:3901
> __slab_alloc mm/slub.c:3992 [inline]
> __slab_alloc_node mm/slub.c:4067 [inline]
> slab_alloc_node mm/slub.c:4228 [inline]
> kmem_cache_alloc_node_noprof+0x280/0x3c0 mm/slub.c:4292
> __alloc_skb+0x112/0x2d0 net/core/skbuff.c:660
> __netdev_alloc_skb+0x108/0x970 net/core/skbuff.c:734
> netdev_alloc_skb include/linux/skbuff.h:3484 [inline]
> dev_alloc_skb include/linux/skbuff.h:3497 [inline]
> __ieee80211_beacon_get+0xc06/0x1880 net/mac80211/tx.c:5652
> ieee80211_beacon_get_tim+0xb4/0x2b0 net/mac80211/tx.c:5774
> ieee80211_beacon_get include/net/mac80211.h:5667 [inline]
> mac80211_hwsim_beacon_tx+0x3ce/0x860 drivers/net/wireless/virtual/mac80211_hwsim.c:2355
> __iterate_interfaces+0x2ab/0x590 net/mac80211/util.c:761
> ieee80211_iterate_active_interfaces_atomic+0xdb/0x180 net/mac80211/util.c:797
> mac80211_hwsim_beacon+0xbb/0x1c0 drivers/net/wireless/virtual/mac80211_hwsim.c:2389
> __run_hrtimer kernel/time/hrtimer.c:1761 [inline]
> __hrtimer_run_queues+0x52c/0xc60 kernel/time/hrtimer.c:1825
> page last free pid 0 tgid 0 stack trace:
> reset_page_owner include/linux/page_owner.h:25 [inline]
> free_pages_prepare mm/page_alloc.c:1395 [inline]
> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895
> rcu_do_batch kernel/rcu/tree.c:2605 [inline]
> rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
> handle_softirqs+0x286/0x870 kernel/softirq.c:579
> __do_softirq kernel/softirq.c:613 [inline]
> invoke_softirq kernel/softirq.c:453 [inline]
> __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
> irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
> instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
> sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
> asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
>
> Memory state around the buggy address:
> ffff888020b65a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
> ffff888020b65a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >ffff888020b65b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff888020b65b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
> ffff888020b65c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
>
> ***
>
> If these findings have caused you to resend the series or submit a
> separate fix, please add the following tag to your commit message:
> Tested-by: syzbot@syzkaller.appspotmail.com
>
> ---
> This report is generated by a bot. It may contain errors.
> syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-07 11:37 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to a minimum.
>
> The idea is that at most one cpu spins on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this patch, we get a 300 % efficiency improvement on heavy TX workloads:
> - Sending twice the number of packets per second.
> - While consuming 50 % fewer cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
>
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> qdisc_bstats_update(q, skb);
> -
> - if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> + if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
> - }
> -
> qdisc_run_end(q);
> - rc = NET_XMIT_SUCCESS;
> } else {
> - rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> - if (qdisc_run_begin(q)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> - __qdisc_run(q);
> - qdisc_run_end(q);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> + prefetch(next);
> + dev_qdisc_enqueue(skb, q, &to_free, txq);
Now would be a good time to add batch support to some qdisc->enqueue()
handlers where this would help.
For instance, fq_enqueue() could take a single ktime_get() sample for the
whole batch.
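Purely as an illustration of that idea (the hook and the helper below do
not exist today and are made up for the sketch), a batched fq enqueue
could hoist the clock read like this:
/* Hypothetical sketch only: receiving the whole defer_list batch at once
 * would allow a single ktime_get() for all skbs instead of one per skb.
 */
static int fq_enqueue_batch(struct llist_node *batch, struct Qdisc *sch,
			    struct sk_buff **to_free)
{
	ktime_t now = ktime_get();	/* sampled once for the whole batch */
	struct sk_buff *skb, *next;
	int rc = NET_XMIT_SUCCESS;

	llist_for_each_entry_safe(skb, next, batch, ll_node)
		rc = fq_enqueue_one(skb, sch, to_free, now); /* made-up helper */

	return rc;
}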
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
@ 2025-10-07 15:26 ` Alexander Lobakin
2025-10-07 19:41 ` Maciej Fijalkowski
1 sibling, 1 reply; 23+ messages in thread
From: Alexander Lobakin @ 2025-10-07 15:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Maciej Fijalkowski, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 6 Oct 2025 19:30:59 +0000
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions were showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> skb_dst_drop(skb);
> if (skb->destructor) {
> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> - skb->destructor(skb);
> + INDIRECT_CALL_3(skb->destructor,
> + tcp_wfree, __sock_wfree, sock_wfree,
> + skb);
Not sure, but maybe we could add generic XSk skb destructor here as
well? Or it's not that important as generic XSk is not the best way to
use XDP sockets?
Maciej, what do you think?
> }
> #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> nf_conntrack_put(skb_nfct(skb));
Thanks,
Olek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-07 15:26 ` Alexander Lobakin
@ 2025-10-07 19:41 ` Maciej Fijalkowski
2025-10-09 8:37 ` Jason Xing
0 siblings, 1 reply; 23+ messages in thread
From: Maciej Fijalkowski @ 2025-10-07 19:41 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Tue, Oct 07, 2025 at 05:26:46PM +0200, Alexander Lobakin wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 6 Oct 2025 19:30:59 +0000
>
> > While stress testing UDP senders on a host with expensive indirect
> > calls, I found cpus processing TX completions were showing
> > a very high cost (20%) in sock_wfree() due to
> > CONFIG_MITIGATION_RETPOLINE=y.
> >
> > Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> > skb_dst_drop(skb);
> > if (skb->destructor) {
> > DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > - skb->destructor(skb);
> > + INDIRECT_CALL_3(skb->destructor,
> > + tcp_wfree, __sock_wfree, sock_wfree,
> > + skb);
>
> Not sure, but maybe we could add generic XSk skb destructor here as
> well? Or it's not that important as generic XSk is not the best way to
> use XDP sockets?
>
> Maciej, what do you think?
I would appreciate it, as there have been various attempts to optimize the xsk
generic xmit path.
>
> > }
> > #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> > nf_conntrack_put(skb_nfct(skb));
>
> Thanks,
> Olek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
@ 2025-10-08 6:37 ` Paolo Abeni
2025-10-08 7:32 ` Eric Dumazet
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2025-10-08 6:37 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
Hi,
On 10/6/25 9:31 PM, Eric Dumazet wrote:
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -115,7 +115,9 @@ struct Qdisc {
> struct Qdisc *next_sched;
> struct sk_buff_head skb_bad_txq;
>
> - spinlock_t busylock ____cacheline_aligned_in_smp;
> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> + struct llist_head defer_list;
> +
Dumb question: I guess the above brings back Qdisc to the original size
? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
commit message.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 6:37 ` Paolo Abeni
@ 2025-10-08 7:32 ` Eric Dumazet
2025-10-08 7:44 ` Paolo Abeni
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 7:32 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Tue, Oct 7, 2025 at 11:37 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On 10/6/25 9:31 PM, Eric Dumazet wrote:
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
>
> Dumb question: I guess the above brings back Qdisc to the original size
> ? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
> commit message.
Hmm, I do not think this changes the size.
The cache line that was starting at busylock had holes.
You can see that even adding the long and the pointer, we still have room in it.
/* --- cacheline 3 boundary (192 bytes) --- */
struct gnet_stats_queue qstats; /* 0xc0 0x14 */
bool running; /* 0xd4 0x1 */
/* XXX 3 bytes hole, try to pack */
unsigned long state; /* 0xd8 0x8 */
struct Qdisc * next_sched; /* 0xe0 0x8 */
struct sk_buff_head skb_bad_txq; /* 0xe8 0x18 */
/* --- cacheline 4 boundary (256 bytes) --- */
atomic_long_t defer_count __attribute__((__aligned__(64))); /* 0x100 0x8 */
struct llist_head defer_list; /* 0x108 0x8 */
spinlock_t seqlock; /* 0x110 0x4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head rcu __attribute__((__aligned__(8))); /* 0x118 0x10 */
netdevice_tracker dev_tracker; /* 0x128 0x8 */
struct lock_class_key root_lock_key; /* 0x130 0 */
/* XXX 16 bytes hole, try to pack */
/* --- cacheline 5 boundary (320 bytes) --- */
long privdata[] __attribute__((__aligned__(64))); /* 0x140 0 */
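(For reference, a layout dump like the one above can be regenerated with
"pahole -C Qdisc vmlinux" on a kernel built with debug info; exact offsets
depend on the config.)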
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 7:32 ` Eric Dumazet
@ 2025-10-08 7:44 ` Paolo Abeni
0 siblings, 0 replies; 23+ messages in thread
From: Paolo Abeni @ 2025-10-08 7:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Simon Horman, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On 10/8/25 9:32 AM, Eric Dumazet wrote:
> On Tue, Oct 7, 2025 at 11:37 PM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> Hi,
>>
>> On 10/6/25 9:31 PM, Eric Dumazet wrote:
>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>>> --- a/include/net/sch_generic.h
>>> +++ b/include/net/sch_generic.h
>>> @@ -115,7 +115,9 @@ struct Qdisc {
>>> struct Qdisc *next_sched;
>>> struct sk_buff_head skb_bad_txq;
>>>
>>> - spinlock_t busylock ____cacheline_aligned_in_smp;
>>> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>>> + struct llist_head defer_list;
>>> +
>>
>> Dumb question: I guess the above brings back Qdisc to the original size
>> ? (pre patch 4/5?) If so, perhaps is worth mentioning somewhere in the
>> commit message.
>
>
> Hmm, I do not think this changes the size.
> The cache line that was starting at busylock had holes.
> You can see that even adding the long and the pointer, we still have room in it.
Ah, nice! I did not take into account the ____cacheline_aligned_in_smp on
a later field (privdata).
Thanks for the clarifying pahole output!
Paolo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
@ 2025-10-08 8:48 ` Toke Høiland-Jørgensen
2025-10-08 9:32 ` Eric Dumazet
2 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 8:48 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Remove busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> After this patch, we get a 300 % improvement on heavy TX workloads.
> - Sending twice the number of packets per second.
> - While consuming 50 % less cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512) 21 | |
> [512, 1K) 631 | |
> [1K, 2K) 27328 |@ |
> [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 19055 |@ |
> [64K, 128K) 17240 |@ |
> [128K, 256K) 25633 |@ |
> [256K, 512K) 4 | |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
> 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K) 483936 |@@@@@@@@@@ |
> [1K, 2K) 265345 |@@@@@@ |
> [2K, 4K) 145463 |@@@ |
> [4K, 8K) 54571 |@ |
> [8K, 16K) 10270 | |
> [16K, 32K) 9385 | |
> [32K, 64K) 7749 | |
> [64K, 128K) 26799 | |
> [128K, 256K) 2665 | |
> [256K, 512K) 665 | |
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
This is very cool! One question below, just to make sure I understand
this correctly:
> ---
> include/net/sch_generic.h | 4 +-
> net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> net/sched/sch_generic.c | 5 ---
> 3 files changed, 52 insertions(+), 42 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -115,7 +115,9 @@ struct Qdisc {
> struct Qdisc *next_sched;
> struct sk_buff_head skb_bad_txq;
>
> - spinlock_t busylock ____cacheline_aligned_in_smp;
> + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> + struct llist_head defer_list;
> +
> spinlock_t seqlock;
>
> struct rcu_head rcu;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> struct net_device *dev,
> struct netdev_queue *txq)
> {
> + struct sk_buff *next, *to_free = NULL;
> spinlock_t *root_lock = qdisc_lock(q);
> - struct sk_buff *to_free = NULL;
> - bool contended;
> + struct llist_node *ll_list, *first_n;
> + unsigned long defer_count = 0;
> int rc;
>
> qdisc_calculate_pkt_len(skb, q);
> @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> return rc;
> }
>
> - /*
> - * Heuristic to force contended enqueues to serialize on a
> - * separate lock before trying to get qdisc main lock.
> - * This permits qdisc->running owner to get the lock more
> - * often and dequeue packets faster.
> - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> - * and then other tasks will only enqueue packets. The packets will be
> - * sent after the qdisc owner is scheduled again. To prevent this
> - * scenario the task always serialize on the lock.
> + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> + * In the try_cmpxchg() loop, we want to increment q->defer_count
> + * at most once to limit the number of skbs in defer_list.
> + * We perform the defer_count increment only if the list is not empty,
> + * because some arches have slow atomic_long_inc_return().
> + */
> + first_n = READ_ONCE(q->defer_list.first);
> + do {
> + if (first_n && !defer_count) {
> + defer_count = atomic_long_inc_return(&q->defer_count);
> + if (unlikely(defer_count > q->limit)) {
> + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> + return NET_XMIT_DROP;
> + }
> + }
> + skb->ll_node.next = first_n;
> + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> +
> + /* If defer_list was not empty, we know the cpu which queued
> + * the first skb will process the whole list for us.
> */
> - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> - if (unlikely(contended))
> - spin_lock(&q->busylock);
> + if (first_n)
> + return NET_XMIT_SUCCESS;
>
> spin_lock(root_lock);
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
I was puzzled by this^. But AFAICT, the idea is that in the
non-contended path we're in here (no other CPU enqueueing packets), we
will still have added the skb to the llist before taking the root_lock,
and here we pull it back off the list, right?
Even though this is technically not needed (we'll always get the same
skb back, I think?), so this is mostly for consistency(?)
Assuming I got this right:
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 8:48 ` Toke Høiland-Jørgensen
@ 2025-10-08 9:32 ` Eric Dumazet
2025-10-08 10:05 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 9:32 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > Remove busylock spinlock and use a lockless list (llist)
> > to reduce spinlock contention to the minimum.
> >
> > Idea is that only one cpu might spin on the qdisc spinlock,
> > while others simply add their skb in the llist.
> >
> > After this patch, we get a 300 % improvement on heavy TX workloads.
> > - Sending twice the number of packets per second.
> > - While consuming 50 % less cycles.
> >
> > Tested:
> >
> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >
> > Before:
> >
> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >
> > Lock contention (1 second sample taken on 8 cores)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >
> > If netperf threads are pinned, spinlock stress is very high.
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >
> > @__dev_queue_xmit_ns:
> > [256, 512) 21 | |
> > [512, 1K) 631 | |
> > [1K, 2K) 27328 |@ |
> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32K, 64K) 19055 |@ |
> > [64K, 128K) 17240 |@ |
> > [128K, 256K) 25633 |@ |
> > [256K, 512K) 4 | |
> >
> > After:
> >
> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >
> > If netperf threads are pinned (~54 Mpps)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >
> > @__dev_queue_xmit_ns:
> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [512, 1K) 483936 |@@@@@@@@@@ |
> > [1K, 2K) 265345 |@@@@@@ |
> > [2K, 4K) 145463 |@@@ |
> > [4K, 8K) 54571 |@ |
> > [8K, 16K) 10270 | |
> > [16K, 32K) 9385 | |
> > [32K, 64K) 7749 | |
> > [64K, 128K) 26799 | |
> > [128K, 256K) 2665 | |
> > [256K, 512K) 665 | |
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> This is very cool! One question below, just to make sure I understand
> this correctly:
>
> > ---
> > include/net/sch_generic.h | 4 +-
> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> > net/sched/sch_generic.c | 5 ---
> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
> > spinlock_t seqlock;
> >
> > struct rcu_head rcu;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > struct net_device *dev,
> > struct netdev_queue *txq)
> > {
> > + struct sk_buff *next, *to_free = NULL;
> > spinlock_t *root_lock = qdisc_lock(q);
> > - struct sk_buff *to_free = NULL;
> > - bool contended;
> > + struct llist_node *ll_list, *first_n;
> > + unsigned long defer_count = 0;
> > int rc;
> >
> > qdisc_calculate_pkt_len(skb, q);
> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > return rc;
> > }
> >
> > - /*
> > - * Heuristic to force contended enqueues to serialize on a
> > - * separate lock before trying to get qdisc main lock.
> > - * This permits qdisc->running owner to get the lock more
> > - * often and dequeue packets faster.
> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> > - * and then other tasks will only enqueue packets. The packets will be
> > - * sent after the qdisc owner is scheduled again. To prevent this
> > - * scenario the task always serialize on the lock.
> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> > + * at most once to limit the number of skbs in defer_list.
> > + * We perform the defer_count increment only if the list is not empty,
> > + * because some arches have slow atomic_long_inc_return().
> > + */
> > + first_n = READ_ONCE(q->defer_list.first);
> > + do {
> > + if (first_n && !defer_count) {
> > + defer_count = atomic_long_inc_return(&q->defer_count);
> > + if (unlikely(defer_count > q->limit)) {
> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> > + return NET_XMIT_DROP;
> > + }
> > + }
> > + skb->ll_node.next = first_n;
> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> > +
> > + /* If defer_list was not empty, we know the cpu which queued
> > + * the first skb will process the whole list for us.
> > */
> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > - if (unlikely(contended))
> > - spin_lock(&q->busylock);
> > + if (first_n)
> > + return NET_XMIT_SUCCESS;
> >
> > spin_lock(root_lock);
> > +
> > + ll_list = llist_del_all(&q->defer_list);
> > + /* There is a small race because we clear defer_count not atomically
> > + * with the prior llist_del_all(). This means defer_list could grow
> > + * over q->limit.
> > + */
> > + atomic_long_set(&q->defer_count, 0);
> > +
> > + ll_list = llist_reverse_order(ll_list);
> > +
> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > - __qdisc_drop(skb, &to_free);
> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> > + __qdisc_drop(skb, &to_free);
> > rc = NET_XMIT_DROP;
> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > - qdisc_run_begin(q)) {
> > + goto unlock;
> > + }
> > + rc = NET_XMIT_SUCCESS;
> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> > /*
> > * This is a work-conserving queue; there are no old skbs
> > * waiting to be sent out; and the qdisc is not running -
> > * xmit the skb directly.
> > */
> >
> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>
> I was puzzled by this^. But AFAICT, the idea is that in the
> non-contended path we're in here (no other CPU enqueueing packets), we
> will still have added the skb to the llist before taking the root_lock,
> and here we pull it back off the list, right?
>
> Even though this is technically not needed (we'll always get the same
> skb back, I think?), so this is mostly for consistency(?)
Exactly. I guess we could instead add an assert like
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
		 * xmit the skb directly.
		 */

-		skb = llist_entry(ll_list, struct sk_buff, ll_node);
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
		qdisc_bstats_update(q, skb);
		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
			__qdisc_run(q);
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 9:32 ` Eric Dumazet
@ 2025-10-08 10:05 ` Toke Høiland-Jørgensen
2025-10-08 10:46 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 10:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > Remove busylock spinlock and use a lockless list (llist)
>> > to reduce spinlock contention to the minimum.
>> >
>> > Idea is that only one cpu might spin on the qdisc spinlock,
>> > while others simply add their skb in the llist.
>> >
>> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> > - Sending twice the number of packets per second.
>> > - While consuming 50 % less cycles.
>> >
>> > Tested:
>> >
>> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >
>> > Before:
>> >
>> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
>> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>> >
>> > Lock contention (1 second sample taken on 8 cores)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
>> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
>> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
>> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>> >
>> > If netperf threads are pinned, spinlock stress is very high.
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
>> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
>> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>> >
>> > @__dev_queue_xmit_ns:
>> > [256, 512) 21 | |
>> > [512, 1K) 631 | |
>> > [1K, 2K) 27328 |@ |
>> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
>> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [32K, 64K) 19055 |@ |
>> > [64K, 128K) 17240 |@ |
>> > [128K, 256K) 25633 |@ |
>> > [256K, 512K) 4 | |
>> >
>> > After:
>> >
>> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >
>> > vmstat 2 5
>> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
>> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
>> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
>> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
>> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>> >
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > contended total wait max wait avg wait type caller
>> >
>> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
>> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
>> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
>> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
>> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
>> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>> >
>> > If netperf threads are pinned (~54 Mpps)
>> > perf lock record -C0-7 sleep 1; perf lock contention
>> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
>> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
>> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
>> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
>> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
>> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
>> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
>> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>> >
>> > @__dev_queue_xmit_ns:
>> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> > [512, 1K) 483936 |@@@@@@@@@@ |
>> > [1K, 2K) 265345 |@@@@@@ |
>> > [2K, 4K) 145463 |@@@ |
>> > [4K, 8K) 54571 |@ |
>> > [8K, 16K) 10270 | |
>> > [16K, 32K) 9385 | |
>> > [32K, 64K) 7749 | |
>> > [64K, 128K) 26799 | |
>> > [128K, 256K) 2665 | |
>> > [256K, 512K) 665 | |
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> This is very cool! One question below, just to make sure I understand
>> this correctly:
>>
>> > ---
>> > include/net/sch_generic.h | 4 +-
>> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
>> > net/sched/sch_generic.c | 5 ---
>> > 3 files changed, 52 insertions(+), 42 deletions(-)
>> >
>> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> > --- a/include/net/sch_generic.h
>> > +++ b/include/net/sch_generic.h
>> > @@ -115,7 +115,9 @@ struct Qdisc {
>> > struct Qdisc *next_sched;
>> > struct sk_buff_head skb_bad_txq;
>> >
>> > - spinlock_t busylock ____cacheline_aligned_in_smp;
>> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>> > + struct llist_head defer_list;
>> > +
>> > spinlock_t seqlock;
>> >
>> > struct rcu_head rcu;
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > struct net_device *dev,
>> > struct netdev_queue *txq)
>> > {
>> > + struct sk_buff *next, *to_free = NULL;
>> > spinlock_t *root_lock = qdisc_lock(q);
>> > - struct sk_buff *to_free = NULL;
>> > - bool contended;
>> > + struct llist_node *ll_list, *first_n;
>> > + unsigned long defer_count = 0;
>> > int rc;
>> >
>> > qdisc_calculate_pkt_len(skb, q);
>> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > return rc;
>> > }
>> >
>> > - /*
>> > - * Heuristic to force contended enqueues to serialize on a
>> > - * separate lock before trying to get qdisc main lock.
>> > - * This permits qdisc->running owner to get the lock more
>> > - * often and dequeue packets faster.
>> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> > - * and then other tasks will only enqueue packets. The packets will be
>> > - * sent after the qdisc owner is scheduled again. To prevent this
>> > - * scenario the task always serialize on the lock.
>> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
>> > + * at most once to limit the number of skbs in defer_list.
>> > + * We perform the defer_count increment only if the list is not empty,
>> > + * because some arches have slow atomic_long_inc_return().
>> > + */
>> > + first_n = READ_ONCE(q->defer_list.first);
>> > + do {
>> > + if (first_n && !defer_count) {
>> > + defer_count = atomic_long_inc_return(&q->defer_count);
>> > + if (unlikely(defer_count > q->limit)) {
>> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> > + return NET_XMIT_DROP;
>> > + }
>> > + }
>> > + skb->ll_node.next = first_n;
>> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> > +
>> > + /* If defer_list was not empty, we know the cpu which queued
>> > + * the first skb will process the whole list for us.
>> > */
>> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> > - if (unlikely(contended))
>> > - spin_lock(&q->busylock);
>> > + if (first_n)
>> > + return NET_XMIT_SUCCESS;
>> >
>> > spin_lock(root_lock);
>> > +
>> > + ll_list = llist_del_all(&q->defer_list);
>> > + /* There is a small race because we clear defer_count not atomically
>> > + * with the prior llist_del_all(). This means defer_list could grow
>> > + * over q->limit.
>> > + */
>> > + atomic_long_set(&q->defer_count, 0);
>> > +
>> > + ll_list = llist_reverse_order(ll_list);
>> > +
>> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> > - __qdisc_drop(skb, &to_free);
>> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> > + __qdisc_drop(skb, &to_free);
>> > rc = NET_XMIT_DROP;
>> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > - qdisc_run_begin(q)) {
>> > + goto unlock;
>> > + }
>> > + rc = NET_XMIT_SUCCESS;
>> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
>> > /*
>> > * This is a work-conserving queue; there are no old skbs
>> > * waiting to be sent out; and the qdisc is not running -
>> > * xmit the skb directly.
>> > */
>> >
>> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>>
>> I was puzzled by this^. But AFAICT, the idea is that in the
>> non-contended path we're in here (no other CPU enqueueing packets), we
>> will still have added the skb to the llist before taking the root_lock,
>> and here we pull it back off the list, right?
>>
>> Even though this is technically not needed (we'll always get the same
>> skb back, I think?), so this is mostly for consistency(?)
>
> Exactly. I guess we could instead add an assert like
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
> * xmit the skb directly.
> */
>
> - skb = llist_entry(ll_list, struct sk_buff, ll_node);
> + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
> struct sk_buff, ll_node));
> qdisc_bstats_update(q, skb);
> if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
OK, so thinking about this some more, isn't there a race between the
if (first_n)
return NET_XMIT_SUCCESS;
and taking the lock? I.e., two different CPUs can pass that check in
which case one of them will end up spinning on the lock, and by the time
it acquires it, there is no longer any guarantee that the skb on the
llist will be the same one that we started with? Or indeed that there
will be any skbs on the list at all?
In which case, shouldn't there be a:
if (unlikely(llist_empty(ll_list)))
goto unlock;
after the llist_del_all() assignment inside the lock? (and we should
retain the skb = llist_entry()) assignment I was confused about).
Or am I missing some reason this can't happen?
-Toke
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 10:05 ` Toke Høiland-Jørgensen
@ 2025-10-08 10:46 ` Eric Dumazet
2025-10-08 12:11 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Eric Dumazet <edumazet@google.com> writes:
> >>
> >> > Remove busylock spinlock and use a lockless list (llist)
> >> > to reduce spinlock contention to the minimum.
> >> >
> >> > Idea is that only one cpu might spin on the qdisc spinlock,
> >> > while others simply add their skb in the llist.
> >> >
> >> > After this patch, we get a 300 % improvement on heavy TX workloads.
> >> > - Sending twice the number of packets per second.
> >> > - While consuming 50 % less cycles.
> >> >
> >> > Tested:
> >> >
> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >> >
> >> > Before:
> >> >
> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >> >
> >> > vmstat 2 5
> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> >> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> >> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >> >
> >> > Lock contention (1 second sample taken on 8 cores)
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> >> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> >> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> >> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >> >
> >> > If netperf threads are pinned, spinlock stress is very high.
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> >> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> >> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >> >
> >> > @__dev_queue_xmit_ns:
> >> > [256, 512) 21 | |
> >> > [512, 1K) 631 | |
> >> > [1K, 2K) 27328 |@ |
> >> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> >> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [32K, 64K) 19055 |@ |
> >> > [64K, 128K) 17240 |@ |
> >> > [128K, 256K) 25633 |@ |
> >> > [256K, 512K) 4 | |
> >> >
> >> > After:
> >> >
> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >> >
> >> > vmstat 2 5
> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> >> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> >> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> >> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> >> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> >> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >> >
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > contended total wait max wait avg wait type caller
> >> >
> >> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> >> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> >> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> >> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> >> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> >> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >> >
> >> > If netperf threads are pinned (~54 Mpps)
> >> > perf lock record -C0-7 sleep 1; perf lock contention
> >> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> >> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> >> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> >> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> >> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> >> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> >> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> >> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >> >
> >> > @__dev_queue_xmit_ns:
> >> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> >> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> > [512, 1K) 483936 |@@@@@@@@@@ |
> >> > [1K, 2K) 265345 |@@@@@@ |
> >> > [2K, 4K) 145463 |@@@ |
> >> > [4K, 8K) 54571 |@ |
> >> > [8K, 16K) 10270 | |
> >> > [16K, 32K) 9385 | |
> >> > [32K, 64K) 7749 | |
> >> > [64K, 128K) 26799 | |
> >> > [128K, 256K) 2665 | |
> >> > [256K, 512K) 665 | |
> >> >
> >> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >>
> >> This is very cool! One question below, just to make sure I understand
> >> this correctly:
> >>
> >> > ---
> >> > include/net/sch_generic.h | 4 +-
> >> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> >> > net/sched/sch_generic.c | 5 ---
> >> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >> >
> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> >> > --- a/include/net/sch_generic.h
> >> > +++ b/include/net/sch_generic.h
> >> > @@ -115,7 +115,9 @@ struct Qdisc {
> >> > struct Qdisc *next_sched;
> >> > struct sk_buff_head skb_bad_txq;
> >> >
> >> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> >> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> >> > + struct llist_head defer_list;
> >> > +
> >> > spinlock_t seqlock;
> >> >
> >> > struct rcu_head rcu;
> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> >> > --- a/net/core/dev.c
> >> > +++ b/net/core/dev.c
> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >> > struct net_device *dev,
> >> > struct netdev_queue *txq)
> >> > {
> >> > + struct sk_buff *next, *to_free = NULL;
> >> > spinlock_t *root_lock = qdisc_lock(q);
> >> > - struct sk_buff *to_free = NULL;
> >> > - bool contended;
> >> > + struct llist_node *ll_list, *first_n;
> >> > + unsigned long defer_count = 0;
> >> > int rc;
> >> >
> >> > qdisc_calculate_pkt_len(skb, q);
> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >> > return rc;
> >> > }
> >> >
> >> > - /*
> >> > - * Heuristic to force contended enqueues to serialize on a
> >> > - * separate lock before trying to get qdisc main lock.
> >> > - * This permits qdisc->running owner to get the lock more
> >> > - * often and dequeue packets faster.
> >> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> >> > - * and then other tasks will only enqueue packets. The packets will be
> >> > - * sent after the qdisc owner is scheduled again. To prevent this
> >> > - * scenario the task always serialize on the lock.
> >> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> >> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> >> > + * at most once to limit the number of skbs in defer_list.
> >> > + * We perform the defer_count increment only if the list is not empty,
> >> > + * because some arches have slow atomic_long_inc_return().
> >> > + */
> >> > + first_n = READ_ONCE(q->defer_list.first);
> >> > + do {
> >> > + if (first_n && !defer_count) {
> >> > + defer_count = atomic_long_inc_return(&q->defer_count);
> >> > + if (unlikely(defer_count > q->limit)) {
> >> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> >> > + return NET_XMIT_DROP;
> >> > + }
> >> > + }
> >> > + skb->ll_node.next = first_n;
> >> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> >> > +
> >> > + /* If defer_list was not empty, we know the cpu which queued
> >> > + * the first skb will process the whole list for us.
> >> > */
> >> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> >> > - if (unlikely(contended))
> >> > - spin_lock(&q->busylock);
> >> > + if (first_n)
> >> > + return NET_XMIT_SUCCESS;
> >> >
> >> > spin_lock(root_lock);
> >> > +
> >> > + ll_list = llist_del_all(&q->defer_list);
> >> > + /* There is a small race because we clear defer_count not atomically
> >> > + * with the prior llist_del_all(). This means defer_list could grow
> >> > + * over q->limit.
> >> > + */
> >> > + atomic_long_set(&q->defer_count, 0);
> >> > +
> >> > + ll_list = llist_reverse_order(ll_list);
> >> > +
> >> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> >> > - __qdisc_drop(skb, &to_free);
> >> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> >> > + __qdisc_drop(skb, &to_free);
> >> > rc = NET_XMIT_DROP;
> >> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> >> > - qdisc_run_begin(q)) {
> >> > + goto unlock;
> >> > + }
> >> > + rc = NET_XMIT_SUCCESS;
> >> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> >> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> >> > /*
> >> > * This is a work-conserving queue; there are no old skbs
> >> > * waiting to be sent out; and the qdisc is not running -
> >> > * xmit the skb directly.
> >> > */
> >> >
> >> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> >>
> >> I was puzzled by this^. But AFAICT, the idea is that in the
> >> non-contended path we're in here (no other CPU enqueueing packets), we
> >> will still have added the skb to the llist before taking the root_lock,
> >> and here we pull it back off the list, right?
> >>
> >> Even though this is technically not needed (we'll always get the same
> >> skb back, I think?), so this is mostly for consistency(?)
> >
> > Exactly. I guess we could instead add an assert like
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> > * xmit the skb directly.
> > */
> >
> > - skb = llist_entry(ll_list, struct sk_buff, ll_node);
> > + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
> > struct sk_buff, ll_node));
> > qdisc_bstats_update(q, skb);
> > if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> > __qdisc_run(q);
>
> OK, so thinking about this some more, isn't there a race between the
>
> if (first_n)
> return NET_XMIT_SUCCESS;
>
> and taking the lock? I.e., two different CPUs can pass that check in
> which case one of them will end up spinning on the lock, and by the time
> it acquires it, there is no longer any guarantee that the skb on the
> llist will be the same one that we started with? Or indeed that there
> will be any skbs on the list at all?
Only one cpu can observe the list was empty.
This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().
So after this single cpu passes the spin_lock() barrier, the list has at
least one skb in it.
If there is a single skb in the list after the xchg() got it
atomically, then it must be its own skb.
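Put differently, stripped of the defer_count limit that the patch open-codes,
the ownership rule is the usual llist pattern (a sketch, not the exact patch
code):

	/* llist_add() returns true only for the cpu that found the
	 * list empty; exactly that cpu takes the qdisc lock and
	 * flushes everything queued so far, its own skb included.
	 */
	if (!llist_add(&skb->ll_node, &q->defer_list))
		return NET_XMIT_SUCCESS;	/* the list owner will flush this skb */

	spin_lock(root_lock);
	ll_list = llist_del_all(&q->defer_list);	/* never empty here */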
>
> In which case, shouldn't there be a:
>
> if (unlikely(llist_empty(ll_list)))
> goto unlock;
>
> after the llist_del_all() assignment inside the lock? (and we should
> retain the skb = llist_entry()) assignment I was confused about).
>
> Or am I missing some reason this can't happen?
>
> -Toke
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
2025-10-08 10:46 ` Eric Dumazet
@ 2025-10-08 12:11 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-08 12:11 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
Eric Dumazet <edumazet@google.com> writes:
> On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Eric Dumazet <edumazet@google.com> writes:
>> >>
>> >> > Remove busylock spinlock and use a lockless list (llist)
>> >> > to reduce spinlock contention to the minimum.
>> >> >
>> >> > Idea is that only one cpu might spin on the qdisc spinlock,
>> >> > while others simply add their skb in the llist.
>> >> >
>> >> > After this patch, we get a 300 % improvement on heavy TX workloads.
>> >> > - Sending twice the number of packets per second.
>> >> > - While consuming 50 % less cycles.
>> >> >
>> >> > Tested:
>> >> >
>> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >> >
>> >> > Before:
>> >> >
>> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>> >> >
>> >> > Lock contention (1 second sample taken on 8 cores)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
>> >> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
>> >> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
>> >> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>> >> >
>> >> > If netperf threads are pinned, spinlock stress is very high.
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
>> >> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
>> >> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [256, 512) 21 | |
>> >> > [512, 1K) 631 | |
>> >> > [1K, 2K) 27328 |@ |
>> >> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
>> >> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [32K, 64K) 19055 |@ |
>> >> > [64K, 128K) 17240 |@ |
>> >> > [128K, 256K) 25633 |@ |
>> >> > [256K, 512K) 4 | |
>> >> >
>> >> > After:
>> >> >
>> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
>> >> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
>> >> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
>> >> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
>> >> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>> >> >
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
>> >> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
>> >> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
>> >> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
>> >> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
>> >> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>> >> >
>> >> > If netperf threads are pinned (~54 Mpps)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
>> >> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
>> >> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
>> >> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
>> >> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
>> >> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
>> >> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
>> >> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [512, 1K) 483936 |@@@@@@@@@@ |
>> >> > [1K, 2K) 265345 |@@@@@@ |
>> >> > [2K, 4K) 145463 |@@@ |
>> >> > [4K, 8K) 54571 |@ |
>> >> > [8K, 16K) 10270 | |
>> >> > [16K, 32K) 9385 | |
>> >> > [32K, 64K) 7749 | |
>> >> > [64K, 128K) 26799 | |
>> >> > [128K, 256K) 2665 | |
>> >> > [256K, 512K) 665 | |
>> >> >
>> >> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >>
>> >> This is very cool! One question below, just to make sure I understand
>> >> this correctly:
>> >>
>> >> > ---
>> >> > include/net/sch_generic.h | 4 +-
>> >> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
>> >> > net/sched/sch_generic.c | 5 ---
>> >> > 3 files changed, 52 insertions(+), 42 deletions(-)
>> >> >
>> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> >> > --- a/include/net/sch_generic.h
>> >> > +++ b/include/net/sch_generic.h
>> >> > @@ -115,7 +115,9 @@ struct Qdisc {
>> >> > struct Qdisc *next_sched;
>> >> > struct sk_buff_head skb_bad_txq;
>> >> >
>> >> > - spinlock_t busylock ____cacheline_aligned_in_smp;
>> >> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>> >> > + struct llist_head defer_list;
>> >> > +
>> >> > spinlock_t seqlock;
>> >> >
>> >> > struct rcu_head rcu;
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > struct net_device *dev,
>> >> > struct netdev_queue *txq)
>> >> > {
>> >> > + struct sk_buff *next, *to_free = NULL;
>> >> > spinlock_t *root_lock = qdisc_lock(q);
>> >> > - struct sk_buff *to_free = NULL;
>> >> > - bool contended;
>> >> > + struct llist_node *ll_list, *first_n;
>> >> > + unsigned long defer_count = 0;
>> >> > int rc;
>> >> >
>> >> > qdisc_calculate_pkt_len(skb, q);
>> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > return rc;
>> >> > }
>> >> >
>> >> > - /*
>> >> > - * Heuristic to force contended enqueues to serialize on a
>> >> > - * separate lock before trying to get qdisc main lock.
>> >> > - * This permits qdisc->running owner to get the lock more
>> >> > - * often and dequeue packets faster.
>> >> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> >> > - * and then other tasks will only enqueue packets. The packets will be
>> >> > - * sent after the qdisc owner is scheduled again. To prevent this
>> >> > - * scenario the task always serialize on the lock.
>> >> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> >> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
>> >> > + * at most once to limit the number of skbs in defer_list.
>> >> > + * We perform the defer_count increment only if the list is not empty,
>> >> > + * because some arches have slow atomic_long_inc_return().
>> >> > + */
>> >> > + first_n = READ_ONCE(q->defer_list.first);
>> >> > + do {
>> >> > + if (first_n && !defer_count) {
>> >> > + defer_count = atomic_long_inc_return(&q->defer_count);
>> >> > + if (unlikely(defer_count > q->limit)) {
>> >> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> >> > + return NET_XMIT_DROP;
>> >> > + }
>> >> > + }
>> >> > + skb->ll_node.next = first_n;
>> >> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> >> > +
>> >> > + /* If defer_list was not empty, we know the cpu which queued
>> >> > + * the first skb will process the whole list for us.
>> >> > */
>> >> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> >> > - if (unlikely(contended))
>> >> > - spin_lock(&q->busylock);
>> >> > + if (first_n)
>> >> > + return NET_XMIT_SUCCESS;
>> >> >
>> >> > spin_lock(root_lock);
>> >> > +
>> >> > + ll_list = llist_del_all(&q->defer_list);
>> >> > + /* There is a small race because we clear defer_count not atomically
>> >> > + * with the prior llist_del_all(). This means defer_list could grow
>> >> > + * over q->limit.
>> >> > + */
>> >> > + atomic_long_set(&q->defer_count, 0);
>> >> > +
>> >> > + ll_list = llist_reverse_order(ll_list);
>> >> > +
>> >> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> >> > - __qdisc_drop(skb, &to_free);
>> >> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> >> > + __qdisc_drop(skb, &to_free);
>> >> > rc = NET_XMIT_DROP;
>> >> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > - qdisc_run_begin(q)) {
>> >> > + goto unlock;
>> >> > + }
>> >> > + rc = NET_XMIT_SUCCESS;
>> >> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
>> >> > /*
>> >> > * This is a work-conserving queue; there are no old skbs
>> >> > * waiting to be sent out; and the qdisc is not running -
>> >> > * xmit the skb directly.
>> >> > */
>> >> >
>> >> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> >>
>> >> I was puzzled by this^. But AFAICT, the idea is that in the
>> >> non-contended path we're in here (no other CPU enqueueing packets), we
>> >> will still have added the skb to the llist before taking the root_lock,
>> >> and here we pull it back off the list, right?
>> >>
>> >> Even though this is technically not needed (we'll always get the same
>> >> skb back, I think?), so this is mostly for consistency(?)
>> >
>> > Exactly. I guess we could instead add an assert like
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > * xmit the skb directly.
>> > */
>> >
>> > - skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> > + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
>> > qdisc_bstats_update(q, skb);
>> > if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>> > __qdisc_run(q);
>>
>> OK, so thinking about this some more, isn't there a race between the
>>
>> if (first_n)
>> return NET_XMIT_SUCCESS;
>>
>> and taking the lock? I.e., two different CPUs can pass that check in
>> which case one of them will end up spinning on the lock, and by the time
>> it acquires it, there is no longer any guarantee that the skb on the
>> llist will be the same one that we started with? Or indeed that there
>> will be any skbs on the list at all?
>
> Only one cpu can observe the list was empty.
>
> This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().
Ah, right, missed the bit where try_cmpxchg would update first_n if it
doesn't succeed...
> So after this only cpu passes the spinlock() barrier, the list has at
> least one skb in it.
>
> If there is a single skb in the list after the xchg() got it
> atomically, then it must be its own skb.
Gotcha! In that case I agree the DEBUG_NET_WARN_ON_ONCE() is clearer
than reassigning the skb pointer.
-Toke
^ permalink raw reply [flat|nested] 23+ messages in thread
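The enqueue/drain contract described in the exchange above can be sketched with plain C11 atomics. This is only an illustration of the idea, not the kernel llist or Qdisc API: the names deferq, deferq_add and deferq_del_all are invented for the example. The point it shows is that exactly one enqueuer can observe the list empty, and that enqueuer is the one obliged to take the lock and drain everybody's entries; all the others just push and return.

/* Minimal user-space sketch of the pattern discussed above.
 * Illustrative only; not the kernel llist/Qdisc code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next;
	int val;
};

struct deferq {
	_Atomic(struct node *) first;
	atomic_long count;	/* approximate occupancy, like defer_count */
	long limit;
};

/* Returns 1 if the caller saw the list empty (it must drain),
 * 0 if someone else will drain, -1 if over the limit (caller drops).
 */
static int deferq_add(struct deferq *q, struct node *n)
{
	struct node *first = atomic_load(&q->first);
	bool counted = false;

	do {
		/* Count at most once, and only when the list is not empty,
		 * mirroring the defer_count heuristic in the patch.
		 */
		if (first && !counted) {
			counted = true;
			if (atomic_fetch_add(&q->count, 1) + 1 > q->limit)
				return -1;
		}
		n->next = first;
	} while (!atomic_compare_exchange_weak(&q->first, &first, n));

	/* Only the enqueuer whose cmpxchg installed the first element can
	 * see a NULL old head; that caller becomes the drainer.
	 */
	return first == NULL;
}

static struct node *deferq_del_all(struct deferq *q)
{
	struct node *list = atomic_exchange(&q->first, NULL);

	/* Cleared after the exchange, not atomically with it: the same
	 * small race the patch comment mentions.
	 */
	atomic_store(&q->count, 0);
	return list;	/* newest-first, hence llist_reverse_order() upstream */
}

int main(void)
{
	struct deferq q = { .limit = 4 };
	struct node a = { .val = 1 }, b = { .val = 2 };

	printf("a must drain: %d\n", deferq_add(&q, &a));	/* 1 */
	printf("b must drain: %d\n", deferq_add(&q, &b));	/* 0 */

	for (struct node *n = deferq_del_all(&q); n; n = n->next)
		printf("drained %d\n", n->val);
	return 0;
}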
* Re: [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state()
2025-10-07 19:41 ` Maciej Fijalkowski
@ 2025-10-09 8:37 ` Jason Xing
0 siblings, 0 replies; 23+ messages in thread
From: Jason Xing @ 2025-10-09 8:37 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Alexander Lobakin, Eric Dumazet, David S . Miller, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jamal Hadi Salim, Cong Wang,
Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet
On Wed, Oct 8, 2025 at 3:42 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Oct 07, 2025 at 05:26:46PM +0200, Alexander Lobakin wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 6 Oct 2025 19:30:59 +0000
> >
> > > While stress testing UDP senders on a host with expensive indirect
> > > calls, I found cpus processing TX completions were showing
> > > a very high cost (20%) in sock_wfree() due to
> > > CONFIG_MITIGATION_RETPOLINE=y.
> > >
> > > Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > > net/core/skbuff.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..c9c06f9a8d6085f8d0907b412e050a60c835a6e8 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -1136,7 +1136,9 @@ void skb_release_head_state(struct sk_buff *skb)
> > > skb_dst_drop(skb);
> > > if (skb->destructor) {
> > > DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > > - skb->destructor(skb);
> > > + INDIRECT_CALL_3(skb->destructor,
> > > + tcp_wfree, __sock_wfree, sock_wfree,
> > > + skb);
> >
> > Not sure, but maybe we could add generic XSk skb destructor here as
> > well?
I added the following snippet[1] and only saw a stable ~1% improvement
when sending 64-byte packets with xdpsock.
I'm not so sure it deserves a follow-up patch to Eric's series. Better
than nothing? Any ideas on this one?
[1]
INDIRECT_CALL_4(skb->destructor, tcp_wfree, __sock_wfree, sock_wfree,
xsk_destruct_skb, skb);
> > Or it's not that important as generic XSk is not the best way to
> > use XDP sockets?
Yes, it surely matters. At least, virtio_net and veth need this copy
mode. And I've been working on batch xmit to ramp up the generic path.
> >
> > Maciej, what do you think?
>
> I would appreciate it as there have been various attempts to optimize xsk
> generic xmit path.
So do I!
Thanks,
Jason
>
> >
> > > }
> > > #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> > > nf_conntrack_put(skb_nfct(skb));
> >
> > Thanks,
> > Olek
>
^ permalink raw reply [flat|nested] 23+ messages in thread
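For context on the INDIRECT_CALL_*() helpers being extended in this sub-thread: the idea is to compare the function pointer against a short list of likely targets so that the common case becomes a direct, branch-predictable call instead of a retpoline-protected indirect one. Below is a stand-alone sketch of that idea; CALL_2 and the two destructor stand-ins are invented names for the example, not the kernel macros or functions.

#include <stdio.h>

static void tcp_like_wfree(int *cost)  { *cost += 1; }
static void sock_like_wfree(int *cost) { *cost += 2; }

/* Devirtualize: try the expected targets first, fall back to the
 * indirect call only if neither matches.
 */
#define CALL_2(f, f1, f2, ...)				\
	((f) == (f1) ? (f1)(__VA_ARGS__) :		\
	 (f) == (f2) ? (f2)(__VA_ARGS__) :		\
		       (f)(__VA_ARGS__))

int main(void)
{
	void (*destructor)(int *) = sock_like_wfree;
	int cost = 0;

	/* Matches the second candidate, so a direct call is taken. */
	CALL_2(destructor, tcp_like_wfree, sock_like_wfree, &cost);
	printf("cost = %d\n", cost);	/* prints 2 */
	return 0;
}

Adding more candidates, as with the xsk_destruct_skb idea above, trades a few extra pointer compares for avoiding the indirect branch in more cases.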
* Re: [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-12 18:34 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Jamal Hadi Salim @ 2025-10-12 15:22 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Mon, Oct 6, 2025 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
> on device recursion") because it adds code in the fast path, even when
> act_mirred is not used.
>
> Use an additional device pointers array in struct netdev_xmit
> and implement loop detection in tcf_mirred_is_act_redirect().
Patch series looks good!
This has the potential of (later on) fixing issues that are currently
broken after the TTL bits were taken away.
Small suggestion: the commit message was a bit confusing to me. How about:
Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device
recursion") adds code in the fast path, even when act_mirred is not
used. We revert it in the next patch.
Prepare by adding an additional device pointers array in struct
netdev_xmit and implement loop detection in
tcf_mirred_is_act_redirect().
Please give us time to run tests on this set!
cheers,
jamal
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/netdevice_xmit.h | 9 ++++-
> net/sched/act_mirred.c | 62 +++++++++++++---------------------
> 2 files changed, 31 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
> index 813a19122ebbb2c6a04176330b1055b7c2b9c902..cc232508e695eefe95ea6e55a21978be11d5da83 100644
> --- a/include/linux/netdevice_xmit.h
> +++ b/include/linux/netdevice_xmit.h
> @@ -2,6 +2,12 @@
> #ifndef _LINUX_NETDEVICE_XMIT_H
> #define _LINUX_NETDEVICE_XMIT_H
>
> +#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
> +#define MIRRED_NEST_LIMIT 4
> +#endif
> +
> +struct net_device;
> +
> struct netdev_xmit {
> u16 recursion;
> u8 more;
> @@ -9,7 +15,8 @@ struct netdev_xmit {
> u8 skip_txqueue;
> #endif
> #if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
> - u8 sched_mirred_nest;
> + u8 sched_mirred_nest;
> + struct net_device *sched_mirred_dev[MIRRED_NEST_LIMIT];
> #endif
> #if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
> u8 nf_dup_skb_recursion;
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 5f01f567c934d3669d9a3058cff861a8fe5f88b6..f27b583def78e4afecc7112854b93d59c2520201 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -29,31 +29,6 @@
> static LIST_HEAD(mirred_list);
> static DEFINE_SPINLOCK(mirred_list_lock);
>
> -#define MIRRED_NEST_LIMIT 4
> -
> -#ifndef CONFIG_PREEMPT_RT
> -static u8 tcf_mirred_nest_level_inc_return(void)
> -{
> - return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
> -}
> -
> -static void tcf_mirred_nest_level_dec(void)
> -{
> - __this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
> -}
> -
> -#else
> -static u8 tcf_mirred_nest_level_inc_return(void)
> -{
> - return current->net_xmit.sched_mirred_nest++;
> -}
> -
> -static void tcf_mirred_nest_level_dec(void)
> -{
> - current->net_xmit.sched_mirred_nest--;
> -}
> -#endif
> -
> static bool tcf_mirred_is_act_redirect(int action)
> {
> return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
> @@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
> {
> struct tcf_mirred *m = to_mirred(a);
> int retval = READ_ONCE(m->tcf_action);
> - unsigned int nest_level;
> + struct netdev_xmit *xmit;
> bool m_mac_header_xmit;
> struct net_device *dev;
> - int m_eaction;
> + int i, m_eaction;
> u32 blockid;
>
> - nest_level = tcf_mirred_nest_level_inc_return();
> - if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
> +#ifdef CONFIG_PREEMPT_RT
> + xmit = ¤t->net_xmit;
> +#else
> + xmit = this_cpu_ptr(&softnet_data.xmit);
> +#endif
> + if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
> net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
> netdev_name(skb->dev));
> - retval = TC_ACT_SHOT;
> - goto dec_nest_level;
> + return TC_ACT_SHOT;
> }
>
> tcf_lastuse_update(&m->tcf_tm);
> tcf_action_update_bstats(&m->common, skb);
>
> blockid = READ_ONCE(m->tcfm_blockid);
> - if (blockid) {
> - retval = tcf_blockcast(skb, m, blockid, res, retval);
> - goto dec_nest_level;
> - }
> + if (blockid)
> + return tcf_blockcast(skb, m, blockid, res, retval);
>
> dev = rcu_dereference_bh(m->tcfm_dev);
> if (unlikely(!dev)) {
> pr_notice_once("tc mirred: target device is gone\n");
> tcf_action_inc_overlimit_qstats(&m->common);
> - goto dec_nest_level;
> + return retval;
> }
> + for (i = 0; i < xmit->sched_mirred_nest; i++) {
> + if (xmit->sched_mirred_dev[i] != dev)
> + continue;
> + pr_notice_once("tc mirred: loop on device %s\n",
> + netdev_name(dev));
> + tcf_action_inc_overlimit_qstats(&m->common);
> + return retval;
> + }
> +
> + xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
>
> m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
> m_eaction = READ_ONCE(m->tcfm_eaction);
>
> retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
> retval);
> -
> -dec_nest_level:
> - tcf_mirred_nest_level_dec();
> + xmit->sched_mirred_nest--;
>
> return retval;
> }
> --
> 2.51.0.618.g983fd99d29-goog
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection
2025-10-12 15:22 ` Jamal Hadi Salim
@ 2025-10-12 18:34 ` Eric Dumazet
0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2025-10-12 18:34 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Sun, Oct 12, 2025 at 8:23 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Mon, Oct 6, 2025 at 3:31 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > We want to revert commit 0f022d32c3ec ("net/sched: Fix mirred deadlock
> > on device recursion") because it adds code in the fast path, even when
> > act_mirred is not used.
> >
> > Use an additional device pointers array in struct netdev_xmit
> > and implement loop detection in tcf_mirred_is_act_redirect().
>
> Patch series looks good!
> This has the potential of (later on) fixing issue that are currently
> broken after the TTL bits were taken away.
> Small suggestion, the commit message was a bit confusing to me. How about:
>
> Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device
> recursion") it adds code in the fast path, even when act_mirred is not
> used. We revert in the next patch.
>
> Prepare by adding an additional device pointers array in struct
> netdev_xmit and implement loop detection in
> tcf_mirred_is_act_redirect().
>
> Please give us time to run tests on this set!
SGTM, I will send when net-next reopens, with an amended changelog
net/sched: act_mirred: add loop detection
Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
added code in the fast path, even when act_mirred is not used.
Prepare its revert by implementing loop detection in act_mirred.
Adds an array of device pointers in struct netdev_xmit.
tcf_mirred_is_act_redirect() can detect if the array
already contains the target device.
^ permalink raw reply [flat|nested] 23+ messages in thread
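A rough stand-alone illustration of the loop-detection scheme agreed on in this sub-thread: remember which devices the packet has already been redirected to, up to a small nesting limit, and refuse a redirect that would revisit one of them. The names xmit_ctx and redirect_allowed are invented for the sketch; they are not the kernel structures or helpers.

#include <stdbool.h>
#include <stdio.h>

#define NEST_LIMIT 4

struct xmit_ctx {
	unsigned int nest;
	const void *seen_dev[NEST_LIMIT];
};

static bool redirect_allowed(struct xmit_ctx *ctx, const void *dev)
{
	if (ctx->nest >= NEST_LIMIT)
		return false;			/* nesting too deep */
	for (unsigned int i = 0; i < ctx->nest; i++)
		if (ctx->seen_dev[i] == dev)
			return false;		/* loop: device already visited */
	ctx->seen_dev[ctx->nest++] = dev;
	return true;
}

int main(void)
{
	struct xmit_ctx ctx = { 0 };
	int eth0, eth1;				/* stand-ins for net_device */

	printf("%d\n", redirect_allowed(&ctx, &eth0));	/* 1 */
	printf("%d\n", redirect_allowed(&ctx, &eth1));	/* 1 */
	printf("%d\n", redirect_allowed(&ctx, &eth0));	/* 0: loop detected */
	return 0;
}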
end of thread
Thread overview: 23+ messages
2025-10-06 19:30 [PATCH RFC net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-06 19:30 ` [PATCH RFC net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-06 19:38 ` Eric Dumazet
2025-10-07 15:26 ` Alexander Lobakin
2025-10-07 19:41 ` Maciej Fijalkowski
2025-10-09 8:37 ` Jason Xing
2025-10-06 19:31 ` [PATCH RFC net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-12 15:22 ` Jamal Hadi Salim
2025-10-12 18:34 ` Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-06 19:31 ` [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-07 11:37 ` Eric Dumazet
2025-10-08 6:37 ` Paolo Abeni
2025-10-08 7:32 ` Eric Dumazet
2025-10-08 7:44 ` Paolo Abeni
2025-10-08 8:48 ` Toke Høiland-Jørgensen
2025-10-08 9:32 ` Eric Dumazet
2025-10-08 10:05 ` Toke Høiland-Jørgensen
2025-10-08 10:46 ` Eric Dumazet
2025-10-08 12:11 ` Toke Høiland-Jørgensen
2025-10-07 5:23 ` [syzbot ci] Re: net: optimize TX throughput and efficiency syzbot ci
2025-10-07 7:41 ` Eric Dumazet