* [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency
@ 2025-10-14 17:19 Eric Dumazet
2025-10-14 17:19 ` [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt Eric Dumazet
` (7 more replies)
0 siblings, 8 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
In this series, I replace the busylock spinlock we have in
__dev_queue_xmit() with a lockless list (llist) to reduce
spinlock contention to the minimum.
The idea is that only one cpu might spin on the qdisc spinlock,
while the others simply add their skb to the llist.
After this series, we get a 300% (4x) efficiency improvement on heavy
TX workloads: twice the number of packets per second, for half the cpu cycles.
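For reviewers who want the gist before reading patch 6/6, here is a
minimal sketch of the new enqueue scheme. It reuses the names from that
patch but deliberately omits the defer_count based queue limit, the
TCQ_F_CAN_BYPASS path and the deactivated-qdisc handling (the real patch
also open-codes llist_add() to enforce the limit):
	/* Every cpu pushes its skb onto a lockless list. Only the cpu
	 * that found the list empty takes the qdisc spinlock and
	 * drains the whole list into the qdisc.
	 */
	if (llist_add(&skb->ll_node, &q->defer_list)) {
		struct llist_node *ll_list;
		spin_lock(qdisc_lock(q));
		ll_list = llist_reverse_order(llist_del_all(&q->defer_list));
		llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
			skb_mark_not_on_list(skb);
			dev_qdisc_enqueue(skb, q, &to_free, txq);
		}
		qdisc_run(q);
		spin_unlock(qdisc_lock(q));
	}
	/* else: another cpu already owns the flush, nothing more to do. */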
v2: deflake tcp_user_timeout_user-timeout-probe.pkt.
Ability to return a different code than NET_XMIT_SUCCESS
when __dev_xmit_skb() has a single skb to send.
Eric Dumazet (6):
selftests/net: packetdrill: unflake
tcp_user_timeout_user-timeout-probe.pkt
net: add add indirect call wrapper in skb_release_head_state()
net/sched: act_mirred: add loop detection
Revert "net/sched: Fix mirred deadlock on device recursion"
net: sched: claim one cache line in Qdisc
net: dev_queue_xmit() llist adoption
include/linux/netdevice_xmit.h | 9 +-
include/net/sch_generic.h | 23 ++---
net/core/dev.c | 97 +++++++++++--------
net/core/skbuff.c | 11 ++-
net/sched/act_mirred.c | 62 +++++-------
net/sched/sch_generic.c | 7 --
.../tcp_user_timeout_user-timeout-probe.pkt | 6 +-
7 files changed, 111 insertions(+), 104 deletions(-)
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 5:54 ` Kuniyuki Iwashima
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
` (6 subsequent siblings)
7 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet, Soham Chakradeo
This test fails the first time it is run after a fresh virtme-ng boot.
tcp_user_timeout_user-timeout-probe.pkt:33: runtime error in write call: Expected result -1 but got 24 with errno 2 (No such file or directory)
Tweak the timings a bit to reduce flakiness.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soham Chakradeo <sohamch@google.com>
Cc: Willem de Bruijn <willemb@google.com>
---
.../net/packetdrill/tcp_user_timeout_user-timeout-probe.pkt | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/packetdrill/tcp_user_timeout_user-timeout-probe.pkt b/tools/testing/selftests/net/packetdrill/tcp_user_timeout_user-timeout-probe.pkt
index 183051ba0cae..6882b8240a8a 100644
--- a/tools/testing/selftests/net/packetdrill/tcp_user_timeout_user-timeout-probe.pkt
+++ b/tools/testing/selftests/net/packetdrill/tcp_user_timeout_user-timeout-probe.pkt
@@ -23,14 +23,16 @@
// install a qdisc dropping all packets
+0 `tc qdisc delete dev tun0 root 2>/dev/null ; tc qdisc add dev tun0 root pfifo limit 0`
+
+0 write(4, ..., 24) = 24
// When qdisc is congested we retry every 500ms
// (TCP_RESOURCE_PROBE_INTERVAL) and therefore
// we retry 6 times before hitting 3s timeout.
// First verify that the connection is alive:
-+3.250 write(4, ..., 24) = 24
++3 write(4, ..., 24) = 24
+
// Now verify that shortly after that the socket is dead:
- +.100 write(4, ..., 24) = -1 ETIMEDOUT (Connection timed out)
++1 write(4, ..., 24) = -1 ETIMEDOUT (Connection timed out)
+0 %{ assert tcpi_probes == 6, tcpi_probes; \
assert tcpi_backoff == 0, tcpi_backoff }%
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-14 17:19 ` [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 3:57 ` Kuniyuki Iwashima
` (2 more replies)
2025-10-14 17:19 ` [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection Eric Dumazet
` (5 subsequent siblings)
7 siblings, 3 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
While stress testing UDP senders on a host with expensive indirect
calls, I found that cpus processing TX completions were showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.
Take care of the TCP and UDP TX destructors with the INDIRECT_CALL_3() macro.
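For context, with CONFIG_MITIGATION_RETPOLINE=y the INDIRECT_CALL_3()
used below is conceptually equivalent to the compare-and-direct-call
chain sketched here (a simplified illustration, not the literal macro
expansion), so the common destructors avoid the retpoline thunk:
	if (skb->destructor == tcp_wfree)
		tcp_wfree(skb);
	else if (skb->destructor == __sock_wfree)
		__sock_wfree(skb);
	else if (skb->destructor == sock_wfree)
		sock_wfree(skb);
	else
		skb->destructor(skb); /* retpoline only for other destructors */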
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/skbuff.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc12790017b0..692e3a70e75e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
skb_dst_drop(skb);
if (skb->destructor) {
DEBUG_NET_WARN_ON_ONCE(in_hardirq());
- skb->destructor(skb);
+#ifdef CONFIG_INET
+ INDIRECT_CALL_3(skb->destructor,
+ tcp_wfree, __sock_wfree, sock_wfree,
+ skb);
+#else
+ INDIRECT_CALL_1(skb->destructor,
+ sock_wfree,
+ skb);
+
+#endif
}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
nf_conntrack_put(skb_nfct(skb));
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-14 17:19 ` [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt Eric Dumazet
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 5:24 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
` (4 subsequent siblings)
7 siblings, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
added code in the fast path, even when act_mirred is not used.
Prepare its revert by implementing loop detection in act_mirred.
Add an array of device pointers to struct netdev_xmit.
tcf_mirred_act() can then detect whether the array
already contains the target device.
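The check is a linear scan of this small per-cpu (per-task on
PREEMPT_RT) array; a hypothetical helper equivalent to what
tcf_mirred_act() open-codes in the diff below:
	/* Hypothetical helper, equivalent to the open-coded loop in
	 * tcf_mirred_act(). With MIRRED_NEST_LIMIT == 4 the scan is cheap.
	 */
	static bool tcf_mirred_dev_seen(const struct netdev_xmit *xmit,
					const struct net_device *dev)
	{
		int i;
		for (i = 0; i < xmit->sched_mirred_nest; i++)
			if (xmit->sched_mirred_dev[i] == dev)
				return true;
		return false;
	}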
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice_xmit.h | 9 ++++-
net/sched/act_mirred.c | 62 +++++++++++++---------------------
2 files changed, 31 insertions(+), 40 deletions(-)
diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
index 813a19122ebb..cc232508e695 100644
--- a/include/linux/netdevice_xmit.h
+++ b/include/linux/netdevice_xmit.h
@@ -2,6 +2,12 @@
#ifndef _LINUX_NETDEVICE_XMIT_H
#define _LINUX_NETDEVICE_XMIT_H
+#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
+#define MIRRED_NEST_LIMIT 4
+#endif
+
+struct net_device;
+
struct netdev_xmit {
u16 recursion;
u8 more;
@@ -9,7 +15,8 @@ struct netdev_xmit {
u8 skip_txqueue;
#endif
#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
- u8 sched_mirred_nest;
+ u8 sched_mirred_nest;
+ struct net_device *sched_mirred_dev[MIRRED_NEST_LIMIT];
#endif
#if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
u8 nf_dup_skb_recursion;
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5f01f567c934..f27b583def78 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -29,31 +29,6 @@
static LIST_HEAD(mirred_list);
static DEFINE_SPINLOCK(mirred_list_lock);
-#define MIRRED_NEST_LIMIT 4
-
-#ifndef CONFIG_PREEMPT_RT
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- __this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
-}
-
-#else
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
- return current->net_xmit.sched_mirred_nest++;
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
- current->net_xmit.sched_mirred_nest--;
-}
-#endif
-
static bool tcf_mirred_is_act_redirect(int action)
{
return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
@@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
{
struct tcf_mirred *m = to_mirred(a);
int retval = READ_ONCE(m->tcf_action);
- unsigned int nest_level;
+ struct netdev_xmit *xmit;
bool m_mac_header_xmit;
struct net_device *dev;
- int m_eaction;
+ int i, m_eaction;
u32 blockid;
- nest_level = tcf_mirred_nest_level_inc_return();
- if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
+#ifdef CONFIG_PREEMPT_RT
+ xmit = &current->net_xmit;
+#else
+ xmit = this_cpu_ptr(&softnet_data.xmit);
+#endif
+ if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
netdev_name(skb->dev));
- retval = TC_ACT_SHOT;
- goto dec_nest_level;
+ return TC_ACT_SHOT;
}
tcf_lastuse_update(&m->tcf_tm);
tcf_action_update_bstats(&m->common, skb);
blockid = READ_ONCE(m->tcfm_blockid);
- if (blockid) {
- retval = tcf_blockcast(skb, m, blockid, res, retval);
- goto dec_nest_level;
- }
+ if (blockid)
+ return tcf_blockcast(skb, m, blockid, res, retval);
dev = rcu_dereference_bh(m->tcfm_dev);
if (unlikely(!dev)) {
pr_notice_once("tc mirred: target device is gone\n");
tcf_action_inc_overlimit_qstats(&m->common);
- goto dec_nest_level;
+ return retval;
}
+ for (i = 0; i < xmit->sched_mirred_nest; i++) {
+ if (xmit->sched_mirred_dev[i] != dev)
+ continue;
+ pr_notice_once("tc mirred: loop on device %s\n",
+ netdev_name(dev));
+ tcf_action_inc_overlimit_qstats(&m->common);
+ return retval;
+ }
+
+ xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
m_eaction = READ_ONCE(m->tcfm_eaction);
retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
retval);
-
-dec_nest_level:
- tcf_mirred_nest_level_dec();
+ xmit->sched_mirred_nest--;
return retval;
}
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
` (2 preceding siblings ...)
2025-10-14 17:19 ` [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 5:26 ` Kuniyuki Iwashima
` (2 more replies)
2025-10-14 17:19 ` [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc Eric Dumazet
` (3 subsequent siblings)
7 siblings, 3 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
The prior patch in this series implemented loop detection
in act_mirred, so we can remove q->owner to save some cycles
in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 1 -
net/core/dev.c | 6 ------
net/sched/sch_generic.c | 2 --
3 files changed, 9 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 738cd5b13c62..32e9961570b4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,7 +117,6 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
- int owner;
unsigned long state;
unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e..0ff178399b2d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4167,10 +4167,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
- kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
- return NET_XMIT_DROP;
- }
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
@@ -4210,9 +4206,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
- WRITE_ONCE(q->owner, smp_processor_id());
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- WRITE_ONCE(q->owner, -1);
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1e008a228ebd..dfa8e8e667d2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -679,7 +679,6 @@ struct Qdisc noop_qdisc = {
.qlen = 0,
.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
},
- .owner = -1,
};
EXPORT_SYMBOL(noop_qdisc);
@@ -985,7 +984,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->enqueue = ops->enqueue;
sch->dequeue = ops->dequeue;
sch->dev_queue = dev_queue;
- sch->owner = -1;
netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
refcount_set(&sch->refcnt, 1);
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
` (3 preceding siblings ...)
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 5:42 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-14 17:19 ` [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption Eric Dumazet
` (2 subsequent siblings)
7 siblings, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Replace the state2 field with a boolean.
Move it to a hole between qstats and state so that
we shrink struct Qdisc by a full cache line.
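The layout change can be verified with pahole on a vmlinux built before
and after this patch (example command only, output omitted):
	# sizeof(struct Qdisc) should drop by one cache line
	pahole -C Qdisc vmlinux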
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sch_generic.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 32e9961570b4..31561291bc92 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,13 +41,6 @@ enum qdisc_state_t {
__QDISC_STATE_DRAINING,
};
-enum qdisc_state2_t {
- /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
- * Use qdisc_run_begin/end() or qdisc_is_running() instead.
- */
- __QDISC_STATE2_RUNNING,
-};
-
#define QDISC_STATE_MISSED BIT(__QDISC_STATE_MISSED)
#define QDISC_STATE_DRAINING BIT(__QDISC_STATE_DRAINING)
@@ -117,8 +110,8 @@ struct Qdisc {
struct qdisc_skb_head q;
struct gnet_stats_basic_sync bstats;
struct gnet_stats_queue qstats;
+ bool running; /* must be written under qdisc spinlock */
unsigned long state;
- unsigned long state2; /* must be written under qdisc spinlock */
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
@@ -167,7 +160,7 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK)
return spin_is_locked(&qdisc->seqlock);
- return test_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ return READ_ONCE(qdisc->running);
}
static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
@@ -210,7 +203,10 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
*/
return spin_trylock(&qdisc->seqlock);
}
- return !__test_and_set_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ if (READ_ONCE(qdisc->running))
+ return false;
+ WRITE_ONCE(qdisc->running, true);
+ return true;
}
static inline void qdisc_run_end(struct Qdisc *qdisc)
@@ -228,7 +224,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
&qdisc->state)))
__netif_schedule(qdisc);
} else {
- __clear_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+ WRITE_ONCE(qdisc->running, false);
}
}
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
` (4 preceding siblings ...)
2025-10-14 17:19 ` [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc Eric Dumazet
@ 2025-10-14 17:19 ` Eric Dumazet
2025-10-15 6:20 ` Kuniyuki Iwashima
2025-10-15 22:00 ` [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Jamal Hadi Salim
2025-10-16 23:50 ` patchwork-bot+netdevbpf
7 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2025-10-14 17:19 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet, Toke Høiland-Jørgensen
Remove the busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.
The idea is that only one cpu might spin on the qdisc spinlock,
while the others simply add their skb to the llist.
After this patch, we get a 300% (4x) improvement on heavy TX workloads:
- sending twice the number of packets per second,
- while consuming 50% fewer cpu cycles.
Note that this also allows, in the future, submitting batches
to the various qdisc->enqueue() methods.
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
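The @__dev_queue_xmit_ns latency histograms can be collected with a
bpftrace one-liner along these lines (shown only as an illustration of
the methodology; the exact tooling used for the numbers above is not
part of this posting):
	bpftrace -e '
	kprobe:__dev_queue_xmit { @ts[tid] = nsecs; }
	kretprobe:__dev_queue_xmit /@ts[tid]/ {
		@__dev_queue_xmit_ns = hist(nsecs - @ts[tid]);
		delete(@ts[tid]);
	}'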
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
include/net/sch_generic.h | 4 +-
net/core/dev.c | 91 ++++++++++++++++++++++++---------------
net/sched/sch_generic.c | 5 ---
3 files changed, 59 insertions(+), 41 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31561291bc92..94966692ccdf 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -115,7 +115,9 @@ struct Qdisc {
struct Qdisc *next_sched;
struct sk_buff_head skb_bad_txq;
- spinlock_t busylock ____cacheline_aligned_in_smp;
+ atomic_long_t defer_count ____cacheline_aligned_in_smp;
+ struct llist_head defer_list;
+
spinlock_t seqlock;
struct rcu_head rcu;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff178399b2d..4ac9a8262157 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
+ struct sk_buff *next, *to_free = NULL;
spinlock_t *root_lock = qdisc_lock(q);
- struct sk_buff *to_free = NULL;
- bool contended;
+ struct llist_node *ll_list, *first_n;
+ unsigned long defer_count = 0;
int rc;
qdisc_calculate_pkt_len(skb, q);
@@ -4167,61 +4168,81 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
return rc;
}
- /*
- * Heuristic to force contended enqueues to serialize on a
- * separate lock before trying to get qdisc main lock.
- * This permits qdisc->running owner to get the lock more
- * often and dequeue packets faster.
- * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
- * and then other tasks will only enqueue packets. The packets will be
- * sent after the qdisc owner is scheduled again. To prevent this
- * scenario the task always serialize on the lock.
+ /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+ * In the try_cmpxchg() loop, we want to increment q->defer_count
+ * at most once to limit the number of skbs in defer_list.
+ * We perform the defer_count increment only if the list is not empty,
+ * because some arches have slow atomic_long_inc_return().
+ */
+ first_n = READ_ONCE(q->defer_list.first);
+ do {
+ if (first_n && !defer_count) {
+ defer_count = atomic_long_inc_return(&q->defer_count);
+ if (unlikely(defer_count > q->limit)) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+ return NET_XMIT_DROP;
+ }
+ }
+ skb->ll_node.next = first_n;
+ } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
+
+ /* If defer_list was not empty, we know the cpu which queued
+ * the first skb will process the whole list for us.
*/
- contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
- if (unlikely(contended))
- spin_lock(&q->busylock);
+ if (first_n)
+ return NET_XMIT_SUCCESS;
spin_lock(root_lock);
+
+ ll_list = llist_del_all(&q->defer_list);
+ /* There is a small race because we clear defer_count not atomically
+ * with the prior llist_del_all(). This means defer_list could grow
+ * over q->limit.
+ */
+ atomic_long_set(&q->defer_count, 0);
+
+ ll_list = llist_reverse_order(ll_list);
+
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
- __qdisc_drop(skb, &to_free);
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+ __qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
- } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
- qdisc_run_begin(q)) {
+ goto unlock;
+ }
+ if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+ !llist_next(ll_list) && qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
* xmit the skb directly.
*/
+ DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
+ struct sk_buff,
+ ll_node));
qdisc_bstats_update(q, skb);
-
- if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
+ if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
__qdisc_run(q);
- }
-
qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
- rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
- if (qdisc_run_begin(q)) {
- if (unlikely(contended)) {
- spin_unlock(&q->busylock);
- contended = false;
- }
- __qdisc_run(q);
- qdisc_run_end(q);
+ int count = 0;
+
+ llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+ prefetch(next);
+ skb_mark_not_on_list(skb);
+ rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
+ count++;
}
+ qdisc_run(q);
+ if (count != 1)
+ rc = NET_XMIT_SUCCESS;
}
+unlock:
spin_unlock(root_lock);
if (unlikely(to_free))
kfree_skb_list_reason(to_free,
tcf_get_drop_reason(to_free));
- if (unlikely(contended))
- spin_unlock(&q->busylock);
return rc;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dfa8e8e667d2..d9a98d02a55f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -666,7 +666,6 @@ struct Qdisc noop_qdisc = {
.ops = &noop_qdisc_ops,
.q.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
.dev_queue = &noop_netdev_queue,
- .busylock = __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
.gso_skb = {
.next = (struct sk_buff *)&noop_qdisc.gso_skb,
.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
@@ -970,10 +969,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
}
}
- spin_lock_init(&sch->busylock);
- lockdep_set_class(&sch->busylock,
- dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
-
/* seqlock has the same scope of busylock, for NOLOCK qdisc */
spin_lock_init(&sch->seqlock);
lockdep_set_class(&sch->seqlock,
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-15 3:57 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-15 12:02 ` Alexander Lobakin
2 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 3:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions where showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection
2025-10-14 17:19 ` [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-15 5:24 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 5:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
> added code in the fast path, even when act_mirred is not used.
>
> Prepare its revert by implementing loop detection in act_mirred.
>
> Adds an array of device pointers in struct netdev_xmit.
>
> tcf_mirred_is_act_redirect() can detect if the array
> already contains the target device.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
@ 2025-10-15 5:26 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-16 12:03 ` Victor Nogueira
2 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 5:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
> and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
>
> Prior patch in this series implemented loop detection
> in act_mirred, we can remove q->owner to save some cycles
> in the fast path.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc
2025-10-14 17:19 ` [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc Eric Dumazet
@ 2025-10-15 5:42 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 5:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Replace state2 field with a boolean.
>
> Move it to a hole between qstats and state so that
> we shrink Qdisc by a full cache line.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt
2025-10-14 17:19 ` [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt Eric Dumazet
@ 2025-10-15 5:54 ` Kuniyuki Iwashima
0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 5:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet, Soham Chakradeo
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This test fails the first time I am running it after a fresh virtme-ng boot.
>
> tcp_user_timeout_user-timeout-probe.pkt:33: runtime error in write call: Expected result -1 but got 24 with errno 2 (No such file or directory)
>
> Tweaks the timings a bit, to reduce flakiness.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption
2025-10-14 17:19 ` [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-15 6:20 ` Kuniyuki Iwashima
0 siblings, 0 replies; 33+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-15 6:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Willem de Bruijn, netdev,
eric.dumazet, Toke Høiland-Jørgensen
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Remove busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> After this patch, we get a 300 % improvement on heavy TX workloads.
> - Sending twice the number of packets per second.
> - While consuming 50 % less cycles.
>
> Note that this also allows in the future to submit batches
> to various qdisc->enqueue() methods.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512) 21 | |
> [512, 1K) 631 | |
> [1K, 2K) 27328 |@ |
> [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 19055 |@ |
> [64K, 128K) 17240 |@ |
> [128K, 256K) 25633 |@ |
> [256K, 512K) 4 | |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
> 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K) 483936 |@@@@@@@@@@ |
> [1K, 2K) 265345 |@@@@@@ |
> [2K, 4K) 145463 |@@@ |
> [4K, 8K) 54571 |@ |
> [8K, 16K) 10270 | |
> [16K, 32K) 9385 | |
> [32K, 64K) 7749 | |
> [64K, 128K) 26799 | |
> [128K, 256K) 2665 | |
> [256K, 512K) 665 | |
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Thanks for the big improvement!
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-15 3:57 ` Kuniyuki Iwashima
@ 2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-15 12:02 ` Alexander Lobakin
2 siblings, 0 replies; 33+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-15 8:17 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions where showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection
2025-10-14 17:19 ` [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-15 5:24 ` Kuniyuki Iwashima
@ 2025-10-15 8:17 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 33+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-15 8:17 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
> added code in the fast path, even when act_mirred is not used.
>
> Prepare its revert by implementing loop detection in act_mirred.
>
> Adds an array of device pointers in struct netdev_xmit.
>
> tcf_mirred_is_act_redirect() can detect if the array
> already contains the target device.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-15 5:26 ` Kuniyuki Iwashima
@ 2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-16 12:03 ` Victor Nogueira
2 siblings, 0 replies; 33+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-15 8:17 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
> and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
>
> Prior patch in this series implemented loop detection
> in act_mirred, we can remove q->owner to save some cycles
> in the fast path.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc
2025-10-14 17:19 ` [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-15 5:42 ` Kuniyuki Iwashima
@ 2025-10-15 8:17 ` Toke Høiland-Jørgensen
1 sibling, 0 replies; 33+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-15 8:17 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
Eric Dumazet
Eric Dumazet <edumazet@google.com> writes:
> Replace state2 field with a boolean.
>
> Move it to a hole between qstats and state so that
> we shrink Qdisc by a full cache line.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-15 3:57 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
@ 2025-10-15 12:02 ` Alexander Lobakin
2025-10-15 12:16 ` Eric Dumazet
2 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2025-10-15 12:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 14 Oct 2025 17:19:03 +0000
> While stress testing UDP senders on a host with expensive indirect
> calls, I found cpus processing TX completions where showing
> a very high cost (20%) in sock_wfree() due to
> CONFIG_MITIGATION_RETPOLINE=y.
>
> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/core/skbuff.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index bc12790017b0..692e3a70e75e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
> skb_dst_drop(skb);
> if (skb->destructor) {
> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> - skb->destructor(skb);
> +#ifdef CONFIG_INET
> + INDIRECT_CALL_3(skb->destructor,
> + tcp_wfree, __sock_wfree, sock_wfree,
> + skb);
> +#else
> + INDIRECT_CALL_1(skb->destructor,
> + sock_wfree,
> + skb);
> +
> +#endif
Is it just me or seems like you ignored the suggestion/discussion under
v1 of this patch...
> }
> #if IS_ENABLED(CONFIG_NF_CONNTRACK)
> nf_conntrack_put(skb_nfct(skb));
Thanks,
Olek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:02 ` Alexander Lobakin
@ 2025-10-15 12:16 ` Eric Dumazet
2025-10-15 12:30 ` Alexander Lobakin
0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 12:16 UTC (permalink / raw)
To: Alexander Lobakin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 5:02 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Tue, 14 Oct 2025 17:19:03 +0000
>
> > While stress testing UDP senders on a host with expensive indirect
> > calls, I found cpus processing TX completions where showing
> > a very high cost (20%) in sock_wfree() due to
> > CONFIG_MITIGATION_RETPOLINE=y.
> >
> > Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > net/core/skbuff.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index bc12790017b0..692e3a70e75e 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
> > skb_dst_drop(skb);
> > if (skb->destructor) {
> > DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > - skb->destructor(skb);
> > +#ifdef CONFIG_INET
> > + INDIRECT_CALL_3(skb->destructor,
> > + tcp_wfree, __sock_wfree, sock_wfree,
> > + skb);
> > +#else
> > + INDIRECT_CALL_1(skb->destructor,
> > + sock_wfree,
> > + skb);
> > +
> > +#endif
>
> Is it just me or seems like you ignored the suggestion/discussion under
> v1 of this patch...
>
I did not. Please send a patch when you can demonstrate the difference.
We are not going to add all the possible destructors unless there is evidence.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:16 ` Eric Dumazet
@ 2025-10-15 12:30 ` Alexander Lobakin
2025-10-15 12:46 ` Eric Dumazet
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2025-10-15 12:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 15 Oct 2025 05:16:05 -0700
> On Wed, Oct 15, 2025 at 5:02 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Tue, 14 Oct 2025 17:19:03 +0000
>>
>>> While stress testing UDP senders on a host with expensive indirect
>>> calls, I found cpus processing TX completions where showing
>>> a very high cost (20%) in sock_wfree() due to
>>> CONFIG_MITIGATION_RETPOLINE=y.
>>>
>>> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> ---
>>> net/core/skbuff.c | 11 ++++++++++-
>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index bc12790017b0..692e3a70e75e 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
>>> skb_dst_drop(skb);
>>> if (skb->destructor) {
>>> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
>>> - skb->destructor(skb);
>>> +#ifdef CONFIG_INET
>>> + INDIRECT_CALL_3(skb->destructor,
>>> + tcp_wfree, __sock_wfree, sock_wfree,
>>> + skb);
>>> +#else
>>> + INDIRECT_CALL_1(skb->destructor,
>>> + sock_wfree,
>>> + skb);
>>> +
>>> +#endif
>>
>> Is it just me or seems like you ignored the suggestion/discussion under
>> v1 of this patch...
>>
>
> I did not. Please send a patch when you can demonstrate the difference.
You "did not", but you didn't reply there, only sent v2 w/o any mention.
>
> We are not going to add all the possible destructors unless there is evidence.
There are numbers in the original discussion, you'd have noticed if you
did read.
We only ask to add one more destructor which will help certain
perf-critical workloads. Add it to the end of the list, so that it won't
hurt your optimization.
"Send a patch" means you're now changing these lines now and then they
would be changed once again, why...
Thanks,
Olek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:30 ` Alexander Lobakin
@ 2025-10-15 12:46 ` Eric Dumazet
2025-10-15 12:49 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 12:46 UTC (permalink / raw)
To: Alexander Lobakin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 5:30 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Wed, 15 Oct 2025 05:16:05 -0700
>
> > On Wed, Oct 15, 2025 at 5:02 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> From: Eric Dumazet <edumazet@google.com>
> >> Date: Tue, 14 Oct 2025 17:19:03 +0000
> >>
> >>> While stress testing UDP senders on a host with expensive indirect
> >>> calls, I found cpus processing TX completions where showing
> >>> a very high cost (20%) in sock_wfree() due to
> >>> CONFIG_MITIGATION_RETPOLINE=y.
> >>>
> >>> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> >>>
> >>> Signed-off-by: Eric Dumazet <edumazet@google.com>
> >>> ---
> >>> net/core/skbuff.c | 11 ++++++++++-
> >>> 1 file changed, 10 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index bc12790017b0..692e3a70e75e 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
> >>> skb_dst_drop(skb);
> >>> if (skb->destructor) {
> >>> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> >>> - skb->destructor(skb);
> >>> +#ifdef CONFIG_INET
> >>> + INDIRECT_CALL_3(skb->destructor,
> >>> + tcp_wfree, __sock_wfree, sock_wfree,
> >>> + skb);
> >>> +#else
> >>> + INDIRECT_CALL_1(skb->destructor,
> >>> + sock_wfree,
> >>> + skb);
> >>> +
> >>> +#endif
> >>
> >> Is it just me or seems like you ignored the suggestion/discussion under
> >> v1 of this patch...
> >>
> >
> > I did not. Please send a patch when you can demonstrate the difference.
>
> You "did not", but you didn't reply there, only sent v2 w/o any mention.
>
> >
> > We are not going to add all the possible destructors unless there is evidence.
>
> There are numbers in the original discussion, you'd have noticed if you
> did read.
>
> We only ask to add one more destructor which will help certain
> perf-critical workloads. Add it to the end of the list, so that it won't
> hurt your optimization.
>
> "Send a patch" means you're now changing these lines now and then they
> would be changed once again, why...
I can not test what you propose.
I can drop this patch instead, and keep it in Google kernels, (we had
TCP support for years)
Or... you can send a patch on top of it later.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:46 ` Eric Dumazet
@ 2025-10-15 12:49 ` Eric Dumazet
2025-10-15 12:54 ` Alexander Lobakin
2025-10-20 7:01 ` Jason Xing
2 siblings, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 12:49 UTC (permalink / raw)
To: Alexander Lobakin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 5:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> I can not test what you propose.
>
> I can drop this patch instead, and keep it in Google kernels, (we had
> TCP support for years)
>
> Or... you can send a patch on top of it later.
To be very clear : my Signed-off-by: means that I have strong
confidence on what I tested.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:46 ` Eric Dumazet
2025-10-15 12:49 ` Eric Dumazet
@ 2025-10-15 12:54 ` Alexander Lobakin
2025-10-15 13:01 ` Eric Dumazet
2025-10-20 7:01 ` Jason Xing
2 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2025-10-15 12:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 15 Oct 2025 05:46:27 -0700
> On Wed, Oct 15, 2025 at 5:30 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Wed, 15 Oct 2025 05:16:05 -0700
>>
>>> On Wed, Oct 15, 2025 at 5:02 AM Alexander Lobakin
>>> <aleksander.lobakin@intel.com> wrote:
>>>>
>>>> From: Eric Dumazet <edumazet@google.com>
>>>> Date: Tue, 14 Oct 2025 17:19:03 +0000
>>>>
>>>>> While stress testing UDP senders on a host with expensive indirect
>>>>> calls, I found cpus processing TX completions where showing
>>>>> a very high cost (20%) in sock_wfree() due to
>>>>> CONFIG_MITIGATION_RETPOLINE=y.
>>>>>
>>>>> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
>>>>>
>>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>> ---
>>>>> net/core/skbuff.c | 11 ++++++++++-
>>>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>> index bc12790017b0..692e3a70e75e 100644
>>>>> --- a/net/core/skbuff.c
>>>>> +++ b/net/core/skbuff.c
>>>>> @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
>>>>> skb_dst_drop(skb);
>>>>> if (skb->destructor) {
>>>>> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
>>>>> - skb->destructor(skb);
>>>>> +#ifdef CONFIG_INET
>>>>> + INDIRECT_CALL_3(skb->destructor,
>>>>> + tcp_wfree, __sock_wfree, sock_wfree,
>>>>> + skb);
>>>>> +#else
>>>>> + INDIRECT_CALL_1(skb->destructor,
>>>>> + sock_wfree,
>>>>> + skb);
>>>>> +
>>>>> +#endif
>>>>
>>>> Is it just me or seems like you ignored the suggestion/discussion under
>>>> v1 of this patch...
>>>>
>>>
>>> I did not. Please send a patch when you can demonstrate the difference.
>>
>> You "did not", but you didn't reply there, only sent v2 w/o any mention.
>>
>>>
>>> We are not going to add all the possible destructors unless there is evidence.
>>
>> There are numbers in the original discussion, you'd have noticed if you
>> did read.
>>
>> We only ask to add one more destructor which will help certain
>> perf-critical workloads. Add it to the end of the list, so that it won't
>> hurt your optimization.
>>
>> "Send a patch" means you're now changing these lines now and then they
>> would be changed once again, why...
>
> I can not test what you propose.
You asked *me* to show the difference, in the orig discussion there's a
patch, there are tests and there is difference... :D
>
> I can drop this patch instead, and keep it in Google kernels, (we had
> TCP support for years)
Ok, enough, leave this one as it is, we'll send the XSk bit ourselves.
>
> Or... you can send a patch on top of it later.
Re "my Signed-off-by means I have strong confidence" -- sometimes we
also have Tested-by from other folks and it's never been a problem,
hey we're the community.
Thanks,
Olek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:54 ` Alexander Lobakin
@ 2025-10-15 13:01 ` Eric Dumazet
2025-10-15 13:08 ` Alexander Lobakin
0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 13:01 UTC (permalink / raw)
To: Alexander Lobakin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 5:54 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
> You asked *me* to show the difference, in the orig discussion there's a
> patch, there are tests and there is difference... :D
I am afraid I have not seen this.
The only thing I found was:
<quote>
Not sure, but maybe we could add generic XSk skb destructor here as
well? Or it's not that important as generic XSk is not the best way to
use XDP sockets?
Maciej, what do you think?
</quote>
No numbers.
>
> >
> > I can drop this patch instead, and keep it in Google kernels, (we had
> > TCP support for years)
>
> Ok, enough, leave this one as it is, we'll send the XSk bit ourselves.
>
> >
> > Or... you can send a patch on top of it later.
>
> Re "my Signed-off-by means I have strong confidence" -- sometimes we
> also have Tested-by from other folks and it's never been a problem,
> hey we're the community.
>
> Thanks,
> Olek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 13:01 ` Eric Dumazet
@ 2025-10-15 13:08 ` Alexander Lobakin
2025-10-15 13:11 ` Eric Dumazet
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2025-10-15 13:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 15 Oct 2025 06:01:40 -0700
> On Wed, Oct 15, 2025 at 5:54 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>
>> You asked *me* to show the difference, in the orig discussion there's a
>> patch, there are tests and there is difference... :D
>
> I am afraid I have not seen this.
>
> The only thing I found was:
>
> <quote>
> Not sure, but maybe we could add generic XSk skb destructor here as
> well? Or it's not that important as generic XSk is not the best way to
> use XDP sockets?
>
> Maciej, what do you think?
> </quote>
>
> No numbers.
From [0]:
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 9 Oct 2025 16:37:56 +0800
> On Wed, Oct 8, 2025 at 3:42 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
>>
[...]
>>> Not sure, but maybe we could add generic XSk skb destructor here as
>>> well?
>
> I added the following snippet[1] and only saw a stable ~1% improvement
> when sending 64-byte packets with xdpsock.
>
> I'm not so sure it deserves a follow-up patch to Eric's series. Better
> than nothing? Any ideas on this one?
>
> [1]
> INDIRECT_CALL_4(skb->destructor, tcp_wfree, __sock_wfree, sock_wfree,
> xsk_destruct_skb, skb);
>
>>> Or it's not that important as generic XSk is not the best way to
>>> use XDP sockets?
>
> Yes, it surely matters. At least, virtio_net and veth need this copy
> mode. And I've been working on batch xmit to ramp up the generic path.
>
>>>
>>> Maciej, what do you think?
>>
>> I would appreciate it, as there have been various attempts to optimize
>> the xsk generic xmit path.
>
> So do I!
>
> Thanks,
> Jason
[0]
https://lore.kernel.org/netdev/CAL+tcoBN9puWX-sTGvTiBN0Hg5oXKR3mjv783YXeR4Bsovuxkw@mail.gmail.com
Thanks,
Olek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 13:08 ` Alexander Lobakin
@ 2025-10-15 13:11 ` Eric Dumazet
0 siblings, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 13:11 UTC (permalink / raw)
To: Alexander Lobakin
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 6:08 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> > I added the following snippet[1] and only saw a stable ~1% improvement
> > when sending 64-byte packets with xdpsock.
1% is noise level, I am definitely not convinced.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
` (5 preceding siblings ...)
2025-10-14 17:19 ` [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-15 22:00 ` Jamal Hadi Salim
2025-10-15 22:11 ` Eric Dumazet
2025-10-16 23:50 ` patchwork-bot+netdevbpf
7 siblings, 1 reply; 33+ messages in thread
From: Jamal Hadi Salim @ 2025-10-15 22:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Tue, Oct 14, 2025 at 1:19 PM Eric Dumazet <edumazet@google.com> wrote:
>
> In this series, I replace the busylock spinlock we have in
> __dev_queue_xmit() and use lockless list (llist) to reduce
> spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> After this series, we get a 300 % (4x) improvement on heavy TX workloads,
> sending twice the number of packets per second, for half the cpu cycles.
>
Not important but I am curious: you didn't mention what NIC this was in
the commit messages ;->
For the patchset, I have done testing with the existing tdc tests and
found no regressions.
It does inspire new things when time becomes available... so I will be
doing more testing and likely small extensions, etc.
So:
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
(For the tc bits, since the majority of the code touches tc related stuff)
cheers,
jamal
> v2: deflake tcp_user_timeout_user-timeout-probe.pkt.
> Ability to return a different code than NET_XMIT_SUCCESS
> when __dev_xmit_skb() has a single skb to send.
>
> Eric Dumazet (6):
> selftests/net: packetdrill: unflake
> tcp_user_timeout_user-timeout-probe.pkt
> net: add add indirect call wrapper in skb_release_head_state()
> net/sched: act_mirred: add loop detection
> Revert "net/sched: Fix mirred deadlock on device recursion"
> net: sched: claim one cache line in Qdisc
> net: dev_queue_xmit() llist adoption
>
> include/linux/netdevice_xmit.h | 9 +-
> include/net/sch_generic.h | 23 ++---
> net/core/dev.c | 97 +++++++++++--------
> net/core/skbuff.c | 11 ++-
> net/sched/act_mirred.c | 62 +++++-------
> net/sched/sch_generic.c | 7 --
> .../tcp_user_timeout_user-timeout-probe.pkt | 6 +-
> 7 files changed, 111 insertions(+), 104 deletions(-)
>
> --
> 2.51.0.788.g6d19910ace-goog
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency
2025-10-15 22:00 ` [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Jamal Hadi Salim
@ 2025-10-15 22:11 ` Eric Dumazet
0 siblings, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2025-10-15 22:11 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
netdev, eric.dumazet
On Wed, Oct 15, 2025 at 3:00 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:19 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > In this series, I replace the busylock spinlock we have in
> > __dev_queue_xmit() and use lockless list (llist) to reduce
> > spinlock contention to the minimum.
> >
> > Idea is that only one cpu might spin on the qdisc spinlock,
> > while others simply add their skb in the llist.
> >
> > After this series, we get a 300 % (4x) improvement on heavy TX workloads,
> > sending twice the number of packets per second, for half the cpu cycles.
> >
>
> Not important but I am curious: you didn't mention what NIC this was in
> the commit messages ;->
I have used two NICs: IDPF (200Gbit) and GQ (100Gbit Google NIC),
usually with 32 TX queues, and a variety of platforms, up to 512 cores
sharing these 32 TX queues.
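For anyone trying to picture the enqueue path: it boils down to the
classic many-producer / single-consumer llist idiom. A minimal sketch
follows; the struct and function names are invented for illustration
and do not match the actual net/core/dev.c change, which also has to
deal with qdisc state, return codes and the qdisc_run() machinery.

#include <linux/llist.h>
#include <linux/spinlock.h>

struct my_pkt {
	struct llist_node ll_node;
	/* payload ... */
};

struct my_queue {
	struct llist_head pending;	/* lockless, any cpu can append */
	spinlock_t	  lock;		/* held by at most one cpu at a time */
};

/* placeholder: hand one packet to the real qdisc/driver */
static void my_deliver(struct my_queue *q, struct my_pkt *pkt)
{
}

static void my_xmit(struct my_queue *q, struct my_pkt *pkt)
{
	/*
	 * llist_add() returns true only when the list was empty: that
	 * producer becomes the single draining cpu. Every other cpu has
	 * already published its packet and returns without touching the
	 * spinlock, which is where the contention win comes from.
	 */
	if (!llist_add(&pkt->ll_node, &q->pending))
		return;

	spin_lock(&q->lock);
	for (;;) {
		struct llist_node *batch = llist_del_all(&q->pending);
		struct my_pkt *p, *tmp;

		if (!batch)
			break;
		/* llist_del_all() hands back LIFO order; restore FIFO */
		batch = llist_reverse_order(batch);
		llist_for_each_entry_safe(p, tmp, batch, ll_node)
			my_deliver(q, p);
	}
	spin_unlock(&q->lock);
}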
>
> For the patchset, I have done testing with the existing tdc tests and
> found no regressions.
> It does inspire new things when time becomes available... so I will be
> doing more testing and likely small extensions, etc.
> So:
> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Thanks Jamal!
I also have this idea:
cpus serving NIC interrupts, and specifically TX completions, are often
also trapped restarting a busy qdisc (because it was stopped by BQL
or the driver's own flow control).
We can do better:
1) In the TX completion loop, collect the skbs and do not free them immediately.
2) Store them in a private list and sum their skb->len while doing so.
3) Then call netdev_tx_completed_queue() and netif_tx_wake_queue().
If the queue was stopped, this might add the qdisc to our per-cpu
private list (sd->output_queue), raising NET_TX_SOFTIRQ (no immediate
action, because napi poll runs while BHs are blocked).
4) Then, take care of all the dev_consume_skb_any() calls (a rough
sketch of steps 1-4 follows below).
Quite often, freeing these skbs can take a lot of time because of mm
contention, other false sharing, or an expensive skb->destructor like
TCP's.
5) By the time net_tx_action() finally runs, perhaps another cpu saw
the queue being in XON state and was able to push more packets to the
queue.
This means net_tx_action() might have nothing to do, saving precious cycles.
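Roughly, for a hypothetical driver, steps 1-4 could look like the
sketch below. my_tx_ring, my_desc_done(), my_unmap() and
my_ring_has_room() are invented placeholders; netdev_tx_completed_queue(),
netif_tx_queue_stopped(), netif_tx_wake_queue() and dev_consume_skb_any()
are the existing kernel APIs. Real drivers have more bookkeeping
(stats, budget, ring indexes); only the reordering matters here.

static void my_clean_tx_irq(struct my_tx_ring *ring, struct netdev_queue *txq)
{
	struct sk_buff_head done;
	unsigned int pkts = 0, bytes = 0;
	struct sk_buff *skb;

	__skb_queue_head_init(&done);

	/* 1) + 2) collect completed skbs in a private list, do not free yet */
	while ((skb = my_desc_done(ring)) != NULL) {
		my_unmap(ring, skb);
		bytes += skb->len;
		pkts++;
		__skb_queue_tail(&done, skb);
	}

	/* 3) update BQL and wake the queue as early as possible */
	netdev_tx_completed_queue(txq, pkts, bytes);
	if (netif_tx_queue_stopped(txq) && my_ring_has_room(ring))
		netif_tx_wake_queue(txq);

	/* 4) only now pay the (often expensive) freeing cost */
	while ((skb = __skb_dequeue(&done)) != NULL)
		dev_consume_skb_any(skb);
}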
We should add extra logic in net_tx_action() to avoid grabbing the
qdisc_lock() spinlock at all.
My thinking is to add back a sequence counter (to replace the q->running
boolean), and store a snapshot of this sequence every time we restart a
queue; net_tx_action() can then compare the sequence against the last
snapshot.
> (For the tc bits, since the majority of the code touches tc related stuff)
>
> cheers,
> jamal
>
>
> > v2: deflake tcp_user_timeout_user-timeout-probe.pkt.
> > Ability to return a different code than NET_XMIT_SUCCESS
> > when __dev_xmit_skb() has a single skb to send.
> >
> > Eric Dumazet (6):
> > selftests/net: packetdrill: unflake
> > tcp_user_timeout_user-timeout-probe.pkt
> > net: add add indirect call wrapper in skb_release_head_state()
> > net/sched: act_mirred: add loop detection
> > Revert "net/sched: Fix mirred deadlock on device recursion"
> > net: sched: claim one cache line in Qdisc
> > net: dev_queue_xmit() llist adoption
> >
> > include/linux/netdevice_xmit.h | 9 +-
> > include/net/sch_generic.h | 23 ++---
> > net/core/dev.c | 97 +++++++++++--------
> > net/core/skbuff.c | 11 ++-
> > net/sched/act_mirred.c | 62 +++++-------
> > net/sched/sch_generic.c | 7 --
> > .../tcp_user_timeout_user-timeout-probe.pkt | 6 +-
> > 7 files changed, 111 insertions(+), 104 deletions(-)
> >
> > --
> > 2.51.0.788.g6d19910ace-goog
> >
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion"
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-15 5:26 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
@ 2025-10-16 12:03 ` Victor Nogueira
2 siblings, 0 replies; 33+ messages in thread
From: Victor Nogueira @ 2025-10-16 12:03 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On 14/10/2025 14:19, Eric Dumazet wrote:
> This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
> and 44180feaccf266d9b0b28cc4ceaac019817deb5c.
>
> Prior patch in this series implemented loop detection
> in act_mirred, so we can remove q->owner to save some cycles
> in the fast path.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
` (6 preceding siblings ...)
2025-10-15 22:00 ` [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Jamal Hadi Salim
@ 2025-10-16 23:50 ` patchwork-bot+netdevbpf
7 siblings, 0 replies; 33+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-10-16 23:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, horms, jhs, xiyou.wangcong, jiri, kuniyu,
willemb, netdev, eric.dumazet
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 14 Oct 2025 17:19:01 +0000 you wrote:
> In this series, I replace the busylock spinlock we have in
> __dev_queue_xmit() and use lockless list (llist) to reduce
> spinlock contention to the minimum.
>
> Idea is that only one cpu might spin on the qdisc spinlock,
> while others simply add their skb in the llist.
>
> [...]
Here is the summary with links:
- [v2,net-next,1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt
https://git.kernel.org/netdev/net-next/c/56cef47c28dc
- [v2,net-next,2/6] net: add add indirect call wrapper in skb_release_head_state()
https://git.kernel.org/netdev/net-next/c/5b2b7dec05f3
- [v2,net-next,3/6] net/sched: act_mirred: add loop detection
https://git.kernel.org/netdev/net-next/c/fe946a751d9b
- [v2,net-next,4/6] Revert "net/sched: Fix mirred deadlock on device recursion"
https://git.kernel.org/netdev/net-next/c/178ca30889a1
- [v2,net-next,5/6] net: sched: claim one cache line in Qdisc
https://git.kernel.org/netdev/net-next/c/526f5fb112f7
- [v2,net-next,6/6] net: dev_queue_xmit() llist adoption
https://git.kernel.org/netdev/net-next/c/100dfa74cad9
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-15 12:46 ` Eric Dumazet
2025-10-15 12:49 ` Eric Dumazet
2025-10-15 12:54 ` Alexander Lobakin
@ 2025-10-20 7:01 ` Jason Xing
2025-10-20 7:41 ` Eric Dumazet
2 siblings, 1 reply; 33+ messages in thread
From: Jason Xing @ 2025-10-20 7:01 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexander Lobakin, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Wed, Oct 15, 2025 at 8:46 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Oct 15, 2025 at 5:30 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Wed, 15 Oct 2025 05:16:05 -0700
> >
> > > On Wed, Oct 15, 2025 at 5:02 AM Alexander Lobakin
> > > <aleksander.lobakin@intel.com> wrote:
> > >>
> > >> From: Eric Dumazet <edumazet@google.com>
> > >> Date: Tue, 14 Oct 2025 17:19:03 +0000
> > >>
> > >>> While stress testing UDP senders on a host with expensive indirect
> > >>> calls, I found cpus processing TX completions were showing
> > >>> a very high cost (20%) in sock_wfree() due to
> > >>> CONFIG_MITIGATION_RETPOLINE=y.
> > >>>
> > >>> Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
> > >>>
> > >>> Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >>> ---
> > >>> net/core/skbuff.c | 11 ++++++++++-
> > >>> 1 file changed, 10 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > >>> index bc12790017b0..692e3a70e75e 100644
> > >>> --- a/net/core/skbuff.c
> > >>> +++ b/net/core/skbuff.c
> > >>> @@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
> > >>> skb_dst_drop(skb);
> > >>> if (skb->destructor) {
> > >>> DEBUG_NET_WARN_ON_ONCE(in_hardirq());
> > >>> - skb->destructor(skb);
> > >>> +#ifdef CONFIG_INET
> > >>> + INDIRECT_CALL_3(skb->destructor,
> > >>> + tcp_wfree, __sock_wfree, sock_wfree,
> > >>> + skb);
> > >>> +#else
> > >>> + INDIRECT_CALL_1(skb->destructor,
> > >>> + sock_wfree,
> > >>> + skb);
> > >>> +
> > >>> +#endif
> > >>
> > >> Is it just me or seems like you ignored the suggestion/discussion under
> > >> v1 of this patch...
> > >>
> > >
> > > I did not. Please send a patch when you can demonstrate the difference.
> >
> > You "did not", but you didn't reply there, only sent v2 w/o any mention.
> >
> > >
> > > We are not going to add all the possible destructors unless there is evidence.
> >
> > There are numbers in the original discussion, you'd have noticed if you
> > did read.
> >
> > We only ask to add one more destructor which will help certain
> > perf-critical workloads. Add it to the end of the list, so that it won't
> > hurt your optimization.
> >
> > "Send a patch" means you're changing these lines now, and then they
> > would be changed once again, why...
>
> I can not test what you propose.
>
> I can drop this patch instead, and keep it in Google kernels, (we had
> TCP support for years)
>
> Or... you can send a patch on top of it later.
>
Sorry, I've been away from the keyboard for a few days. I think it's
fair to let us (who are currently working on the xsk improvement) post
a simple patch based on the series.
Regarding what you mentioned about 1% being a noisy number, I disagree.
The overall numbers are improved, rather than only one or a small part
of them. I've done a few tests on different servers, so I believe
what I've seen. BTW, xdpsock is a test tool that gives stable
numbers, especially when running on a physical machine.
@ Alexander I think I can post that patch with more test numbers and
your 'suggested-by' tag included if you have no objection:) Or if you
wish you could do it on your own, please feel free to send one then :)
Thanks,
Jason
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-20 7:01 ` Jason Xing
@ 2025-10-20 7:41 ` Eric Dumazet
2025-10-20 9:14 ` Jason Xing
0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2025-10-20 7:41 UTC (permalink / raw)
To: Jason Xing
Cc: Alexander Lobakin, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 20, 2025 at 12:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
>
> Sorry, I've been away from the keyboard for a few days. I think it's
> fair to let us (who are currently working on the xsk improvement) post
> a simple patch based on the series.
>
> Regarding what you mentioned about 1% being a noisy number, I disagree.
> The overall numbers are improved, rather than only one or a small part
> of them. I've done a few tests on different servers, so I believe
> what I've seen. BTW, xdpsock is a test tool that gives stable
> numbers, especially when running on a physical machine.
>
> @ Alexander I think I can post that patch with more test numbers and
> your 'suggested-by' tag included if you have no objection:) Or if you
> wish you could do it on your own, please feel free to send one then :)
The series focus was on something bringing 100% improvement.
The 1% figure _was_ noise.
I think you are mistaken about what a "Signed-off-by: Eric Dumazet
<edumazet@google.com>" means.
I am not opposed to a patch that you will support by yourself.
I am opposed to you trying to let me take responsibility for something
I have no time/desire to support.
I added the indirect call wrapper mostly at the last moment, so that
anyone wanting to test my series
like I described (UDP workload) would not have to mention the sock_wfree cost.
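For reference, the wrapper is just a chain of pointer comparisons.
With CONFIG_MITIGATION_RETPOLINE=y, INDIRECT_CALL_3(skb->destructor,
tcp_wfree, __sock_wfree, sock_wfree, skb) expands to roughly this
(a sketch of the idea, not the exact output of
include/linux/indirect_call_wrapper.h):

	if (skb->destructor == tcp_wfree)
		tcp_wfree(skb);
	else if (skb->destructor == __sock_wfree)
		__sock_wfree(skb);
	else if (skb->destructor == sock_wfree)
		sock_wfree(skb);
	else
		skb->destructor(skb);	/* retpoline-protected indirect call */

Without retpolines the macro degenerates to the plain indirect call,
so extra candidates only cost their compares on retpoline kernels;
that is also why each additional destructor has to earn its place
with numbers.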
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state()
2025-10-20 7:41 ` Eric Dumazet
@ 2025-10-20 9:14 ` Jason Xing
0 siblings, 0 replies; 33+ messages in thread
From: Jason Xing @ 2025-10-20 9:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexander Lobakin, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Mon, Oct 20, 2025 at 3:41 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 20, 2025 at 12:02 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
>
> >
> > Sorry, I've been away from the keyboard for a few days. I think it's
> > fair to let us (who are currently working on the xsk improvement) post
> > a simple patch based on the series.
> >
> > Regarding what you mentioned about 1% being a noisy number, I disagree.
> > The overall numbers are improved, rather than only one or a small part
> > of them. I've done a few tests on different servers, so I believe
> > what I've seen. BTW, xdpsock is a test tool that gives stable
> > numbers, especially when running on a physical machine.
> >
> > @ Alexander I think I can post that patch with more test numbers and
> > your 'suggested-by' tag included if you have no objection:) Or if you
> > wish you could do it on your own, please feel free to send one then :)
>
> The series focus was on something bringing 100% improvement.
> The 1% figure _was_ noise.
>
> I think you are mistaken about what a "Signed-off-by: Eric Dumazet
> <edumazet@google.com>" means.
>
> I am not opposed to a patch that you will support by yourself.
> I am opposed to you trying to let me take responsibility for something
> I have no time/desire to support.
>
> I added the indirect call wrapper mostly at the last moment, so that
> anyone wanting to test my series
> like I described (UDP workload) would not have to mention the sock_wfree cost.
Eric, I totally understand what you meant and thanks for bringing up
this idea :) Agree that one patch should do one thing at a time. It's
clean. If people smell something wrong in the future, they can easily
bisect and revert. So what I replied was that I decided to add a
follow-up patch to only support the xsk scenario.
Thanks,
Jason
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2025-10-20 9:15 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-14 17:19 [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-14 17:19 ` [PATCH v2 net-next 1/6] selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt Eric Dumazet
2025-10-15 5:54 ` Kuniyuki Iwashima
2025-10-14 17:19 ` [PATCH v2 net-next 2/6] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-15 3:57 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-15 12:02 ` Alexander Lobakin
2025-10-15 12:16 ` Eric Dumazet
2025-10-15 12:30 ` Alexander Lobakin
2025-10-15 12:46 ` Eric Dumazet
2025-10-15 12:49 ` Eric Dumazet
2025-10-15 12:54 ` Alexander Lobakin
2025-10-15 13:01 ` Eric Dumazet
2025-10-15 13:08 ` Alexander Lobakin
2025-10-15 13:11 ` Eric Dumazet
2025-10-20 7:01 ` Jason Xing
2025-10-20 7:41 ` Eric Dumazet
2025-10-20 9:14 ` Jason Xing
2025-10-14 17:19 ` [PATCH v2 net-next 3/6] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-15 5:24 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-14 17:19 ` [PATCH v2 net-next 4/6] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-15 5:26 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-16 12:03 ` Victor Nogueira
2025-10-14 17:19 ` [PATCH v2 net-next 5/6] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-15 5:42 ` Kuniyuki Iwashima
2025-10-15 8:17 ` Toke Høiland-Jørgensen
2025-10-14 17:19 ` [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-10-15 6:20 ` Kuniyuki Iwashima
2025-10-15 22:00 ` [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency Jamal Hadi Salim
2025-10-15 22:11 ` Eric Dumazet
2025-10-16 23:50 ` patchwork-bot+netdevbpf