netdev.vger.kernel.org archive mirror
* [PATCH RFC net-next 0/4] net: deal with strange attractors tx queues
@ 2025-10-08 10:46 Eric Dumazet
  2025-10-08 10:46 ` [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Back in 2010, Tom Herbert added skb->ooo_okay to TCP flows.

Extend the feature to connected flows for other protocols like UDP.

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue or NIC queue,
especially if TX completion is delayed because of a stressed cpu.

The so-called "strange attractors" have caused many performance
issues; we need to do better now that TCP copes well with
potential reorders.

Add a new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this series, we no longer have to make sure threads
are pinned to cpus; they can now be migrated without
adding too much spinlock/qdisc/TX completion pressure.
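
For context, the queue selection logic this series plugs into, in
simplified form (a sketch of netdev_pick_tx(), not a verbatim copy):

u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
		   struct net_device *sb_dev)
{
	struct sock *sk = skb->sk;
	/* -1 when no queue is cached (or, after patch 4/4, when the
	 * cached mapping is older than txq_reselection_ms).
	 */
	int queue_index = sk_tx_queue_get(sk);

	sb_dev = sb_dev ? : dev;

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues) {
		int new_index = get_xps_queue(dev, sb_dev, skb);

		if (new_index < 0)
			new_index = skb_tx_hash(dev, sb_dev, skb);

		if (sk && sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

		queue_index = new_index;
	}
	return queue_index;
}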

Eric Dumazet (4):
  net: add SK_WMEM_ALLOC_BIAS constant
  net: control skb->ooo_okay from skb_set_owner_w()
  net: add /proc/sys/net/core/txq_reselection_ms control
  net: allow busy connected flows to switch tx queues

 Documentation/admin-guide/sysctl/net.rst | 15 +++++++++
 include/net/netns/core.h                 |  1 +
 include/net/sock.h                       | 43 ++++++++++++------------
 net/atm/common.c                         |  2 +-
 net/core/dev.c                           | 27 +++++++++++++--
 net/core/net_namespace.c                 |  1 +
 net/core/sock.c                          | 17 +++++++---
 net/core/sysctl_net_core.c               |  7 ++++
 8 files changed, 84 insertions(+), 29 deletions(-)

-- 
2.51.0.710.ga91ca5db03-goog



* [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant
  2025-10-08 10:46 [PATCH RFC net-next 0/4] net: deal with strange attractors tx queues Eric Dumazet
@ 2025-10-08 10:46 ` Eric Dumazet
  2025-10-08 14:25   ` Neal Cardwell
  2025-10-08 10:46 ` [PATCH RFC net-next 2/4] net: control skb->ooo_okay from skb_set_owner_w() Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
accounts for this initial value.

Add a SK_WMEM_ALLOC_BIAS define to avoid spreading this magic value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
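Note for reviewers: a minimal sketch of the convention behind the bias
(illustrative only, not the exact kernel code):

	/* sk_wmem_alloc starts at SK_WMEM_ALLOC_BIAS so that it cannot
	 * reach zero while the socket itself is still alive.
	 */
	refcount_set(&sk->sk_wmem_alloc, SK_WMEM_ALLOC_BIAS);

	/* Each in-flight TX skb charges its truesize... */
	refcount_add(skb->truesize, &sk->sk_wmem_alloc);

	/* ...and TX completion releases it; the socket is freed only when
	 * the last reference, the bias itself, is dropped (see sk_free()).
	 */
	if (refcount_dec_and_test(&sk->sk_wmem_alloc))
		__sk_free(sk);
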
 include/net/sock.h | 3 ++-
 net/atm/common.c   | 2 +-
 net/core/sock.c    | 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 60bcb13f045c3144609908a36960528b33e4f71c..2794bc5c565424491a064049d3d76c3fb7ba1ed8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2303,6 +2303,7 @@ static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *fro
 	return 0;
 }
 
+#define SK_WMEM_ALLOC_BIAS 1
 /**
  * sk_wmem_alloc_get - returns write allocations
  * @sk: socket
@@ -2311,7 +2312,7 @@ static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *fro
  */
 static inline int sk_wmem_alloc_get(const struct sock *sk)
 {
-	return refcount_read(&sk->sk_wmem_alloc) - 1;
+	return refcount_read(&sk->sk_wmem_alloc) - SK_WMEM_ALLOC_BIAS;
 }
 
 /**
diff --git a/net/atm/common.c b/net/atm/common.c
index 881c7f259dbd46be35d71e558a73eb2f26223963..cecc71a8bee1176518a2d63b4613dbd243543695 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -157,7 +157,7 @@ int vcc_create(struct net *net, struct socket *sock, int protocol, int family, i
 	memset(&vcc->local, 0, sizeof(struct sockaddr_atmsvc));
 	memset(&vcc->remote, 0, sizeof(struct sockaddr_atmsvc));
 	vcc->qos.txtp.max_sdu = 1 << 16; /* for meta VCs */
-	refcount_set(&sk->sk_wmem_alloc, 1);
+	refcount_set(&sk->sk_wmem_alloc, SK_WMEM_ALLOC_BIAS);
 	atomic_set(&sk->sk_rmem_alloc, 0);
 	vcc->push = NULL;
 	vcc->pop = NULL;
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a2a68aee84eb9a153b2c3970f6b32..542cfa16ee125f6c8487237c9040695d42794087 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2313,7 +2313,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		}
 
 		sock_net_set(sk, net);
-		refcount_set(&sk->sk_wmem_alloc, 1);
+		refcount_set(&sk->sk_wmem_alloc, SK_WMEM_ALLOC_BIAS);
 
 		mem_cgroup_sk_alloc(sk);
 		cgroup_sk_alloc(&sk->sk_cgrp_data);
@@ -2494,8 +2494,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 
 	atomic_set(&newsk->sk_rmem_alloc, 0);
 
-	/* sk_wmem_alloc set to one (see sk_free() and sock_wfree()) */
-	refcount_set(&newsk->sk_wmem_alloc, 1);
+	refcount_set(&newsk->sk_wmem_alloc, SK_WMEM_ALLOC_BIAS);
 
 	atomic_set(&newsk->sk_omem_alloc, 0);
 	sk_init_common(newsk);
-- 
2.51.0.710.ga91ca5db03-goog



* [PATCH RFC net-next 2/4] net: control skb->ooo_okay from skb_set_owner_w()
  2025-10-08 10:46 [PATCH RFC net-next 0/4] net: deal with strange attractors tx queues Eric Dumazet
  2025-10-08 10:46 ` [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant Eric Dumazet
@ 2025-10-08 10:46 ` Eric Dumazet
  2025-10-08 14:26   ` Neal Cardwell
  2025-10-08 10:46 ` [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control Eric Dumazet
  2025-10-08 10:46 ` [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues Eric Dumazet
  3 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

15 years after Tom Herbert added skb->ooo_okay, only the TCP
transport benefits from it.

We can support other transports directly from skb_set_owner_w().

If no other TX packet for this socket is in a host queue (qdisc, NIC queue),
there is no risk of self-inflicted reordering, so we can set skb->ooo_okay.

This allows netdev_pick_tx() to choose a TX queue based on XPS settings,
instead of reusing the queue chosen at the time the first packet was sent
for connected sockets.

Tested:
  500 concurrent UDP_RR connected UDP flows, host with 32 TX queues, XPS setup.

  super_netperf 500 -t UDP_RR -H <host> -l 1000 -- -r 100,100 -Nn &

This patch saves between 10% and 20% of cycles, depending on how
the process scheduler migrates threads among cpus.

Using the following bpftrace script, we can see that Qdisc/NIC tx queues
are better used (fewer cache line misses).

bpftrace -e '
k:__dev_queue_xmit { @start[cpu] = nsecs; }
kr:__dev_queue_xmit {
 if (@start[cpu]) {
    $delay = nsecs - @start[cpu];
    delete(@start[cpu]);
    @__dev_queue_xmit_ns = hist($delay);
 }
}
END { clear(@start); }'

Before:
@__dev_queue_xmit_ns:
[128, 256)             6 |                                                    |
[256, 512)        116283 |                                                    |
[512, 1K)        1888205 |@@@@@@@@@@@                                         |
[1K, 2K)         8106167 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[2K, 4K)         8699293 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)         2600676 |@@@@@@@@@@@@@@@                                     |
[8K, 16K)         721688 |@@@@                                                |
[16K, 32K)        122995 |                                                    |
[32K, 64K)         10639 |                                                    |
[64K, 128K)          119 |                                                    |
[128K, 256K)           1 |                                                    |

After:
@__dev_queue_xmit_ns:
[128, 256)             3 |                                                    |
[256, 512)        651112 |@@                                                  |
[512, 1K)        8109938 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[1K, 2K)        16081031 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)         2411692 |@@@@@@@                                             |
[4K, 8K)           98994 |                                                    |
[8K, 16K)           1536 |                                                    |
[16K, 32K)           587 |                                                    |
[32K, 64K)             2 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
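Note for reviewers: __refcount_add() returns the pre-add value, so the
new test is equivalent to "sk_wmem_alloc_get(sk) == 0", but sampled
atomically with the charge itself. A condensed sketch (the exact code
is in the hunk below):

	/* old_wmem is sk->sk_wmem_alloc *before* this skb is charged;
	 * it equals SK_WMEM_ALLOC_BIAS only if no other TX skb of this
	 * socket is still sitting in a host queue.
	 */
	__refcount_add(skb->truesize, &sk->sk_wmem_alloc, &old_wmem);
	skb->ooo_okay = (old_wmem == SK_WMEM_ALLOC_BIAS);
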
 net/core/sock.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 542cfa16ee125f6c8487237c9040695d42794087..08ae20069b6d287745800710192396f76c8781b4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2694,6 +2694,8 @@ void __sock_wfree(struct sk_buff *skb)
 
 void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
+	int old_wmem;
+
 	skb_orphan(skb);
 #ifdef CONFIG_INET
 	if (unlikely(!sk_fullsock(sk)))
@@ -2707,7 +2709,15 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	 * is enough to guarantee sk_free() won't free this sock until
 	 * all in-flight packets are completed
 	 */
-	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
+	__refcount_add(skb->truesize, &sk->sk_wmem_alloc, &old_wmem);
+
+	/* (old_wmem == SK_WMEM_ALLOC_BIAS) if no other TX packet for this socket
+	 * is in a host queue (qdisc, NIC queue).
+	 * Set skb->ooo_okay so that netdev_pick_tx() can choose a TX queue
+	 * based on XPS for better performance.
+	 * Otherwise clear ooo_okay to not risk Out Of Order delivery.
+	 */
+	skb->ooo_okay = (old_wmem == SK_WMEM_ALLOC_BIAS);
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
-- 
2.51.0.710.ga91ca5db03-goog



* [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control
  2025-10-08 10:46 [PATCH RFC net-next 0/4] net: deal with strange attractors tx queues Eric Dumazet
  2025-10-08 10:46 ` [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant Eric Dumazet
  2025-10-08 10:46 ` [PATCH RFC net-next 2/4] net: control skb->ooo_okay from skb_set_owner_w() Eric Dumazet
@ 2025-10-08 10:46 ` Eric Dumazet
  2025-10-08 14:27   ` Neal Cardwell
  2025-10-08 15:21   ` Paolo Abeni
  2025-10-08 10:46 ` [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues Eric Dumazet
  3 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Add a new sysctl to control how often a queue reselection
can happen even if a flow has a persistent queue of skbs
in a Qdisc or NIC queue.

This sysctl is used in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
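Note for reviewers: the value is stored in jiffies internally;
proc_dointvec_ms_jiffies() converts from/to milliseconds at the /proc
boundary, so consumers can compare it directly against jiffies
timestamps. A minimal sketch of the intended consumer (illustrative;
the real one lands in the next patch):

	/* Illustrative only: has the cached txq mapping expired? */
	static bool txq_mapping_stale(const struct net *net,
				      unsigned long mapping_jiffies)
	{
		unsigned long period = READ_ONCE(net->core.sysctl_txq_reselection);

		return time_is_before_jiffies(mapping_jiffies + period);
	}
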
 Documentation/admin-guide/sysctl/net.rst | 15 +++++++++++++++
 include/net/netns/core.h                 |  1 +
 net/core/net_namespace.c                 |  1 +
 net/core/sysctl_net_core.c               |  7 +++++++
 4 files changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff16b01baf32f5ded9b75a6e699b184..6cfc2655dd489aa4a9233c09b1ab4ebe0bc76b0c 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -406,6 +406,21 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
 If set to 1 (default), hash rethink is performed on listening socket.
 If set to 0, hash rethink is not performed.
 
+txq_reselection_ms
+------------------
+
+Controls how often (in ms) a busy connected flow can select another tx queue.
+
+A reselection is desirable when/if the user thread has migrated and XPS
+would select a different queue. The same can occur without XPS
+if the flow hash has changed.
+
+But switching txq can introduce reorders, especially if the
+old queue is under high pressure. Modern TCP stacks deal
+well with reorders if they do not happen too often.
+
+Default : 1000
+
 gro_normal_batch
 ----------------
 
diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 9b36f0ff0c200c9cf89753f2c78a3ee0fe5256b7..cb9c3e4cd7385016de3ac87dac65411d54bd093b 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -13,6 +13,7 @@ struct netns_core {
 	struct ctl_table_header	*sysctl_hdr;
 
 	int	sysctl_somaxconn;
+	int	sysctl_txq_reselection;
 	int	sysctl_optmem_max;
 	u8	sysctl_txrehash;
 	u8	sysctl_tstamp_allow_data;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b0e0f22d7b213c522c2b83a5fcbcabb43e72cd11..adcfef55a66f1691cb76d954af32334e532864bb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -395,6 +395,7 @@ static __net_init void preinit_net_sysctl(struct net *net)
 	net->core.sysctl_optmem_max = 128 * 1024;
 	net->core.sysctl_txrehash = SOCK_TXREHASH_ENABLED;
 	net->core.sysctl_tstamp_allow_data = 1;
+	net->core.sysctl_txq_reselection = msecs_to_jiffies(1000);
 }
 
 /* init code that must occur even if setup_net() is not called. */
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1e0bf61ad4ac219ab4eccf638658a..f79137826d7f9d653e2e22d8f42e23bec08e083c 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -667,6 +667,13 @@ static struct ctl_table netns_core_table[] = {
 		.extra2		= SYSCTL_ONE,
 		.proc_handler	= proc_dou8vec_minmax,
 	},
+	{
+		.procname	= "txq_reselection_ms",
+		.data		= &init_net.core.sysctl_txq_reselection,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+	},
 	{
 		.procname	= "tstamp_allow_data",
 		.data		= &init_net.core.sysctl_tstamp_allow_data,
-- 
2.51.0.710.ga91ca5db03-goog



* [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 10:46 [PATCH RFC net-next 0/4] net: deal with strange attractors tx queues Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-10-08 10:46 ` [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control Eric Dumazet
@ 2025-10-08 10:46 ` Eric Dumazet
  2025-10-08 13:38   ` Willem de Bruijn
                     ` (2 more replies)
  3 siblings, 3 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 10:46 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

This is a followup to commit 726e9e8b94b9 ("tcp: refine
skb->ooo_okay setting") and to the prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()").

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue or NIC queue,
especially if TX completion is delayed because of a stressed cpu.

The so-called "strange attractors" have caused many performance
issues; we need to do better.

We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.

Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this patch, we no longer have to make sure threads
are pinned to cpus; they can now be migrated without
adding too much spinlock/qdisc/TX completion pressure.

The TX completion part was problematic because it added false sharing
on various socket fields, but also false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is unfortunately not an option.

Note for later: move sk->sk_tx_queue_mapping closer
to sk_tx_queue_mapping_jiffies for better cache locality.

Tested:

Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup :

echo ff >/sys/class/net/eth1/queues/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queues/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queues/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queues/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queues/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queues/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queues/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queues/tx-7/xps_cpus
...

Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15

taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used as instructed by XPS :
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 123489410b 1890p
 backlog 69809026b 1064p
 backlog 52401054b 805p

Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121

C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 130508314b 1916p
 backlog 8584380b 126p
 backlog 8584380b 126p
 backlog 8379990b 123p
 backlog 8584380b 126p
 backlog 8487484b 125p
 backlog 8584380b 126p
 backlog 8448120b 124p
 backlog 8584380b 126p
 backlog 8720640b 128p
 backlog 8856900b 130p
 backlog 8584380b 126p
 backlog 8652510b 127p
 backlog 8448120b 124p
 backlog 8516250b 125p
 backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
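Note for reviewers: in sk_tx_queue_set(), when XPS keeps re-picking the
same queue, the timestamp is refreshed at most once per HZ so that this
cache line is not dirtied on every packet; in sk_tx_queue_get(), an
over-age mapping is treated as "no mapping", which makes
netdev_pick_tx() re-run the XPS selection. An annotated restatement of
the sk_tx_queue_set() hunk (comments added for illustration):

	if (READ_ONCE(sk->sk_tx_queue_mapping) != tx_queue) {
		/* New queue: record it and start a fresh age. */
		WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
		return;
	}

	/* Same queue re-picked: refresh the age at most once per HZ. */
	if (time_is_before_jiffies(READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + HZ))
		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
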
 include/net/sock.h | 40 +++++++++++++++++++---------------------
 net/core/dev.c     | 27 +++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2794bc5c565424491a064049d3d76c3fb7ba1ed8..61f92bb03e00d7167cccfe70da16174f2b40f6de 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -485,6 +485,7 @@ struct sock {
 	unsigned long		sk_pacing_rate; /* bytes per second */
 	atomic_t		sk_zckey;
 	atomic_t		sk_tskey;
+	unsigned long		sk_tx_queue_mapping_jiffies;
 	__cacheline_group_end(sock_write_tx);
 
 	__cacheline_group_begin(sock_read_tx);
@@ -1984,6 +1985,14 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 	return __sk_receive_skb(sk, skb, nested, 1, true);
 }
 
+/* This helper checks if a socket is a full socket,
+ * ie _not_ a timewait or request socket.
+ */
+static inline bool sk_fullsock(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
+}
+
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
 	/* sk_tx_queue_mapping accept only upto a 16-bit value */
@@ -1992,7 +2001,15 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 	/* Paired with READ_ONCE() in sk_tx_queue_get() and
 	 * other WRITE_ONCE() because socket lock might be not held.
 	 */
-	WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
+	if (READ_ONCE(sk->sk_tx_queue_mapping) != tx_queue) {
+		WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
+		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
+		return;
+	}
+
+	/* Refresh sk_tx_queue_mapping_jiffies if too old. */
+	if (time_is_before_jiffies(READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + HZ))
+		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
 }
 
 #define NO_QUEUE_MAPPING	USHRT_MAX
@@ -2005,19 +2022,7 @@ static inline void sk_tx_queue_clear(struct sock *sk)
 	WRITE_ONCE(sk->sk_tx_queue_mapping, NO_QUEUE_MAPPING);
 }
 
-static inline int sk_tx_queue_get(const struct sock *sk)
-{
-	if (sk) {
-		/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
-		 * and sk_tx_queue_set().
-		 */
-		int val = READ_ONCE(sk->sk_tx_queue_mapping);
-
-		if (val != NO_QUEUE_MAPPING)
-			return val;
-	}
-	return -1;
-}
+int sk_tx_queue_get(const struct sock *sk);
 
 static inline void __sk_rx_queue_set(struct sock *sk,
 				     const struct sk_buff *skb,
@@ -2945,13 +2950,6 @@ skb_sk_is_prefetched(struct sk_buff *skb)
 #endif /* CONFIG_INET */
 }
 
-/* This helper checks if a socket is a full socket,
- * ie _not_ a timewait or request socket.
- */
-static inline bool sk_fullsock(const struct sock *sk)
-{
-	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
-}
 
 static inline bool
 sk_is_refcounted(struct sock *sk)
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e98ee87776e6f8d3ca3d98f0711b3..c302fd5bb57894c6e5651b7adc8d033ac719070a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4591,6 +4591,30 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(dev_pick_tx_zero);
 
+int sk_tx_queue_get(const struct sock *sk)
+{
+	int val;
+
+	if (!sk)
+		return -1;
+	/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
+	 * and sk_tx_queue_set().
+	 */
+	val = READ_ONCE(sk->sk_tx_queue_mapping);
+
+	if (val == NO_QUEUE_MAPPING)
+		return -1;
+
+	if (sk_fullsock(sk) &&
+	    time_is_before_jiffies(
+			READ_ONCE(sk->sk_tx_queue_mapping_jiffies) +
+			READ_ONCE(sock_net(sk)->core.sysctl_txq_reselection)))
+		return -1;
+
+	return val;
+}
+EXPORT_SYMBOL(sk_tx_queue_get);
+
 u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 		     struct net_device *sb_dev)
 {
@@ -4606,8 +4630,7 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 		if (new_index < 0)
 			new_index = skb_tx_hash(dev, sb_dev, skb);
 
-		if (queue_index != new_index && sk &&
-		    sk_fullsock(sk) &&
+		if (sk && sk_fullsock(sk) &&
 		    rcu_access_pointer(sk->sk_dst_cache))
 			sk_tx_queue_set(sk, new_index);
 
-- 
2.51.0.710.ga91ca5db03-goog



* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 10:46 ` [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues Eric Dumazet
@ 2025-10-08 13:38   ` Willem de Bruijn
  2025-10-08 14:38     ` Eric Dumazet
  2025-10-08 15:30   ` Paolo Abeni
  2025-10-13  9:18   ` Simon Horman
  2 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2025-10-08 13:38 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Eric Dumazet wrote:
> This is a followup to commit 726e9e8b94b9 ("tcp: refine
> skb->ooo_okay setting") and to the prior commit in this series
> ("net: control skb->ooo_okay from skb_set_owner_w()").
> 
> skb->ooo_okay might never be set for bulk flows that always
> have at least one skb in a qdisc queue or NIC queue,
> especially if TX completion is delayed because of a stressed cpu.
> 
> The so-called "strange attractors" have caused many performance
> issues; we need to do better.
> 
> We have tried very hard to avoid reorders because TCP was
> not dealing with them nicely a decade ago.
> 
> Use the new net.core.txq_reselection_ms sysctl to let
> flows follow XPS and select a more efficient queue.
> 
> After this patch, we no longer have to make sure threads
> are pinned to cpus; they can now be migrated without
> adding too much spinlock/qdisc/TX completion pressure.
> 
> The TX completion part was problematic because it added false sharing
> on various socket fields, but also false sharing and spinlock
> contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
> is unfortunately not an option.
> 
> Note for later: move sk->sk_tx_queue_mapping closer
> to sk_tx_queue_mapping_jiffies for better cache locality.
> 
> Tested:
> 
> Used a host with 32 TX queues, shared by groups of 8 cores.
> XPS setup :
> 
> echo ff >/sys/class/net/eth1/queues/tx-0/xps_cpus
> echo ff00 >/sys/class/net/eth1/queues/tx-1/xps_cpus
> echo ff0000 >/sys/class/net/eth1/queues/tx-2/xps_cpus
> echo ff000000 >/sys/class/net/eth1/queues/tx-3/xps_cpus
> echo ff,00000000 >/sys/class/net/eth1/queues/tx-4/xps_cpus
> echo ff00,00000000 >/sys/class/net/eth1/queues/tx-5/xps_cpus
> echo ff0000,00000000 >/sys/class/net/eth1/queues/tx-6/xps_cpus
> echo ff000000,00000000 >/sys/class/net/eth1/queues/tx-7/xps_cpus
> ...
> 
> Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15
> 
> taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host
> 
> Checked that only queues 0 and 1 are used as instructed by XPS :
> tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
>  backlog 123489410b 1890p
>  backlog 69809026b 1064p
>  backlog 52401054b 805p

Just in case it's not clear to all readers, this implies MQ + some
per queue qdisc, right? Here MQ + FQ.
 
> Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
> 
> C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done
> 
> Set txq_reselection_ms to 1000
> echo 1000 > /proc/sys/net/core/txq_reselection_ms

Just curious: is the once per second heuristic based on anything? Is
it likely to set another (more aggressive) value here?

Series looks great to me, btw, thanks.

> 
> Check that the flows have migrated nicely:
> 
> tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
>  backlog 130508314b 1916p
>  backlog 8584380b 126p
>  backlog 8584380b 126p
>  backlog 8379990b 123p
>  backlog 8584380b 126p
>  backlog 8487484b 125p
>  backlog 8584380b 126p
>  backlog 8448120b 124p
>  backlog 8584380b 126p
>  backlog 8720640b 128p
>  backlog 8856900b 130p
>  backlog 8584380b 126p
>  backlog 8652510b 127p
>  backlog 8448120b 124p
>  backlog 8516250b 125p
>  backlog 7834950b 115p
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/sock.h | 40 +++++++++++++++++++---------------------
>  net/core/dev.c     | 27 +++++++++++++++++++++++++--
>  2 files changed, 44 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2794bc5c565424491a064049d3d76c3fb7ba1ed8..61f92bb03e00d7167cccfe70da16174f2b40f6de 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -485,6 +485,7 @@ struct sock {
>  	unsigned long		sk_pacing_rate; /* bytes per second */
>  	atomic_t		sk_zckey;
>  	atomic_t		sk_tskey;
> +	unsigned long		sk_tx_queue_mapping_jiffies;
>  	__cacheline_group_end(sock_write_tx);
>  
>  	__cacheline_group_begin(sock_read_tx);
> @@ -1984,6 +1985,14 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>  	return __sk_receive_skb(sk, skb, nested, 1, true);
>  }
>  
> +/* This helper checks if a socket is a full socket,
> + * ie _not_ a timewait or request socket.
> + */
> +static inline bool sk_fullsock(const struct sock *sk)
> +{
> +	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> +}
> +
>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>  {
>  	/* sk_tx_queue_mapping accept only upto a 16-bit value */
> @@ -1992,7 +2001,15 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>  	/* Paired with READ_ONCE() in sk_tx_queue_get() and
>  	 * other WRITE_ONCE() because socket lock might be not held.
>  	 */
> -	WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
> +	if (READ_ONCE(sk->sk_tx_queue_mapping) != tx_queue) {
> +		WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
> +		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
> +		return;
> +	}
> +
> +	/* Refresh sk_tx_queue_mapping_jiffies if too old. */
> +	if (time_is_before_jiffies(READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + HZ))
> +		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
>  }
>  
>  #define NO_QUEUE_MAPPING	USHRT_MAX
> @@ -2005,19 +2022,7 @@ static inline void sk_tx_queue_clear(struct sock *sk)
>  	WRITE_ONCE(sk->sk_tx_queue_mapping, NO_QUEUE_MAPPING);
>  }
>  
> -static inline int sk_tx_queue_get(const struct sock *sk)
> -{
> -	if (sk) {
> -		/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
> -		 * and sk_tx_queue_set().
> -		 */
> -		int val = READ_ONCE(sk->sk_tx_queue_mapping);
> -
> -		if (val != NO_QUEUE_MAPPING)
> -			return val;
> -	}
> -	return -1;
> -}
> +int sk_tx_queue_get(const struct sock *sk);
>  
>  static inline void __sk_rx_queue_set(struct sock *sk,
>  				     const struct sk_buff *skb,
> @@ -2945,13 +2950,6 @@ skb_sk_is_prefetched(struct sk_buff *skb)
>  #endif /* CONFIG_INET */
>  }
>  
> -/* This helper checks if a socket is a full socket,
> - * ie _not_ a timewait or request socket.
> - */
> -static inline bool sk_fullsock(const struct sock *sk)
> -{
> -	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> -}
>  
>  static inline bool
>  sk_is_refcounted(struct sock *sk)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a64cef2c537e98ee87776e6f8d3ca3d98f0711b3..c302fd5bb57894c6e5651b7adc8d033ac719070a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4591,6 +4591,30 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
>  }
>  EXPORT_SYMBOL(dev_pick_tx_zero);
>  
> +int sk_tx_queue_get(const struct sock *sk)
> +{
> +	int val;
> +
> +	if (!sk)
> +		return -1;
> +	/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
> +	 * and sk_tx_queue_set().
> +	 */
> +	val = READ_ONCE(sk->sk_tx_queue_mapping);
> +
> +	if (val == NO_QUEUE_MAPPING)
> +		return -1;
> +
> +	if (sk_fullsock(sk) &&
> +	    time_is_before_jiffies(
> +			READ_ONCE(sk->sk_tx_queue_mapping_jiffies) +
> +			READ_ONCE(sock_net(sk)->core.sysctl_txq_reselection)))
> +		return -1;
> +
> +	return val;
> +}
> +EXPORT_SYMBOL(sk_tx_queue_get);
> +
>  u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
>  		     struct net_device *sb_dev)
>  {
> @@ -4606,8 +4630,7 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
>  		if (new_index < 0)
>  			new_index = skb_tx_hash(dev, sb_dev, skb);
>  
> -		if (queue_index != new_index && sk &&
> -		    sk_fullsock(sk) &&
> +		if (sk && sk_fullsock(sk) &&
>  		    rcu_access_pointer(sk->sk_dst_cache))
>  			sk_tx_queue_set(sk, new_index);
>  
> -- 
> 2.51.0.710.ga91ca5db03-goog
> 




* Re: [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant
  2025-10-08 10:46 ` [PATCH RFC net-next 1/4] net: add SK_WMEM_ALLOC_BIAS constant Eric Dumazet
@ 2025-10-08 14:25   ` Neal Cardwell
  0 siblings, 0 replies; 16+ messages in thread
From: Neal Cardwell @ 2025-10-08 14:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 6:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
> accounts for this initial value.
>
> Add a SK_WMEM_ALLOC_BIAS define to avoid spreading this magic value.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Neal Cardwell <ncardwell@google.com>

Nice! Thanks!

neal


* Re: [PATCH RFC net-next 2/4] net: control skb->ooo_okay from skb_set_owner_w()
  2025-10-08 10:46 ` [PATCH RFC net-next 2/4] net: control skb->ooo_okay from skb_set_owner_w() Eric Dumazet
@ 2025-10-08 14:26   ` Neal Cardwell
  0 siblings, 0 replies; 16+ messages in thread
From: Neal Cardwell @ 2025-10-08 14:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 6:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> 15 years after Tom Herbert added skb->ooo_okay, only the TCP
> transport benefits from it.
>
> We can support other transports directly from skb_set_owner_w().
>
> If no other TX packet for this socket is in a host queue (qdisc, NIC queue),
> there is no risk of self-inflicted reordering, so we can set skb->ooo_okay.
>
> This allows netdev_pick_tx() to choose a TX queue based on XPS settings,
> instead of reusing the queue chosen at the time the first packet was sent
> for connected sockets.
>
> Tested:
>   500 concurrent UDP_RR connected UDP flows, host with 32 TX queues, XPS setup.
>
>   super_netperf 500 -t UDP_RR -H <host> -l 1000 -- -r 100,100 -Nn &
>
> This patch saves between 10% and 20% of cycles, depending on how
> the process scheduler migrates threads among cpus.
>
> Using the following bpftrace script, we can see that Qdisc/NIC tx queues
> are better used (fewer cache line misses).
>
> bpftrace -e '
> k:__dev_queue_xmit { @start[cpu] = nsecs; }
> kr:__dev_queue_xmit {
>  if (@start[cpu]) {
>     $delay = nsecs - @start[cpu];
>     delete(@start[cpu]);
>     @__dev_queue_xmit_ns = hist($delay);
>  }
> }
> END { clear(@start); }'
>
> Before:
> @__dev_queue_xmit_ns:
> [128, 256)             6 |                                                    |
> [256, 512)        116283 |                                                    |
> [512, 1K)        1888205 |@@@@@@@@@@@                                         |
> [1K, 2K)         8106167 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
> [2K, 4K)         8699293 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)         2600676 |@@@@@@@@@@@@@@@                                     |
> [8K, 16K)         721688 |@@@@                                                |
> [16K, 32K)        122995 |                                                    |
> [32K, 64K)         10639 |                                                    |
> [64K, 128K)          119 |                                                    |
> [128K, 256K)           1 |                                                    |
>
> After:
> @__dev_queue_xmit_ns:
> [128, 256)             3 |                                                    |
> [256, 512)        651112 |@@                                                  |
> [512, 1K)        8109938 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
> [1K, 2K)        16081031 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [2K, 4K)         2411692 |@@@@@@@                                             |
> [4K, 8K)           98994 |                                                    |
> [8K, 16K)           1536 |                                                    |
> [16K, 32K)           587 |                                                    |
> [32K, 64K)             2 |                                                    |
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Reviewed-by: Neal Cardwell <ncardwell@google.com>

Nice! Thanks!

neal


* Re: [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control
  2025-10-08 10:46 ` [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control Eric Dumazet
@ 2025-10-08 14:27   ` Neal Cardwell
  2025-10-08 15:21   ` Paolo Abeni
  1 sibling, 0 replies; 16+ messages in thread
From: Neal Cardwell @ 2025-10-08 14:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 6:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Add a new sysctl to control how often a queue reselection
> can happen even if a flow has a persistent queue of skbs
> in a Qdisc or NIC queue.
>
> This sysctl is used in the following patch.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Neal Cardwell <ncardwell@google.com>

Nice! Thanks!

neal


* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 13:38   ` Willem de Bruijn
@ 2025-10-08 14:38     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 14:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 6:38 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Eric Dumazet wrote:
> > This is a followup to commit 726e9e8b94b9 ("tcp: refine
> > skb->ooo_okay setting") and to the prior commit in this series
> > ("net: control skb->ooo_okay from skb_set_owner_w()").
> >
> > skb->ooo_okay might never be set for bulk flows that always
> > have at least one skb in a qdisc queue or NIC queue,
> > especially if TX completion is delayed because of a stressed cpu.
> >
> > The so-called "strange attractors" have caused many performance
> > issues; we need to do better.
> >
> > We have tried very hard to avoid reorders because TCP was
> > not dealing with them nicely a decade ago.
> >
> > Use the new net.core.txq_reselection_ms sysctl to let
> > flows follow XPS and select a more efficient queue.
> >
> > After this patch, we no longer have to make sure threads
> > are pinned to cpus; they can now be migrated without
> > adding too much spinlock/qdisc/TX completion pressure.
> >
> > The TX completion part was problematic because it added false sharing
> > on various socket fields, but also false sharing and spinlock
> > contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
> > is unfortunately not an option.
> >
> > Note for later: move sk->sk_tx_queue_mapping closer
> > to sk_tx_queue_mapping_jiffies for better cache locality.
> >
> > Tested:
> >
> > Used a host with 32 TX queues, shared by groups of 8 cores.
> > XPS setup :
> >
> > echo ff >/sys/class/net/eth1/queues/tx-0/xps_cpus
> > echo ff00 >/sys/class/net/eth1/queues/tx-1/xps_cpus
> > echo ff0000 >/sys/class/net/eth1/queues/tx-2/xps_cpus
> > echo ff000000 >/sys/class/net/eth1/queues/tx-3/xps_cpus
> > echo ff,00000000 >/sys/class/net/eth1/queues/tx-4/xps_cpus
> > echo ff00,00000000 >/sys/class/net/eth1/queues/tx-5/xps_cpus
> > echo ff0000,00000000 >/sys/class/net/eth1/queues/tx-6/xps_cpus
> > echo ff000000,00000000 >/sys/class/net/eth1/queues/tx-7/xps_cpus
> > ...
> >
> > Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15
> >
> > taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host
> >
> > Checked that only queues 0 and 1 are used as instructed by XPS :
> > tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
> >  backlog 123489410b 1890p
> >  backlog 69809026b 1064p
> >  backlog 52401054b 805p
>
> Just in case it's not clear to all readers, this implies MQ + some
> per queue qdisc, right. Here MQ + FQ.
>
> > Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
> >
> > C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done
> >
> > Set txq_reselection_ms to 1000
> > echo 1000 > /proc/sys/net/core/txq_reselection_ms
>
> Just curious: is the once per second heuristic based on anything? Is
> it likely to set another (more aggressive) value here?

Just an initial number; I do not have a strong opinion on the precise value.

More open to discussion is whether the default should be set to
<disable>,
but I suspect it will take five years for everybody to catch up.

Note that in this RFC, <disable> would be a big value; we could also decide
to reserve 0 as "this feature is disabled".
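
For reference, a minimal sketch of how a 0 sentinel could look in
sk_tx_queue_get() (hypothetical, not part of this series):

	unsigned long period =
		READ_ONCE(sock_net(sk)->core.sysctl_txq_reselection);

	/* Hypothetical: 0 means "reselection disabled", always keep
	 * the cached queue.
	 */
	if (!period)
		return val;

	if (sk_fullsock(sk) &&
	    time_is_before_jiffies(
			READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + period))
		return -1;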


* Re: [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control
  2025-10-08 10:46 ` [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control Eric Dumazet
  2025-10-08 14:27   ` Neal Cardwell
@ 2025-10-08 15:21   ` Paolo Abeni
  2025-10-08 15:25     ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2025-10-08 15:21 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

On 10/8/25 12:46 PM, Eric Dumazet wrote:
> @@ -667,6 +667,13 @@ static struct ctl_table netns_core_table[] = {
>  		.extra2		= SYSCTL_ONE,
>  		.proc_handler	= proc_dou8vec_minmax,
>  	},
> +	{
> +		.procname	= "txq_reselection_ms",
> +		.data		= &init_net.core.sysctl_txq_reselection,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_ms_jiffies,

Do we need a min value to avoid syzbot or some users tripping on bad values?

/P



* Re: [PATCH RFC net-next 3/4] net: add /proc/sys/net/core/txq_reselection_ms control
  2025-10-08 15:21   ` Paolo Abeni
@ 2025-10-08 15:25     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 15:25 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 8:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 10/8/25 12:46 PM, Eric Dumazet wrote:
> > @@ -667,6 +667,13 @@ static struct ctl_table netns_core_table[] = {
> >               .extra2         = SYSCTL_ONE,
> >               .proc_handler   = proc_dou8vec_minmax,
> >       },
> > +     {
> > +             .procname       = "txq_reselection_ms",
> > +             .data           = &init_net.core.sysctl_txq_reselection,
> > +             .maxlen         = sizeof(int),
> > +             .mode           = 0644,
> > +             .proc_handler   = proc_dointvec_ms_jiffies,
>
> Do we need a min value to avoid syzbot or some users tripping on bad values?

I was thinking about accepting all values. I do not think syzbot would
find any issue here,
even on a 32bit host.

0 could be the value to disable the feature, instead of

echo 2147483647 >/proc/sys/net/core/txq_reselection_ms


* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 10:46 ` [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues Eric Dumazet
  2025-10-08 13:38   ` Willem de Bruijn
@ 2025-10-08 15:30   ` Paolo Abeni
  2025-10-08 16:30     ` Eric Dumazet
  2025-10-13  9:18   ` Simon Horman
  2 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2025-10-08 15:30 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

On 10/8/25 12:46 PM, Eric Dumazet wrote:
> This is a followup to commit 726e9e8b94b9 ("tcp: refine
> skb->ooo_okay setting") and to the prior commit in this series
> ("net: control skb->ooo_okay from skb_set_owner_w()").
> 
> skb->ooo_okay might never be set for bulk flows that always
> have at least one skb in a qdisc queue or NIC queue,
> especially if TX completion is delayed because of a stressed cpu.
> 
> The so-called "strange attractors" have caused many performance
> issues; we need to do better.

I must admit my ignorance about the topic; do you have any reference handy?

> @@ -1984,6 +1985,14 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>  	return __sk_receive_skb(sk, skb, nested, 1, true);
>  }
>  
> +/* This helper checks if a socket is a full socket,
> + * ie _not_ a timewait or request socket.
> + */
> +static inline bool sk_fullsock(const struct sock *sk)
> +{
> +	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> +}
> +

I'm possibly low on coffee, but it looks like moving sk_fullsock()
around is not needed ?!? Possibly sk_tx_queue_get() remained inline in a
previous version of the patch?

Thanks,

Paolo



* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 15:30   ` Paolo Abeni
@ 2025-10-08 16:30     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-08 16:30 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 8, 2025 at 8:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 10/8/25 12:46 PM, Eric Dumazet wrote:
> > This is a followup to commit 726e9e8b94b9 ("tcp: refine
> > skb->ooo_okay setting") and to the prior commit in this series
> > ("net: control skb->ooo_okay from skb_set_owner_w()").
> >
> > skb->ooo_okay might never be set for bulk flows that always
> > have at least one skb in a qdisc queue or NIC queue,
> > especially if TX completion is delayed because of a stressed cpu.
> >
> > The so-called "strange attractors" have caused many performance
> > issues; we need to do better.
>
> I must admit my ignorance about the topic; do you have any reference handy?
>
> > @@ -1984,6 +1985,14 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
> >       return __sk_receive_skb(sk, skb, nested, 1, true);
> >  }
> >
> > +/* This helper checks if a socket is a full socket,
> > + * ie _not_ a timewait or request socket.
> > + */
> > +static inline bool sk_fullsock(const struct sock *sk)
> > +{
> > +     return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> > +}
> > +
>
> I'm possibly low on coffee, but it looks like moving sk_fullsock()
> around is not needed ?!? Possibly sk_tx_queue_get() remained inline in a
> previous version of the patch?

Yes, I was using it in sk_tx_queue_set(), then realized all callers
were already doing it.

Thanks.


* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-08 10:46 ` [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues Eric Dumazet
  2025-10-08 13:38   ` Willem de Bruijn
  2025-10-08 15:30   ` Paolo Abeni
@ 2025-10-13  9:18   ` Simon Horman
  2025-10-13  9:33     ` Eric Dumazet
  2 siblings, 1 reply; 16+ messages in thread
From: Simon Horman @ 2025-10-13  9:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Wed, Oct 08, 2025 at 10:46:11AM +0000, Eric Dumazet wrote:
> This is a followup to commit 726e9e8b94b9 ("tcp: refine
> skb->ooo_okay setting") and to the prior commit in this series
> ("net: control skb->ooo_okay from skb_set_owner_w()").
> 
> skb->ooo_okay might never be set for bulk flows that always
> have at least one skb in a qdisc queue or NIC queue,
> especially if TX completion is delayed because of a stressed cpu.
> 
> The so-called "strange attractors" have caused many performance
> issues; we need to do better.
> 
> We have tried very hard to avoid reorders because TCP was
> not dealing with them nicely a decade ago.
> 
> Use the new net.core.txq_reselection_ms sysctl to let
> flows follow XPS and select a more efficient queue.
> 
> After this patch, we no longer have to make sure threads
> are pinned to cpus; they can now be migrated without
> adding too much spinlock/qdisc/TX completion pressure.
> 
> The TX completion part was problematic because it added false sharing
> on various socket fields, but also false sharing and spinlock
> contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
> is unfortunately not an option.
> 
> Note for later: move sk->sk_tx_queue_mapping closer
> to sk_tx_queue_mapping_jiffies for better cache locality.
> 
> Tested:
> 
> Used a host with 32 TX queues, shared by groups of 8 cores.
> XPS setup :
> 
> echo ff >/sys/class/net/eth1/queues/tx-0/xps_cpus
> echo ff00 >/sys/class/net/eth1/queues/tx-1/xps_cpus
> echo ff0000 >/sys/class/net/eth1/queues/tx-2/xps_cpus
> echo ff000000 >/sys/class/net/eth1/queues/tx-3/xps_cpus
> echo ff,00000000 >/sys/class/net/eth1/queues/tx-4/xps_cpus
> echo ff00,00000000 >/sys/class/net/eth1/queues/tx-5/xps_cpus
> echo ff0000,00000000 >/sys/class/net/eth1/queues/tx-6/xps_cpus
> echo ff000000,00000000 >/sys/class/net/eth1/queues/tx-7/xps_cpus
> ...
> 
> Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15
> 
> taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host
> 
> Checked that only queues 0 and 1 are used as instructed by XPS :
> tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
>  backlog 123489410b 1890p
>  backlog 69809026b 1064p
>  backlog 52401054b 805p
> 
> Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
> 
> C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done
> 
> Set txq_reselection_ms to 1000
> echo 1000 > /proc/sys/net/core/txq_reselection_ms
> 
> Check that the flows have migrated nicely:
> 
> tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
>  backlog 130508314b 1916p
>  backlog 8584380b 126p
>  backlog 8584380b 126p
>  backlog 8379990b 123p
>  backlog 8584380b 126p
>  backlog 8487484b 125p
>  backlog 8584380b 126p
>  backlog 8448120b 124p
>  backlog 8584380b 126p
>  backlog 8720640b 128p
>  backlog 8856900b 130p
>  backlog 8584380b 126p
>  backlog 8652510b 127p
>  backlog 8448120b 124p
>  backlog 8516250b 125p
>  backlog 7834950b 115p
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Very nice :)

> ---
>  include/net/sock.h | 40 +++++++++++++++++++---------------------
>  net/core/dev.c     | 27 +++++++++++++++++++++++++--
>  2 files changed, 44 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2794bc5c565424491a064049d3d76c3fb7ba1ed8..61f92bb03e00d7167cccfe70da16174f2b40f6de 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -485,6 +485,7 @@ struct sock {
>  	unsigned long		sk_pacing_rate; /* bytes per second */
>  	atomic_t		sk_zckey;
>  	atomic_t		sk_tskey;
> +	unsigned long		sk_tx_queue_mapping_jiffies;

nit: please add sk_tx_queue_mapping_jiffies to Kernel doc

>  	__cacheline_group_end(sock_write_tx);
>  
>  	__cacheline_group_begin(sock_read_tx);


* Re: [PATCH RFC net-next 4/4] net: allow busy connected flows to switch tx queues
  2025-10-13  9:18   ` Simon Horman
@ 2025-10-13  9:33     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-10-13  9:33 UTC (permalink / raw)
  To: Simon Horman
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Mon, Oct 13, 2025 at 2:18 AM Simon Horman <horms@kernel.org> wrote:
>
 Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> Very nice :)

Yep ;)

>
> > ---
> >  include/net/sock.h | 40 +++++++++++++++++++---------------------
> >  net/core/dev.c     | 27 +++++++++++++++++++++++++--
> >  2 files changed, 44 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 2794bc5c565424491a064049d3d76c3fb7ba1ed8..61f92bb03e00d7167cccfe70da16174f2b40f6de 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -485,6 +485,7 @@ struct sock {
> >       unsigned long           sk_pacing_rate; /* bytes per second */
> >       atomic_t                sk_zckey;
> >       atomic_t                sk_tskey;
> > +     unsigned long           sk_tx_queue_mapping_jiffies;
>
> nit: please add sk_tx_queue_mapping_jiffies to Kernel doc

Will do this, thank you !

