netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/4] tcp: scale connect() under pressure
@ 2025-03-02 12:42 Eric Dumazet
  2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Eric Dumazet @ 2025-03-02 12:42 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet,
	Eric Dumazet

Adoption of bhash2 in linux-6.1 made some operations almost twice
as expensive, because of the additional locks.

This series adds RCU in __inet_hash_connect() to help the
case where many attempts need to be made before finding
an available 4-tuple.
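
In short, check_established() now does a lockless pass over the ehash
chain first, and only takes the bucket spinlock when the 4-tuple might
actually be usable. A condensed sketch of the idea, lifted from patch
1/4 (see that patch for the exact code):

	rcu_read_lock();
	sk_nulls_for_each(sk2, node, &head->chain) {
		if (sk2->sk_hash != hash ||
		    !inet_match(net, sk2, acookie, ports, dif, sdif))
			continue;
		if (sk2->sk_state == TCP_TIME_WAIT)
			break;		/* possibly reusable: fall back to the lock */
		rcu_read_unlock();
		return -EADDRNOTAVAIL;	/* 4-tuple busy, no spinlock taken at all */
	}
	rcu_read_unlock();

	/* slow path, unchanged: recheck under the ehash bucket lock */
	lock = inet_ehash_lockp(hinfo, hash);
	spin_lock(lock);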

This brings a ~200 % improvement in this experiment:

Server:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

  utime_start=0.288582
  utime_end=1.548707
  stime_start=20.637138
  stime_end=2002.489845
  num_transactions=484453
  latency_min=0.156279245
  latency_max=20.922042756
  latency_mean=1.546521274
  latency_stddev=3.936005194
  num_samples=312537
  throughput=47426.00

perf top on the client:

 49.54%  [kernel]       [k] _raw_spin_lock
 25.87%  [kernel]       [k] _raw_spin_lock_bh
  5.97%  [kernel]       [k] queued_spin_lock_slowpath
  5.67%  [kernel]       [k] __inet_hash_connect
  3.53%  [kernel]       [k] __inet6_check_established
  3.48%  [kernel]       [k] inet6_ehashfn
  0.64%  [kernel]       [k] rcu_all_qs

After this series:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853   # Nice reduction of latency metrics
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80       # 194 % increase

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

Eric Dumazet (4):
  tcp: use RCU in __inet{6}_check_established()
  tcp: optimize inet_use_bhash2_on_bind()
  tcp: add RCU management to inet_bind_bucket
  tcp: use RCU lookup in __inet_hash_connect()

 include/net/inet_hashtables.h   |  7 ++--
 net/ipv4/inet_connection_sock.c |  8 ++--
 net/ipv4/inet_hashtables.c      | 65 ++++++++++++++++++++++++---------
 net/ipv4/inet_timewait_sock.c   |  2 +-
 net/ipv6/inet6_hashtables.c     | 23 ++++++++++--
 5 files changed, 75 insertions(+), 30 deletions(-)

-- 
2.48.1.711.g2feabab25a-goog


* [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established()
  2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
@ 2025-03-02 12:42 ` Eric Dumazet
  2025-03-03  0:24   ` Jason Xing
  2025-03-04  0:20   ` Kuniyuki Iwashima
  2025-03-02 12:42 ` [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind() Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2025-03-02 12:42 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet,
	Eric Dumazet

When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().

This patch adds an RCU lookup to avoid the spinlock
acquisition when the 4-tuple is found in the hash table.

Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket;
this will be fixed later in this series.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
---
 net/ipv4/inet_hashtables.c  | 19 ++++++++++++++++---
 net/ipv6/inet6_hashtables.c | 19 ++++++++++++++++---
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..46d39aa2199ec3a405b50e8e85130e990d2c26b7 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -551,11 +551,24 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	unsigned int hash = inet_ehashfn(net, daddr, lport,
 					 saddr, inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
-	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
-	struct sock *sk2;
-	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw = NULL;
+	const struct hlist_nulls_node *node;
+	struct sock *sk2;
+	spinlock_t *lock;
+
+	rcu_read_lock();
+	sk_nulls_for_each(sk2, node, &head->chain) {
+		if (sk2->sk_hash != hash ||
+		    !inet_match(net, sk2, acookie, ports, dif, sdif))
+			continue;
+		if (sk2->sk_state == TCP_TIME_WAIT)
+			break;
+		rcu_read_unlock();
+		return -EADDRNOTAVAIL;
+	}
+	rcu_read_unlock();
 
+	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
 
 	sk_nulls_for_each(sk2, node, &head->chain) {
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 9ec05e354baa69d14e88da37f5a9fce11e874e35..3604a5cae5d29a25d24f9513308334ff8e64b083 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -276,11 +276,24 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	const unsigned int hash = inet6_ehashfn(net, daddr, lport, saddr,
 						inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
-	spinlock_t *lock = inet_ehash_lockp(hinfo, hash);
-	struct sock *sk2;
-	const struct hlist_nulls_node *node;
 	struct inet_timewait_sock *tw = NULL;
+	const struct hlist_nulls_node *node;
+	struct sock *sk2;
+	spinlock_t *lock;
+
+	rcu_read_lock();
+	sk_nulls_for_each(sk2, node, &head->chain) {
+		if (sk2->sk_hash != hash ||
+		    !inet6_match(net, sk2, saddr, daddr, ports, dif, sdif))
+			continue;
+		if (sk2->sk_state == TCP_TIME_WAIT)
+			break;
+		rcu_read_unlock();
+		return -EADDRNOTAVAIL;
+	}
+	rcu_read_unlock();
 
+	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
 
 	sk_nulls_for_each(sk2, node, &head->chain) {
-- 
2.48.1.711.g2feabab25a-goog


* [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind()
  2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
  2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
@ 2025-03-02 12:42 ` Eric Dumazet
  2025-03-03  0:24   ` Jason Xing
  2025-03-04  0:22   ` Kuniyuki Iwashima
  2025-03-02 12:42 ` [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2025-03-02 12:42 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet,
	Eric Dumazet

There is no reason to call ipv6_addr_type().

Instead, use highly optimized ipv6_addr_any() and ipv6_addr_v4mapped().
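
For reference, both helpers boil down to a couple of word compares
(simplified sketches below, with made-up names; the real inlines live in
include/net/ipv6.h), whereas ipv6_addr_type() has to classify the address
against every possible address type:

/* roughly what ipv6_addr_any() checks: the unspecified address "::" */
static inline bool addr_any_sketch(const struct in6_addr *a)
{
	return (a->s6_addr32[0] | a->s6_addr32[1] |
		a->s6_addr32[2] | a->s6_addr32[3]) == 0;
}

/* roughly what ipv6_addr_v4mapped() checks: "::ffff:a.b.c.d" */
static inline bool addr_v4mapped_sketch(const struct in6_addr *a)
{
	return (a->s6_addr32[0] | a->s6_addr32[1]) == 0 &&
	       a->s6_addr32[2] == htonl(0x0000ffff);
}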

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_connection_sock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index bf9ce0c196575910b4b03fca13001979d4326297..b4e514da22b64f02cbd9f6c10698db359055e0cc 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -157,12 +157,10 @@ static bool inet_use_bhash2_on_bind(const struct sock *sk)
 {
 #if IS_ENABLED(CONFIG_IPV6)
 	if (sk->sk_family == AF_INET6) {
-		int addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
-
-		if (addr_type == IPV6_ADDR_ANY)
+		if (ipv6_addr_any(&sk->sk_v6_rcv_saddr))
 			return false;
 
-		if (addr_type != IPV6_ADDR_MAPPED)
+		if (!ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
 			return true;
 	}
 #endif
-- 
2.48.1.711.g2feabab25a-goog


* [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket
  2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
  2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
  2025-03-02 12:42 ` [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind() Eric Dumazet
@ 2025-03-02 12:42 ` Eric Dumazet
  2025-03-03  0:57   ` Jason Xing
  2025-03-04  0:43   ` Kuniyuki Iwashima
  2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
  2025-03-05  2:00 ` [PATCH net-next 0/4] tcp: scale connect() under pressure patchwork-bot+netdevbpf
  4 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2025-03-02 12:42 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet,
	Eric Dumazet

Add RCU protection to inet_bind_bucket structure.

- Add rcu_head field to the structure definition.

- Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy()
  first argument.

- Use hlist_del_rcu() and hlist_add_head_rcu() methods.
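
Taken together, this is the usual RCU lifetime pattern for a hash-chained
object; a generic sketch (made-up struct and helpers, not the exact
inet_bind_bucket code):

struct bucket {
	struct hlist_node	node;
	struct rcu_head		rcu;
	unsigned short		port;
};

/* writer side, called under the bhash bucket lock */
static void bucket_destroy(struct bucket *b)
{
	hlist_del_rcu(&b->node);	/* unlink; concurrent readers may still see it */
	kfree_rcu(b, rcu);		/* actual kfree() deferred past a grace period */
}

/* reader side: lockless lookup, result only used inside the read section */
static bool port_in_use(struct hlist_head *chain, unsigned short port)
{
	struct bucket *b;
	bool found = false;

	rcu_read_lock();
	hlist_for_each_entry_rcu(b, chain, node) {
		if (b->port == port) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}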

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/inet_hashtables.h   |  4 ++--
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/inet_hashtables.c      | 14 +++++++-------
 net/ipv4/inet_timewait_sock.c   |  2 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 5eea47f135a421ce8275d4cd83c5771b3f448e5c..73c0e4087fd1a6d0d2a40ab0394165e07b08ed6d 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -89,6 +89,7 @@ struct inet_bind_bucket {
 	bool			fast_ipv6_only;
 	struct hlist_node	node;
 	struct hlist_head	bhash2;
+	struct rcu_head		rcu;
 };
 
 struct inet_bind2_bucket {
@@ -226,8 +227,7 @@ struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
 			const unsigned short snum, int l3mdev);
-void inet_bind_bucket_destroy(struct kmem_cache *cachep,
-			      struct inet_bind_bucket *tb);
+void inet_bind_bucket_destroy(struct inet_bind_bucket *tb);
 
 bool inet_bind_bucket_match(const struct inet_bind_bucket *tb,
 			    const struct net *net, unsigned short port,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b4e514da22b64f02cbd9f6c10698db359055e0cc..e93c660340770a76446f97617ba23af32dc136fb 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -598,7 +598,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		if (bhash2_created)
 			inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, tb2);
 		if (bhash_created)
-			inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+			inet_bind_bucket_destroy(tb);
 	}
 	if (head2_lock_acquired)
 		spin_unlock(&head2->lock);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 46d39aa2199ec3a405b50e8e85130e990d2c26b7..b737e13f8459c53428980221355344327c4bc8dd 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -76,7 +76,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
 		INIT_HLIST_HEAD(&tb->bhash2);
-		hlist_add_head(&tb->node, &head->chain);
+		hlist_add_head_rcu(&tb->node, &head->chain);
 	}
 	return tb;
 }
@@ -84,11 +84,11 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 /*
  * Caller must hold hashbucket lock for this tb with local BH disabled
  */
-void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb)
+void inet_bind_bucket_destroy(struct inet_bind_bucket *tb)
 {
 	if (hlist_empty(&tb->bhash2)) {
-		__hlist_del(&tb->node);
-		kmem_cache_free(cachep, tb);
+		hlist_del_rcu(&tb->node);
+		kfree_rcu(tb, rcu);
 	}
 }
 
@@ -201,7 +201,7 @@ static void __inet_put_port(struct sock *sk)
 	}
 	spin_unlock(&head2->lock);
 
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(tb);
 	spin_unlock(&head->lock);
 }
 
@@ -285,7 +285,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 
 error:
 	if (created_inet_bind_bucket)
-		inet_bind_bucket_destroy(table->bind_bucket_cachep, tb);
+		inet_bind_bucket_destroy(tb);
 	spin_unlock(&head2->lock);
 	spin_unlock(&head->lock);
 	return -ENOMEM;
@@ -1162,7 +1162,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	spin_unlock(&head2->lock);
 	if (tb_created)
-		inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+		inet_bind_bucket_destroy(tb);
 	spin_unlock(&head->lock);
 
 	if (tw)
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4082701f78f1a0913ba47c1741378..aded4bf1bc16d9f1d9fd80d60f41027dd53f38eb 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -39,7 +39,7 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 	tw->tw_tb = NULL;
 	tw->tw_tb2 = NULL;
 	inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(tb);
 
 	__sock_put((struct sock *)tw);
 }
-- 
2.48.1.711.g2feabab25a-goog


* [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-03-02 12:42 ` [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket Eric Dumazet
@ 2025-03-02 12:42 ` Eric Dumazet
  2025-03-03  1:07   ` Jason Xing
                     ` (2 more replies)
  2025-03-05  2:00 ` [PATCH net-next 0/4] tcp: scale connect() under pressure patchwork-bot+netdevbpf
  4 siblings, 3 replies; 17+ messages in thread
From: Eric Dumazet @ 2025-03-02 12:42 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet,
	Eric Dumazet

When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
the many spin_lock_bh(&head->lock) performed in its loop.

This patch adds an RCU lookup to avoid the spinlock cost.

check_established() gets a new @rcu_lookup argument.
First reason is to not make any changes while head->lock
is not held.
Second reason is to not make this RCU lookup a second time
after the spinlock has been acquired.
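
The port-search loop in __inet_hash_connect() then looks roughly like
this (condensed from the diff below, comments added):

	rcu_read_lock();
	hlist_for_each_entry_rcu(tb, &head->chain, node) {
		if (!inet_bind_bucket_match(tb, net, port, l3mdev))
			continue;
		if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
			rcu_read_unlock();
			goto next_port;		/* port unusable, no lock taken */
		}
		if (!check_established(death_row, sk, port, &tw, true))
			break;			/* looks free: recheck under locks */
		rcu_read_unlock();
		goto next_port;			/* 4-tuple busy, no lock taken */
	}
	rcu_read_unlock();

	spin_lock_bh(&head->lock);
	/* ... same walk again, this time with rcu_lookup == false ... */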

Tested:

Server:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

  utime_start=0.288582
  utime_end=1.548707
  stime_start=20.637138
  stime_end=2002.489845
  num_transactions=484453
  latency_min=0.156279245
  latency_max=20.922042756
  latency_mean=1.546521274
  latency_stddev=3.936005194
  num_samples=312537
  throughput=47426.00

perf top on the client:

 49.54%  [kernel]       [k] _raw_spin_lock
 25.87%  [kernel]       [k] _raw_spin_lock_bh
  5.97%  [kernel]       [k] queued_spin_lock_slowpath
  5.67%  [kernel]       [k] __inet_hash_connect
  3.53%  [kernel]       [k] __inet6_check_established
  3.48%  [kernel]       [k] inet6_ehashfn
  0.64%  [kernel]       [k] rcu_all_qs

After this series:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853  # Nice reduction of latency metrics
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80      # 190 % increase

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/inet_hashtables.h |  3 +-
 net/ipv4/inet_hashtables.c    | 52 +++++++++++++++++++++++------------
 net/ipv6/inet6_hashtables.c   | 24 ++++++++--------
 3 files changed, 50 insertions(+), 29 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 73c0e4087fd1a6d0d2a40ab0394165e07b08ed6d..b12797f13c9a3d66fab99c877d059f9c29c30d11 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -529,7 +529,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			struct sock *sk, u64 port_offset,
 			int (*check_established)(struct inet_timewait_death_row *,
 						 struct sock *, __u16,
-						 struct inet_timewait_sock **));
+						 struct inet_timewait_sock **,
+						 bool rcu_lookup));
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b737e13f8459c53428980221355344327c4bc8dd..d1b5f45ee718410fdf3e78c113c7ebd4a1ddba40 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -537,7 +537,8 @@ EXPORT_SYMBOL_GPL(__inet_lookup_established);
 /* called with local bh disabled */
 static int __inet_check_established(struct inet_timewait_death_row *death_row,
 				    struct sock *sk, __u16 lport,
-				    struct inet_timewait_sock **twp)
+				    struct inet_timewait_sock **twp,
+				    bool rcu_lookup)
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_sock *inet = inet_sk(sk);
@@ -556,17 +557,17 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	struct sock *sk2;
 	spinlock_t *lock;
 
-	rcu_read_lock();
-	sk_nulls_for_each(sk2, node, &head->chain) {
-		if (sk2->sk_hash != hash ||
-		    !inet_match(net, sk2, acookie, ports, dif, sdif))
-			continue;
-		if (sk2->sk_state == TCP_TIME_WAIT)
-			break;
-		rcu_read_unlock();
-		return -EADDRNOTAVAIL;
+	if (rcu_lookup) {
+		sk_nulls_for_each(sk2, node, &head->chain) {
+			if (sk2->sk_hash != hash ||
+			    !inet_match(net, sk2, acookie, ports, dif, sdif))
+				continue;
+			if (sk2->sk_state == TCP_TIME_WAIT)
+				break;
+			return -EADDRNOTAVAIL;
+		}
+		return 0;
 	}
-	rcu_read_unlock();
 
 	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
@@ -1007,7 +1008,8 @@ static u32 *table_perturb;
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		struct sock *sk, u64 port_offset,
 		int (*check_established)(struct inet_timewait_death_row *,
-			struct sock *, __u16, struct inet_timewait_sock **))
+			struct sock *, __u16, struct inet_timewait_sock **,
+			bool rcu_lookup))
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_bind_hashbucket *head, *head2;
@@ -1025,7 +1027,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	if (port) {
 		local_bh_disable();
-		ret = check_established(death_row, sk, port, NULL);
+		ret = check_established(death_row, sk, port, NULL, false);
 		local_bh_enable();
 		return ret;
 	}
@@ -1061,6 +1063,21 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			continue;
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(tb, &head->chain, node) {
+			if (!inet_bind_bucket_match(tb, net, port, l3mdev))
+				continue;
+			if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
+				rcu_read_unlock();
+				goto next_port;
+			}
+			if (!check_established(death_row, sk, port, &tw, true))
+				break;
+			rcu_read_unlock();
+			goto next_port;
+		}
+		rcu_read_unlock();
+
 		spin_lock_bh(&head->lock);
 
 		/* Does not bother with rcv_saddr checks, because
@@ -1070,12 +1087,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
-					goto next_port;
+					goto next_port_unlock;
 				WARN_ON(hlist_empty(&tb->bhash2));
 				if (!check_established(death_row, sk,
-						       port, &tw))
+						       port, &tw, false))
 					goto ok;
-				goto next_port;
+				goto next_port_unlock;
 			}
 		}
 
@@ -1089,8 +1106,9 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		tb->fastreuse = -1;
 		tb->fastreuseport = -1;
 		goto ok;
-next_port:
+next_port_unlock:
 		spin_unlock_bh(&head->lock);
+next_port:
 		cond_resched();
 	}
 
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3604a5cae5d29a25d24f9513308334ff8e64b083..9be315496459fcb391123a07ac887e2f59d27360 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -263,7 +263,8 @@ EXPORT_SYMBOL_GPL(inet6_lookup);
 
 static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 				     struct sock *sk, const __u16 lport,
-				     struct inet_timewait_sock **twp)
+				     struct inet_timewait_sock **twp,
+				     bool rcu_lookup)
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_sock *inet = inet_sk(sk);
@@ -281,17 +282,18 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	struct sock *sk2;
 	spinlock_t *lock;
 
-	rcu_read_lock();
-	sk_nulls_for_each(sk2, node, &head->chain) {
-		if (sk2->sk_hash != hash ||
-		    !inet6_match(net, sk2, saddr, daddr, ports, dif, sdif))
-			continue;
-		if (sk2->sk_state == TCP_TIME_WAIT)
-			break;
-		rcu_read_unlock();
-		return -EADDRNOTAVAIL;
+	if (rcu_lookup) {
+		sk_nulls_for_each(sk2, node, &head->chain) {
+			if (sk2->sk_hash != hash ||
+			    !inet6_match(net, sk2, saddr, daddr,
+					 ports, dif, sdif))
+				continue;
+			if (sk2->sk_state == TCP_TIME_WAIT)
+				break;
+			return -EADDRNOTAVAIL;
+		}
+		return 0;
 	}
-	rcu_read_unlock();
 
 	lock = inet_ehash_lockp(hinfo, hash);
 	spin_lock(lock);
-- 
2.48.1.711.g2feabab25a-goog


* Re: [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established()
  2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
@ 2025-03-03  0:24   ` Jason Xing
  2025-03-04  0:20   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-03-03  0:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> __inet_check_established() and/or __inet6_check_established().
>
> This patch adds an RCU lookup to avoid the spinlock
> acquisition when the 4-tuple is found in the hash table.
>
> Note that there are still spin_lock_bh() calls in
> __inet_hash_connect() to protect inet_bind_hashbucket;
> this will be fixed later in this series.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Yesterday, I did a few tests on this single patch and managed to see a
~7% increase in performance on my virtual machine[1] :) Thanks!

Tested-by: Jason Xing <kerneljasonxing@gmail.com>

[1]: https://lore.kernel.org/all/CAL+tcoBAVmTk_JBX=OEBqZZuoSzZd8bjuw9rgwRLMd9fvZOSkA@mail.gmail.com/

* Re: [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind()
  2025-03-02 12:42 ` [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind() Eric Dumazet
@ 2025-03-03  0:24   ` Jason Xing
  2025-03-04  0:22   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-03-03  0:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> There is no reason to call ipv6_addr_type().
>
> Instead, use highly optimized ipv6_addr_any() and ipv6_addr_v4mapped().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

* Re: [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket
  2025-03-02 12:42 ` [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket Eric Dumazet
@ 2025-03-03  0:57   ` Jason Xing
  2025-03-04  0:43   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-03-03  0:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Add RCU protection to inet_bind_bucket structure.
>
> - Add rcu_head field to the structure definition.
>
> - Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy()
>   first argument.
>
> - Use hlist_del_rcu() and hlist_add_head_rcu() methods.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks!

* Re: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
@ 2025-03-03  1:07   ` Jason Xing
  2025-03-03 10:25     ` Eric Dumazet
  2025-03-04  0:51   ` Kuniyuki Iwashima
  2025-03-10 14:03   ` kernel test robot
  2 siblings, 1 reply; 17+ messages in thread
From: Jason Xing @ 2025-03-03  1:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> the many spin_lock_bh(&head->lock) performed in its loop.
>
> This patch adds an RCU lookup to avoid the spinlock cost.
>
> check_established() gets a new @rcu_lookup argument.
> First reason is to not make any changes while head->lock
> is not held.
> Second reason is to not make this RCU lookup a second time
> after the spinlock has been acquired.
>
> Tested:
>
> Server:
>
> ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
>
> Client:
>
> ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
>
> Before series:
>
>   utime_start=0.288582
>   utime_end=1.548707
>   stime_start=20.637138
>   stime_end=2002.489845
>   num_transactions=484453
>   latency_min=0.156279245
>   latency_max=20.922042756
>   latency_mean=1.546521274
>   latency_stddev=3.936005194
>   num_samples=312537
>   throughput=47426.00
>
> perf top on the client:
>
>  49.54%  [kernel]       [k] _raw_spin_lock
>  25.87%  [kernel]       [k] _raw_spin_lock_bh
>   5.97%  [kernel]       [k] queued_spin_lock_slowpath
>   5.67%  [kernel]       [k] __inet_hash_connect
>   3.53%  [kernel]       [k] __inet6_check_established
>   3.48%  [kernel]       [k] inet6_ehashfn
>   0.64%  [kernel]       [k] rcu_all_qs
>
> After this series:
>
>   utime_start=0.271607
>   utime_end=3.847111
>   stime_start=18.407684
>   stime_end=1997.485557
>   num_transactions=1350742
>   latency_min=0.014131929
>   latency_max=17.895073144
>   latency_mean=0.505675853  # Nice reduction of latency metrics
>   latency_stddev=2.125164772
>   num_samples=307884
>   throughput=139866.80      # 190 % increase
>
> perf top on client:
>
>  56.86%  [kernel]       [k] __inet6_check_established
>  17.96%  [kernel]       [k] __inet_hash_connect
>  13.88%  [kernel]       [k] inet6_ehashfn
>   2.52%  [kernel]       [k] rcu_all_qs
>   2.01%  [kernel]       [k] __cond_resched
>   0.41%  [kernel]       [k] _raw_spin_lock
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>

I tested only on my virtual machine (with 64 cpus) and got around a
100% performance increase, which is really good. I also noticed
that the spin lock hotspot is gone :)

Thanks for working on this!!!

* Re: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-03  1:07   ` Jason Xing
@ 2025-03-03 10:25     ` Eric Dumazet
  2025-03-03 10:39       ` Jason Xing
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2025-03-03 10:25 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Mon, Mar 3, 2025 at 2:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > When __inet_hash_connect() has to try many 4-tuples before
> > finding an available one, we see a high spinlock cost from
> > the many spin_lock_bh(&head->lock) performed in its loop.
> >
> > This patch adds an RCU lookup to avoid the spinlock cost.
> >
> > check_established() gets a new @rcu_lookup argument.
> > First reason is to not make any changes while head->lock
> > is not held.
> > Second reason is to not make this RCU lookup a second time
> > after the spinlock has been acquired.
> >
> > Tested:
> >
> > Server:
> >
> > ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> >
> > Client:
> >
> > ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> >
> > Before series:
> >
> >   utime_start=0.288582
> >   utime_end=1.548707
> >   stime_start=20.637138
> >   stime_end=2002.489845
> >   num_transactions=484453
> >   latency_min=0.156279245
> >   latency_max=20.922042756
> >   latency_mean=1.546521274
> >   latency_stddev=3.936005194
> >   num_samples=312537
> >   throughput=47426.00
> >
> > perf top on the client:
> >
> >  49.54%  [kernel]       [k] _raw_spin_lock
> >  25.87%  [kernel]       [k] _raw_spin_lock_bh
> >   5.97%  [kernel]       [k] queued_spin_lock_slowpath
> >   5.67%  [kernel]       [k] __inet_hash_connect
> >   3.53%  [kernel]       [k] __inet6_check_established
> >   3.48%  [kernel]       [k] inet6_ehashfn
> >   0.64%  [kernel]       [k] rcu_all_qs
> >
> > After this series:
> >
> >   utime_start=0.271607
> >   utime_end=3.847111
> >   stime_start=18.407684
> >   stime_end=1997.485557
> >   num_transactions=1350742
> >   latency_min=0.014131929
> >   latency_max=17.895073144
> >   latency_mean=0.505675853  # Nice reduction of latency metrics
> >   latency_stddev=2.125164772
> >   num_samples=307884
> >   throughput=139866.80      # 190 % increase
> >
> > perf top on client:
> >
> >  56.86%  [kernel]       [k] __inet6_check_established
> >  17.96%  [kernel]       [k] __inet_hash_connect
> >  13.88%  [kernel]       [k] inet6_ehashfn
> >   2.52%  [kernel]       [k] rcu_all_qs
> >   2.01%  [kernel]       [k] __cond_resched
> >   0.41%  [kernel]       [k] _raw_spin_lock
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>
> I tested only on my virtual machine (with 64 cpus) and got around a
> 100% performance increase, which is really good. I also noticed
> that the spin lock hotspot is gone :)
>
> Thanks for working on this!!!

Hold your breath, I have two additional patches bringing the perf to:

local_throughput=353891          #   646 % improvement

I will wait for this first series to be merged before sending these.

* Re: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-03 10:25     ` Eric Dumazet
@ 2025-03-03 10:39       ` Jason Xing
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Xing @ 2025-03-03 10:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Simon Horman, netdev, eric.dumazet

On Mon, Mar 3, 2025 at 6:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 3, 2025 at 2:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Sun, Mar 2, 2025 at 8:42 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > When __inet_hash_connect() has to try many 4-tuples before
> > > finding an available one, we see a high spinlock cost from
> > > the many spin_lock_bh(&head->lock) performed in its loop.
> > >
> > > This patch adds an RCU lookup to avoid the spinlock cost.
> > >
> > > check_established() gets a new @rcu_lookup argument.
> > > First reason is to not make any changes while head->lock
> > > is not held.
> > > Second reason is to not make this RCU lookup a second time
> > > after the spinlock has been acquired.
> > >
> > > Tested:
> > >
> > > Server:
> > >
> > > ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> > >
> > > Client:
> > >
> > > ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> > >
> > > Before series:
> > >
> > >   utime_start=0.288582
> > >   utime_end=1.548707
> > >   stime_start=20.637138
> > >   stime_end=2002.489845
> > >   num_transactions=484453
> > >   latency_min=0.156279245
> > >   latency_max=20.922042756
> > >   latency_mean=1.546521274
> > >   latency_stddev=3.936005194
> > >   num_samples=312537
> > >   throughput=47426.00
> > >
> > > perf top on the client:
> > >
> > >  49.54%  [kernel]       [k] _raw_spin_lock
> > >  25.87%  [kernel]       [k] _raw_spin_lock_bh
> > >   5.97%  [kernel]       [k] queued_spin_lock_slowpath
> > >   5.67%  [kernel]       [k] __inet_hash_connect
> > >   3.53%  [kernel]       [k] __inet6_check_established
> > >   3.48%  [kernel]       [k] inet6_ehashfn
> > >   0.64%  [kernel]       [k] rcu_all_qs
> > >
> > > After this series:
> > >
> > >   utime_start=0.271607
> > >   utime_end=3.847111
> > >   stime_start=18.407684
> > >   stime_end=1997.485557
> > >   num_transactions=1350742
> > >   latency_min=0.014131929
> > >   latency_max=17.895073144
> > >   latency_mean=0.505675853  # Nice reduction of latency metrics
> > >   latency_stddev=2.125164772
> > >   num_samples=307884
> > >   throughput=139866.80      # 190 % increase
> > >
> > > perf top on client:
> > >
> > >  56.86%  [kernel]       [k] __inet6_check_established
> > >  17.96%  [kernel]       [k] __inet_hash_connect
> > >  13.88%  [kernel]       [k] inet6_ehashfn
> > >   2.52%  [kernel]       [k] rcu_all_qs
> > >   2.01%  [kernel]       [k] __cond_resched
> > >   0.41%  [kernel]       [k] _raw_spin_lock
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > I tested only on my virtual machine (with 64 cpus) and got around a
> > 100% performance increase, which is really good. I also noticed
> > that the spin lock hotspot is gone :)
> >
> > Thanks for working on this!!!
>
> Hold your breath, I have two additional patches bringing the perf to:
>
> local_throughput=353891          #   646 % improvement
>
> I will wait for this first series to be merged before sending these.

OMG, I'm really shocked... It would be super cool :D

Thanks,
Jason

* Re: [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established()
  2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
  2025-03-03  0:24   ` Jason Xing
@ 2025-03-04  0:20   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-04  0:20 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kerneljasonxing, kuba, kuniyu,
	ncardwell, netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Sun,  2 Mar 2025 12:42:34 +0000
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> __inet_check_established() and/or __inet6_check_established().
> 
> This patch adds an RCU lookup to avoid the spinlock
> acquisition when the 4-tuple is found in the hash table.
> 
> Note that there are still spin_lock_bh() calls in
> __inet_hash_connect() to protect inet_bind_hashbucket;
> this will be fixed later in this series.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

* Re: [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind()
  2025-03-02 12:42 ` [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind() Eric Dumazet
  2025-03-03  0:24   ` Jason Xing
@ 2025-03-04  0:22   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-04  0:22 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kerneljasonxing, kuba, kuniyu,
	ncardwell, netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Sun,  2 Mar 2025 12:42:35 +0000
> There is no reason to call ipv6_addr_type().
> 
> Instead, use highly optimized ipv6_addr_any() and ipv6_addr_v4mapped().
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

* Re: [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket
  2025-03-02 12:42 ` [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket Eric Dumazet
  2025-03-03  0:57   ` Jason Xing
@ 2025-03-04  0:43   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-04  0:43 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kerneljasonxing, kuba, kuniyu,
	ncardwell, netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Sun,  2 Mar 2025 12:42:36 +0000
> Add RCU protection to inet_bind_bucket structure.
> 
> - Add rcu_head field to the structure definition.
> 
> - Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy()
>   first argument.
> 
> - Use hlist_del_rcu() and hlist_add_head_rcu() methods.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

* Re: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
  2025-03-03  1:07   ` Jason Xing
@ 2025-03-04  0:51   ` Kuniyuki Iwashima
  2025-03-10 14:03   ` kernel test robot
  2 siblings, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-04  0:51 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kerneljasonxing, kuba, kuniyu,
	ncardwell, netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Sun,  2 Mar 2025 12:42:37 +0000
> When __inet_hash_connect() has to try many 4-tuples before
> finding an available one, we see a high spinlock cost from
> the many spin_lock_bh(&head->lock) performed in its loop.
> 
> This patch adds an RCU lookup to avoid the spinlock cost.
> 
> check_established() gets a new @rcu_lookup argument.
> First reason is to not make any changes while head->lock
> is not held.
> Second reason is to not make this RCU lookup a second time
> after the spinlock has been acquired.
> 
> Tested:
> 
> Server:
> 
> ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> 
> Client:
> 
> ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> 
> Before series:
> 
>   utime_start=0.288582
>   utime_end=1.548707
>   stime_start=20.637138
>   stime_end=2002.489845
>   num_transactions=484453
>   latency_min=0.156279245
>   latency_max=20.922042756
>   latency_mean=1.546521274
>   latency_stddev=3.936005194
>   num_samples=312537
>   throughput=47426.00
> 
> perf top on the client:
> 
>  49.54%  [kernel]       [k] _raw_spin_lock
>  25.87%  [kernel]       [k] _raw_spin_lock_bh
>   5.97%  [kernel]       [k] queued_spin_lock_slowpath
>   5.67%  [kernel]       [k] __inet_hash_connect
>   3.53%  [kernel]       [k] __inet6_check_established
>   3.48%  [kernel]       [k] inet6_ehashfn
>   0.64%  [kernel]       [k] rcu_all_qs
> 
> After this series:
> 
>   utime_start=0.271607
>   utime_end=3.847111
>   stime_start=18.407684
>   stime_end=1997.485557
>   num_transactions=1350742
>   latency_min=0.014131929
>   latency_max=17.895073144
>   latency_mean=0.505675853  # Nice reduction of latency metrics
>   latency_stddev=2.125164772
>   num_samples=307884
>   throughput=139866.80      # 190 % increase
> 
> perf top on client:
> 
>  56.86%  [kernel]       [k] __inet6_check_established
>  17.96%  [kernel]       [k] __inet_hash_connect
>  13.88%  [kernel]       [k] inet6_ehashfn
>   2.52%  [kernel]       [k] rcu_all_qs
>   2.01%  [kernel]       [k] __cond_resched
>   0.41%  [kernel]       [k] _raw_spin_lock
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks for the great optimisation!

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

* Re: [PATCH net-next 0/4] tcp: scale connect() under pressure
  2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
                   ` (3 preceding siblings ...)
  2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
@ 2025-03-05  2:00 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-03-05  2:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, ncardwell, kuniyu, kerneljasonxing, horms,
	netdev, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun,  2 Mar 2025 12:42:33 +0000 you wrote:
> Adoption of bhash2 in linux-6.1 made some operations almost twice
> as expensive, because of the additional locks.
> 
> This series adds RCU in __inet_hash_connect() to help the
> case where many attempts need to be made before finding
> an available 4-tuple.
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] tcp: use RCU in __inet{6}_check_established()
    https://git.kernel.org/netdev/net-next/c/ae9d5b19b322
  - [net-next,2/4] tcp: optimize inet_use_bhash2_on_bind()
    https://git.kernel.org/netdev/net-next/c/ca79d80b0b9f
  - [net-next,3/4] tcp: add RCU management to inet_bind_bucket
    https://git.kernel.org/netdev/net-next/c/d186f405fdf4
  - [net-next,4/4] tcp: use RCU lookup in __inet_hash_connect()
    https://git.kernel.org/netdev/net-next/c/86c2bc293b81

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()
  2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
  2025-03-03  1:07   ` Jason Xing
  2025-03-04  0:51   ` Kuniyuki Iwashima
@ 2025-03-10 14:03   ` kernel test robot
  2 siblings, 0 replies; 17+ messages in thread
From: kernel test robot @ 2025-03-10 14:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: oe-lkp, lkp, netdev, David S . Miller, Jakub Kicinski,
	Paolo Abeni, Neal Cardwell, Kuniyuki Iwashima, Jason Xing,
	Simon Horman, eric.dumazet, Eric Dumazet, oliver.sang



Hello,

kernel test robot noticed a 6.9% improvement of stress-ng.sockmany.ops_per_sec on:


commit: ba6c94b99d772f431fd589dd2cd606b59063557b ("[PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/tcp-use-RCU-in-__inet-6-_check_established/20250302-204711
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git f77f12010f67259bd0e1ad18877ed27c721b627a
patch link: https://lore.kernel.org/all/20250302124237.3913746-5-edumazet@google.com/
patch subject: [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect()

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: sockmany
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250310/202503102159.5f78c207-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/sockmany/stress-ng/60s

commit: 
  4f97f75a5b ("tcp: add RCU management to inet_bind_bucket")
  ba6c94b99d ("tcp: use RCU lookup in __inet_hash_connect()")

4f97f75a5bfa79ba ba6c94b99d772f431fd589dd2cd 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1742139 ± 89%     -91.6%     146373 ± 56%  numa-meminfo.node1.Unevictable
      0.61 ±  3%      +0.1        0.71 ±  3%  mpstat.cpu.all.irq%
      0.42            +0.0        0.46 ±  2%  mpstat.cpu.all.usr%
    435534 ± 89%     -91.6%      36593 ± 56%  numa-vmstat.node1.nr_unevictable
    435534 ± 89%     -91.6%      36593 ± 56%  numa-vmstat.node1.nr_zone_unevictable
   4057584            +7.0%    4340521        stress-ng.sockmany.ops
     67264            +6.9%      71933        stress-ng.sockmany.ops_per_sec
    604900           +12.3%     679404 ±  4%  perf-c2c.DRAM.local
     42998 ±  2%     -55.7%      19034 ±  3%  perf-c2c.HITM.local
     13764 ±  4%     -95.2%     663.67 ± 13%  perf-c2c.HITM.remote
     56762 ±  2%     -65.3%      19698 ±  4%  perf-c2c.HITM.total
   7422009           +13.2%    8403980 ±  2%  sched_debug.cfs_rq:/.avg_vruntime.max
    195564 ±  5%     +62.7%     318178 ± 10%  sched_debug.cfs_rq:/.avg_vruntime.stddev
      0.23 ±  7%     +25.4%       0.29 ±  4%  sched_debug.cfs_rq:/.h_nr_queued.stddev
     39935 ±  4%     +27.0%      50726 ± 29%  sched_debug.cfs_rq:/.load_avg.max
   7422009           +13.2%    8403980 ±  2%  sched_debug.cfs_rq:/.min_vruntime.max
    195564 ±  5%     +62.7%     318178 ± 10%  sched_debug.cfs_rq:/.min_vruntime.stddev
      0.23 ±  6%     +26.6%       0.29 ±  4%  sched_debug.cpu.nr_running.stddev
    387640            +5.9%     410501 ±  9%  proc-vmstat.nr_active_anon
    109911 ±  2%      +8.5%     119206 ±  2%  proc-vmstat.nr_mapped
    200627            +1.9%     204454        proc-vmstat.nr_shmem
    895041            +4.9%     939289        proc-vmstat.nr_slab_reclaimable
   2982921            +5.0%    3131084        proc-vmstat.nr_slab_unreclaimable
    387640            +5.9%     410501 ±  9%  proc-vmstat.nr_zone_active_anon
   2071760            +2.0%    2112591        proc-vmstat.numa_hit
   1839824            +2.2%    1880606        proc-vmstat.numa_local
   5905025            +5.2%    6210697        proc-vmstat.pgalloc_normal
   5291411 ± 12%     +11.9%    5921072        proc-vmstat.pgfree
      0.82 ± 13%     -29.0%       0.58 ±  6%  perf-sched.sch_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      4.50 ± 16%     +29.5%       5.83 ± 15%  perf-sched.sch_delay.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.vfs_write.ksys_write
      0.03 ± 56%     -88.8%       0.00 ±223%  perf-sched.sch_delay.max.ms.__cond_resched.stop_one_cpu.migrate_task_to.task_numa_migrate.isra
      0.07 ±125%   +3754.0%       2.67 ± 71%  perf-sched.sch_delay.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
     19.83           -22.3%      15.41        perf-sched.total_wait_and_delay.average.ms
    177991           +32.7%     236147        perf-sched.total_wait_and_delay.count.ms
     19.76           -22.3%      15.35        perf-sched.total_wait_time.average.ms
      1.64 ± 12%     -28.9%       1.17 ±  6%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
     13.69           -26.2%      10.10        perf-sched.wait_and_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      6844           +11.8%       7651 ±  3%  perf-sched.wait_and_delay.count.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
     78701           +33.6%     105168        perf-sched.wait_and_delay.count.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
     81026           +35.2%     109539        perf-sched.wait_and_delay.count.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      2268 ± 14%     +90.6%       4325 ±  6%  perf-sched.wait_and_delay.count.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.82 ± 12%     -28.6%       0.59 ±  6%  perf-sched.wait_time.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
     13.49           -26.5%       9.91        perf-sched.wait_time.avg.ms.__cond_resched.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
      3.05 ±  3%     +16.5%       3.55 ±  3%  perf-sched.wait_time.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
     30.10 ± 20%     -64.4%      10.72 ±113%  perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
      1.14 ±  9%     +22.2%       1.40 ±  7%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
     13.67           -26.3%      10.08        perf-sched.wait_time.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      7.36 ± 57%    +103.9%      15.01 ± 27%  perf-sched.wait_time.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.03 ± 56%     -88.8%       0.00 ±223%  perf-sched.wait_time.max.ms.__cond_resched.stop_one_cpu.migrate_task_to.task_numa_migrate.isra
      0.07 ±125%    +4e+05%     275.31 ±115%  perf-sched.wait_time.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
     35.70           +15.3%      41.18        perf-stat.i.MPKI
 1.368e+10            +4.6%  1.431e+10        perf-stat.i.branch-instructions
      2.15            +0.1        2.27        perf-stat.i.branch-miss-rate%
 2.884e+08           +10.7%  3.192e+08        perf-stat.i.branch-misses
     71.62            +5.5       77.09        perf-stat.i.cache-miss-rate%
 2.377e+09           +26.3%  3.003e+09        perf-stat.i.cache-misses
 3.264e+09           +17.4%  3.832e+09        perf-stat.i.cache-references
      9.40            -8.1%       8.64        perf-stat.i.cpi
    292.27           -18.0%     239.70        perf-stat.i.cycles-between-cache-misses
 6.963e+10            +9.8%  7.645e+10        perf-stat.i.instructions
      0.12 ±  2%      +7.3%       0.13        perf-stat.i.ipc
     34.12           +15.0%      39.25        perf-stat.overall.MPKI
      2.11            +0.1        2.23        perf-stat.overall.branch-miss-rate%
     72.81            +5.5       78.36        perf-stat.overall.cache-miss-rate%
      9.07            -8.4%       8.31        perf-stat.overall.cpi
    265.92           -20.4%     211.72        perf-stat.overall.cycles-between-cache-misses
      0.11            +9.2%       0.12        perf-stat.overall.ipc
 1.345e+10            +4.6%  1.408e+10        perf-stat.ps.branch-instructions
 2.835e+08           +10.7%  3.139e+08        perf-stat.ps.branch-misses
 2.337e+09           +26.3%  2.952e+09        perf-stat.ps.cache-misses
 3.209e+09           +17.4%  3.768e+09        perf-stat.ps.cache-references
 6.849e+10            +9.8%  7.521e+10        perf-stat.ps.instructions
 4.236e+12            +9.1%  4.621e+12        perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 17+ messages
2025-03-02 12:42 [PATCH net-next 0/4] tcp: scale connect() under pressure Eric Dumazet
2025-03-02 12:42 ` [PATCH net-next 1/4] tcp: use RCU in __inet{6}_check_established() Eric Dumazet
2025-03-03  0:24   ` Jason Xing
2025-03-04  0:20   ` Kuniyuki Iwashima
2025-03-02 12:42 ` [PATCH net-next 2/4] tcp: optimize inet_use_bhash2_on_bind() Eric Dumazet
2025-03-03  0:24   ` Jason Xing
2025-03-04  0:22   ` Kuniyuki Iwashima
2025-03-02 12:42 ` [PATCH net-next 3/4] tcp: add RCU management to inet_bind_bucket Eric Dumazet
2025-03-03  0:57   ` Jason Xing
2025-03-04  0:43   ` Kuniyuki Iwashima
2025-03-02 12:42 ` [PATCH net-next 4/4] tcp: use RCU lookup in __inet_hash_connect() Eric Dumazet
2025-03-03  1:07   ` Jason Xing
2025-03-03 10:25     ` Eric Dumazet
2025-03-03 10:39       ` Jason Xing
2025-03-04  0:51   ` Kuniyuki Iwashima
2025-03-10 14:03   ` kernel test robot
2025-03-05  2:00 ` [PATCH net-next 0/4] tcp: scale connect() under pressure patchwork-bot+netdevbpf
