netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/2] tcp: even faster connect() under stress
@ 2025-03-05  3:45 Eric Dumazet
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Eric Dumazet @ 2025-03-05  3:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Neal Cardwell, Kuniyuki Iwashima, Jason Xing, Simon Horman,
	netdev, eric.dumazet, Eric Dumazet

This is a follow-up to the prior series, "tcp: scale connect() under pressure"

Now that spinlocks are no longer in the picture, we see a very high cost
of the inet6_ehashfn() function.

In this series (of 2), I change how lport contributes to inet6_ehashfn()
to ensure better cache locality and call inet6_ehashfn()
only once per connect() system call.

This brings an additional 229 % increase of performance
for "neper/tcp_crr -6 -T 200 -F 30000" stress test,
while greatly improving latency metrics.

Before:
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80

After:
  latency_min=0.003041375
  latency_max=7.056589232
  latency_mean=0.141075048
  latency_stddev=0.526900516
  num_samples=312996
  throughput=320677.21

Eric Dumazet (2):
  inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  inet: call inet6_ehashfn() once from inet6_hash_connect()

 include/net/inet_hashtables.h |  4 +++-
 include/net/ip.h              |  2 +-
 net/ipv4/inet_hashtables.c    | 30 ++++++++++++++++++++----------
 net/ipv6/inet6_hashtables.c   | 19 +++++++++++++------
 4 files changed, 37 insertions(+), 18 deletions(-)

-- 
2.48.1.711.g2feabab25a-goog



* [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-05  3:45 [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
@ 2025-03-05  3:45 ` Eric Dumazet
  2025-03-06  4:24   ` Kuniyuki Iwashima
                     ` (2 more replies)
  2025-03-05  3:45 ` [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect() Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 12+ messages in thread
From: Eric Dumazet @ 2025-03-05  3:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Neal Cardwell, Kuniyuki Iwashima, Jason Xing, Simon Horman,
	netdev, eric.dumazet, Eric Dumazet

In order to speed up __inet_hash_connect(), we want to ensure hash values
for <source address, port X, destination address, destination port>
are not randomly spread, but monotonically increasing.

Goal is to allow __inet_hash_connect() to derive the hash value
of a candidate 4-tuple with a single addition in the following
patch in the series.

Given :
  hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
  hash_sport = inet_ehashfn(saddr, sport, daddr, dport)

Then (hash_sport == hash_0 + sport) for all sport values.

As far as I know, there is no security implication with this change.

After this patch, when __inet_hash_connect() has to try XXXX candidates,
the hash table buckets are contiguous and packed, allowing
a better use of cpu caches and hardware prefetchers.
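
For illustration only, here is a minimal stand-alone user-space sketch of
the property established here. mix3() is a made-up stand-in for the keyed
hash of the three other tuple members (what __inet_ehashfn() computes when
lport == 0); it is not the kernel function.

#include <assert.h>
#include <stdint.h>

static uint32_t mix3(uint32_t saddr, uint32_t daddr, uint16_t dport)
{
	uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u ^ dport;

	return h ^ (h >> 16);
}

/* After this patch, sport only contributes a plain addition. */
static uint32_t ehashfn(uint32_t saddr, uint16_t sport,
			uint32_t daddr, uint16_t dport)
{
	return (uint32_t)sport + mix3(saddr, daddr, dport);
}

int main(void)
{
	uint32_t hash_0 = ehashfn(0x0a000001, 0, 0x0a000002, 443);
	uint16_t sport;

	/* hash_sport == hash_0 + sport for every candidate source port */
	for (sport = 1024; sport < 2000; sport++)
		assert(ehashfn(0x0a000001, sport, 0x0a000002, 443) ==
		       hash_0 + sport);
	return 0;
}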

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

After this patch:

  utime_start=0.286131
  utime_end=4.378886
  stime_start=11.952556
  stime_end=1991.655533
  num_transactions=1446830
  latency_min=0.001061085
  latency_max=12.075275028
  latency_mean=0.376375302
  latency_stddev=1.361969596
  num_samples=306383
  throughput=151866.56

perf top:

 50.01%  [kernel]       [k] __inet6_check_established
 20.65%  [kernel]       [k] __inet_hash_connect
 15.81%  [kernel]       [k] inet6_ehashfn
  2.92%  [kernel]       [k] rcu_all_qs
  2.34%  [kernel]       [k] __cond_resched
  0.50%  [kernel]       [k] _raw_spin_lock
  0.34%  [kernel]       [k] sched_balance_trigger
  0.24%  [kernel]       [k] queued_spin_lock_slowpath

There is indeed an increase of throughput and reduction of latency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/inet_hashtables.c  | 4 ++--
 net/ipv6/inet6_hashtables.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d1b5f45ee718410fdf3e78c113c7ebd4a1ddba40..3025d2b708852acd9744709a897fca17564523d5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -35,8 +35,8 @@ u32 inet_ehashfn(const struct net *net, const __be32 laddr,
 {
 	net_get_random_once(&inet_ehash_secret, sizeof(inet_ehash_secret));
 
-	return __inet_ehashfn(laddr, lport, faddr, fport,
-			      inet_ehash_secret + net_hash_mix(net));
+	return lport + __inet_ehashfn(laddr, 0, faddr, fport,
+				      inet_ehash_secret + net_hash_mix(net));
 }
 EXPORT_SYMBOL_GPL(inet_ehashfn);
 
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 9be315496459fcb391123a07ac887e2f59d27360..3d95f1e75a118ff8027d4ec0f33910d23b6af832 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -35,8 +35,8 @@ u32 inet6_ehashfn(const struct net *net,
 	lhash = (__force u32)laddr->s6_addr32[3];
 	fhash = __ipv6_addr_jhash(faddr, tcp_ipv6_hash_secret);
 
-	return __inet6_ehashfn(lhash, lport, fhash, fport,
-			       inet6_ehash_secret + net_hash_mix(net));
+	return lport + __inet6_ehashfn(lhash, 0, fhash, fport,
+				       inet6_ehash_secret + net_hash_mix(net));
 }
 EXPORT_SYMBOL_GPL(inet6_ehashfn);
 
-- 
2.48.1.711.g2feabab25a-goog



* [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect()
  2025-03-05  3:45 [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
@ 2025-03-05  3:45 ` Eric Dumazet
  2025-03-06  4:26   ` Kuniyuki Iwashima
  2025-03-06  8:22   ` Jason Xing
  2025-03-05  4:01 ` [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
  2025-03-06 23:40 ` patchwork-bot+netdevbpf
  3 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2025-03-05  3:45 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Neal Cardwell, Kuniyuki Iwashima, Jason Xing, Simon Horman,
	netdev, eric.dumazet, Eric Dumazet

inet6_ehashfn() being called from __inet6_check_established()
has a big impact on performance, as shown in the Tested section.

After prior patch, we can compute the hash for port 0
from inet6_hash_connect(), and derive each hash in
__inet_hash_connect() from this initial hash:

hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport

Apply the same principle for __inet_check_established(),
although inet_ehashfn() has a smaller cost.
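
For illustration only, a minimal user-space sketch of the new calling
pattern. ehashfn_port0() is a made-up stand-in for
inet6_ehashfn(net, saddr, 0, daddr, dport); in the real code this is
computed once in inet6_hash_connect() and hash_port0 + port is passed
down to check_established().

#include <stdint.h>
#include <stdio.h>

#define EHASH_MASK 0xffffu	/* stand-in for hinfo->ehash_mask */

/* made-up stand-in for inet6_ehashfn(net, saddr, 0, daddr, dport) */
static uint32_t ehashfn_port0(uint32_t saddr, uint32_t daddr, uint16_t dport)
{
	uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u ^ dport;

	return h ^ (h >> 16);
}

int main(void)
{
	/* computed once per connect(), before the source port search */
	uint32_t hash_port0 = ehashfn_port0(0x0a000001, 0x0a000002, 443);
	uint16_t port;

	/* each candidate port derives its hash with a single addition,
	 * so consecutive ports land in consecutive ehash buckets
	 */
	for (port = 32768; port < 32773; port++)
		printf("port %u -> hash %08x -> bucket %u\n",
		       port, hash_port0 + port,
		       (hash_port0 + port) & EHASH_MASK);
	return 0;
}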

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

  utime_start=0.286131
  utime_end=4.378886
  stime_start=11.952556
  stime_end=1991.655533
  num_transactions=1446830
  latency_min=0.001061085
  latency_max=12.075275028
  latency_mean=0.376375302
  latency_stddev=1.361969596
  num_samples=306383
  throughput=151866.56

perf top:

 50.01%  [kernel]       [k] __inet6_check_established
 20.65%  [kernel]       [k] __inet_hash_connect
 15.81%  [kernel]       [k] inet6_ehashfn
  2.92%  [kernel]       [k] rcu_all_qs
  2.34%  [kernel]       [k] __cond_resched
  0.50%  [kernel]       [k] _raw_spin_lock
  0.34%  [kernel]       [k] sched_balance_trigger
  0.24%  [kernel]       [k] queued_spin_lock_slowpath

After this patch:

  utime_start=0.315047
  utime_end=9.257617
  stime_start=7.041489
  stime_end=1923.688387
  num_transactions=3057968
  latency_min=0.003041375
  latency_max=7.056589232
  latency_mean=0.141075048    # Better latency metrics
  latency_stddev=0.526900516
  num_samples=312996
  throughput=320677.21        # 111 % increase, and 229 % for the series

perf top: inet6_ehashfn is no longer seen.

 39.67%  [kernel]       [k] __inet_hash_connect
 37.06%  [kernel]       [k] __inet6_check_established
  4.79%  [kernel]       [k] rcu_all_qs
  3.82%  [kernel]       [k] __cond_resched
  1.76%  [kernel]       [k] sched_balance_domains
  0.82%  [kernel]       [k] _raw_spin_lock
  0.81%  [kernel]       [k] sched_balance_rq
  0.81%  [kernel]       [k] sched_balance_trigger
  0.76%  [kernel]       [k] queued_spin_lock_slowpath

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/inet_hashtables.h |  4 +++-
 include/net/ip.h              |  2 +-
 net/ipv4/inet_hashtables.c    | 26 ++++++++++++++++++--------
 net/ipv6/inet6_hashtables.c   | 15 +++++++++++----
 4 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index f447d61d95982090aac492b31e4199534970c4fb..949641e925398f741f2d4dda5898efc683b305dc 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -527,10 +527,12 @@ static inline void sk_rcv_saddr_set(struct sock *sk, __be32 addr)
 
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			struct sock *sk, u64 port_offset,
+			u32 hash_port0,
 			int (*check_established)(struct inet_timewait_death_row *,
 						 struct sock *, __u16,
 						 struct inet_timewait_sock **,
-						 bool rcu_lookup));
+						 bool rcu_lookup,
+						 u32 hash));
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk);
diff --git a/include/net/ip.h b/include/net/ip.h
index ce5e59957dd553697536ddf111bb1406d9d99408..8a48ade24620b4c8e2ebb4726f27a69aac7138b0 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -357,7 +357,7 @@ static inline void inet_get_local_port_range(const struct net *net, int *low, in
 bool inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high);
 
 #ifdef CONFIG_SYSCTL
-static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port)
+static inline bool inet_is_local_reserved_port(const struct net *net, unsigned short port)
 {
 	if (!net->ipv4.sysctl_local_reserved_ports)
 		return false;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 3025d2b708852acd9744709a897fca17564523d5..1e3a9573c19834cc96d0b4cbf816f86433134450 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -538,7 +538,8 @@ EXPORT_SYMBOL_GPL(__inet_lookup_established);
 static int __inet_check_established(struct inet_timewait_death_row *death_row,
 				    struct sock *sk, __u16 lport,
 				    struct inet_timewait_sock **twp,
-				    bool rcu_lookup)
+				    bool rcu_lookup,
+				    u32 hash)
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_sock *inet = inet_sk(sk);
@@ -549,8 +550,6 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	int sdif = l3mdev_master_ifindex_by_index(net, dif);
 	INET_ADDR_COOKIE(acookie, saddr, daddr);
 	const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
-	unsigned int hash = inet_ehashfn(net, daddr, lport,
-					 saddr, inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
 	struct inet_timewait_sock *tw = NULL;
 	const struct hlist_nulls_node *node;
@@ -1007,9 +1006,10 @@ static u32 *table_perturb;
 
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		struct sock *sk, u64 port_offset,
+		u32 hash_port0,
 		int (*check_established)(struct inet_timewait_death_row *,
 			struct sock *, __u16, struct inet_timewait_sock **,
-			bool rcu_lookup))
+			bool rcu_lookup, u32 hash))
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_bind_hashbucket *head, *head2;
@@ -1027,7 +1027,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	if (port) {
 		local_bh_disable();
-		ret = check_established(death_row, sk, port, NULL, false);
+		ret = check_established(death_row, sk, port, NULL, false,
+					hash_port0 + port);
 		local_bh_enable();
 		return ret;
 	}
@@ -1071,7 +1072,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 				rcu_read_unlock();
 				goto next_port;
 			}
-			if (!check_established(death_row, sk, port, &tw, true))
+			if (!check_established(death_row, sk, port, &tw, true,
+					       hash_port0 + port))
 				break;
 			rcu_read_unlock();
 			goto next_port;
@@ -1090,7 +1092,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 					goto next_port_unlock;
 				WARN_ON(hlist_empty(&tb->bhash2));
 				if (!check_established(death_row, sk,
-						       port, &tw, false))
+						       port, &tw, false,
+						       hash_port0 + port))
 					goto ok;
 				goto next_port_unlock;
 			}
@@ -1197,11 +1200,18 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk)
 {
+	const struct inet_sock *inet = inet_sk(sk);
+	const struct net *net = sock_net(sk);
 	u64 port_offset = 0;
+	u32 hash_port0;
 
 	if (!inet_sk(sk)->inet_num)
 		port_offset = inet_sk_port_offset(sk);
-	return __inet_hash_connect(death_row, sk, port_offset,
+
+	hash_port0 = inet_ehashfn(net, inet->inet_rcv_saddr, 0,
+				  inet->inet_daddr, inet->inet_dport);
+
+	return __inet_hash_connect(death_row, sk, port_offset, hash_port0,
 				   __inet_check_established);
 }
 EXPORT_SYMBOL_GPL(inet_hash_connect);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3d95f1e75a118ff8027d4ec0f33910d23b6af832..76ee521189eb77c48845eeeac9d50b3a93a250a6 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -264,7 +264,8 @@ EXPORT_SYMBOL_GPL(inet6_lookup);
 static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 				     struct sock *sk, const __u16 lport,
 				     struct inet_timewait_sock **twp,
-				     bool rcu_lookup)
+				     bool rcu_lookup,
+				     u32 hash)
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_sock *inet = inet_sk(sk);
@@ -274,8 +275,6 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 	struct net *net = sock_net(sk);
 	const int sdif = l3mdev_master_ifindex_by_index(net, dif);
 	const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
-	const unsigned int hash = inet6_ehashfn(net, daddr, lport, saddr,
-						inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
 	struct inet_timewait_sock *tw = NULL;
 	const struct hlist_nulls_node *node;
@@ -354,11 +353,19 @@ static u64 inet6_sk_port_offset(const struct sock *sk)
 int inet6_hash_connect(struct inet_timewait_death_row *death_row,
 		       struct sock *sk)
 {
+	const struct in6_addr *daddr = &sk->sk_v6_rcv_saddr;
+	const struct in6_addr *saddr = &sk->sk_v6_daddr;
+	const struct inet_sock *inet = inet_sk(sk);
+	const struct net *net = sock_net(sk);
 	u64 port_offset = 0;
+	u32 hash_port0;
 
 	if (!inet_sk(sk)->inet_num)
 		port_offset = inet6_sk_port_offset(sk);
-	return __inet_hash_connect(death_row, sk, port_offset,
+
+	hash_port0 = inet6_ehashfn(net, daddr, 0, saddr, inet->inet_dport);
+
+	return __inet_hash_connect(death_row, sk, port_offset, hash_port0,
 				   __inet6_check_established);
 }
 EXPORT_SYMBOL_GPL(inet6_hash_connect);
-- 
2.48.1.711.g2feabab25a-goog



* Re: [PATCH net-next 0/2] tcp: even faster connect() under stress
  2025-03-05  3:45 [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
  2025-03-05  3:45 ` [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect() Eric Dumazet
@ 2025-03-05  4:01 ` Eric Dumazet
  2025-03-06 23:40 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2025-03-05  4:01 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Neal Cardwell, Kuniyuki Iwashima, Jason Xing, Simon Horman,
	netdev, eric.dumazet

On Wed, Mar 5, 2025 at 4:45 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This is a follow-up to the prior series, "tcp: scale connect() under pressure"
>
> Now that spinlocks are no longer in the picture, we see a very high cost
> of the inet6_ehashfn() function.
>
> In this series (of 2), I change how lport contributes to inet6_ehashfn()
> to ensure better cache locality and call inet6_ehashfn()
> only once per connect() system call.
>
> This brings an additional 229 % increase of performance

This is 129 % additional QPS (going from 139866.80 to 320677.21)

Sorry for the confusion :)


* [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
@ 2025-03-06  4:24   ` Kuniyuki Iwashima
  2025-03-06  7:54   ` Jason Xing
  2025-03-17 13:44   ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-06  4:24 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kernelxing, kuba, kuniyu, ncardwell,
	netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  5 Mar 2025 03:45:49 +0000
> In order to speed up __inet_hash_connect(), we want to ensure hash values
> for <source address, port X, destination address, destination port>
> are not randomly spread, but monotonically increasing.
> 
> Goal is to allow __inet_hash_connect() to derive the hash value
> of a candidate 4-tuple with a single addition in the following
> patch in the series.
> 
> Given :
>   hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
>   hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
> 
> Then (hash_sport == hash_0 + sport) for all sport values.
> 
> As far as I know, there is no security implication with this change.
> 
> After this patch, when __inet_hash_connect() has to try XXXX candidates,
> the hash table buckets are contiguous and packed, allowing
> a better use of cpu caches and hardware prefetchers.
> 
> Tested:
> 
> Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> 
> Before this patch:
> 
>   utime_start=0.271607
>   utime_end=3.847111
>   stime_start=18.407684
>   stime_end=1997.485557
>   num_transactions=1350742
>   latency_min=0.014131929
>   latency_max=17.895073144
>   latency_mean=0.505675853
>   latency_stddev=2.125164772
>   num_samples=307884
>   throughput=139866.80
> 
> perf top on client:
> 
>  56.86%  [kernel]       [k] __inet6_check_established
>  17.96%  [kernel]       [k] __inet_hash_connect
>  13.88%  [kernel]       [k] inet6_ehashfn
>   2.52%  [kernel]       [k] rcu_all_qs
>   2.01%  [kernel]       [k] __cond_resched
>   0.41%  [kernel]       [k] _raw_spin_lock
> 
> After this patch:
> 
>   utime_start=0.286131
>   utime_end=4.378886
>   stime_start=11.952556
>   stime_end=1991.655533
>   num_transactions=1446830
>   latency_min=0.001061085
>   latency_max=12.075275028
>   latency_mean=0.376375302
>   latency_stddev=1.361969596
>   num_samples=306383
>   throughput=151866.56
> 
> perf top:
> 
>  50.01%  [kernel]       [k] __inet6_check_established
>  20.65%  [kernel]       [k] __inet_hash_connect
>  15.81%  [kernel]       [k] inet6_ehashfn
>   2.92%  [kernel]       [k] rcu_all_qs
>   2.34%  [kernel]       [k] __cond_resched
>   0.50%  [kernel]       [k] _raw_spin_lock
>   0.34%  [kernel]       [k] sched_balance_trigger
>   0.24%  [kernel]       [k] queued_spin_lock_slowpath
> 
> There is indeed an increase of throughput and reduction of latency.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>


* Re: [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect()
  2025-03-05  3:45 ` [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect() Eric Dumazet
@ 2025-03-06  4:26   ` Kuniyuki Iwashima
  2025-03-06  8:22   ` Jason Xing
  1 sibling, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-03-06  4:26 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kernelxing, kuba, kuniyu, ncardwell,
	netdev, pabeni

From: Eric Dumazet <edumazet@google.com>
Date: Wed,  5 Mar 2025 03:45:50 +0000
> inet6_ehashfn() being called from __inet6_check_established()
> has a big impact on performance, as shown in the Tested section.
> 
> After prior patch, we can compute the hash for port 0
> from inet6_hash_connect(), and derive each hash in
> __inet_hash_connect() from this initial hash:
> 
> hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport
> 
> Apply the same principle for __inet_check_established(),
> although inet_ehashfn() has a smaller cost.
> 
> Tested:
> 
> Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> 
> Before this patch:
> 
>   utime_start=0.286131
>   utime_end=4.378886
>   stime_start=11.952556
>   stime_end=1991.655533
>   num_transactions=1446830
>   latency_min=0.001061085
>   latency_max=12.075275028
>   latency_mean=0.376375302
>   latency_stddev=1.361969596
>   num_samples=306383
>   throughput=151866.56
> 
> perf top:
> 
>  50.01%  [kernel]       [k] __inet6_check_established
>  20.65%  [kernel]       [k] __inet_hash_connect
>  15.81%  [kernel]       [k] inet6_ehashfn
>   2.92%  [kernel]       [k] rcu_all_qs
>   2.34%  [kernel]       [k] __cond_resched
>   0.50%  [kernel]       [k] _raw_spin_lock
>   0.34%  [kernel]       [k] sched_balance_trigger
>   0.24%  [kernel]       [k] queued_spin_lock_slowpath
> 
> After this patch:
> 
>   utime_start=0.315047
>   utime_end=9.257617
>   stime_start=7.041489
>   stime_end=1923.688387
>   num_transactions=3057968
>   latency_min=0.003041375
>   latency_max=7.056589232
>   latency_mean=0.141075048    # Better latency metrics
>   latency_stddev=0.526900516
>   num_samples=312996
>   throughput=320677.21        # 111 % increase, and 229 % for the series
> 
> perf top: inet6_ehashfn is no longer seen.
> 
>  39.67%  [kernel]       [k] __inet_hash_connect
>  37.06%  [kernel]       [k] __inet6_check_established
>   4.79%  [kernel]       [k] rcu_all_qs
>   3.82%  [kernel]       [k] __cond_resched
>   1.76%  [kernel]       [k] sched_balance_domains
>   0.82%  [kernel]       [k] _raw_spin_lock
>   0.81%  [kernel]       [k] sched_balance_rq
>   0.81%  [kernel]       [k] sched_balance_trigger
>   0.76%  [kernel]       [k] queued_spin_lock_slowpath
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Interesting optimisation, thanks!


* Re: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
  2025-03-06  4:24   ` Kuniyuki Iwashima
@ 2025-03-06  7:54   ` Jason Xing
  2025-03-06  8:14     ` Eric Dumazet
  2025-03-17 13:44   ` kernel test robot
  2 siblings, 1 reply; 12+ messages in thread
From: Jason Xing @ 2025-03-06  7:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet

On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> In order to speed up __inet_hash_connect(), we want to ensure hash values
> for <source address, port X, destination address, destination port>
> are not randomly spread, but monotonically increasing.
>
> Goal is to allow __inet_hash_connect() to derive the hash value
> of a candidate 4-tuple with a single addition in the following
> patch in the series.
>
> Given :
>   hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
>   hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
>
> Then (hash_sport == hash_0 + sport) for all sport values.
>
> As far as I know, there is no security implication with this change.

Good to know this. The moment I read the first paragraph, I was
thinking if it might bring potential risk.

Sorry that I hesitate to bring up one question: could this new
algorithm result in sockets concentrating into several buckets instead
of being sufficiently dispersed like before. Well good news is that I
tested other cases like TCP_CRR and saw no degradation in performance.
But they didn't cover establishing from one client to many different
servers cases.

>
> After this patch, when __inet_hash_connect() has to try XXXX candidates,
> the hash table buckets are contiguous and packed, allowing
> a better use of cpu caches and hardware prefetchers.
>
> Tested:
>
> Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
>
> Before this patch:
>
>   utime_start=0.271607
>   utime_end=3.847111
>   stime_start=18.407684
>   stime_end=1997.485557
>   num_transactions=1350742
>   latency_min=0.014131929
>   latency_max=17.895073144
>   latency_mean=0.505675853
>   latency_stddev=2.125164772
>   num_samples=307884
>   throughput=139866.80
>
> perf top on client:
>
>  56.86%  [kernel]       [k] __inet6_check_established
>  17.96%  [kernel]       [k] __inet_hash_connect
>  13.88%  [kernel]       [k] inet6_ehashfn
>   2.52%  [kernel]       [k] rcu_all_qs
>   2.01%  [kernel]       [k] __cond_resched
>   0.41%  [kernel]       [k] _raw_spin_lock
>
> After this patch:
>
>   utime_start=0.286131
>   utime_end=4.378886
>   stime_start=11.952556
>   stime_end=1991.655533
>   num_transactions=1446830
>   latency_min=0.001061085
>   latency_max=12.075275028
>   latency_mean=0.376375302
>   latency_stddev=1.361969596
>   num_samples=306383
>   throughput=151866.56
>
> perf top:
>
>  50.01%  [kernel]       [k] __inet6_check_established
>  20.65%  [kernel]       [k] __inet_hash_connect
>  15.81%  [kernel]       [k] inet6_ehashfn
>   2.92%  [kernel]       [k] rcu_all_qs
>   2.34%  [kernel]       [k] __cond_resched
>   0.50%  [kernel]       [k] _raw_spin_lock
>   0.34%  [kernel]       [k] sched_balance_trigger
>   0.24%  [kernel]       [k] queued_spin_lock_slowpath
>
> There is indeed an increase of throughput and reduction of latency.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Throughput goes from 12829 to 26072. The percentage increase - 103% -
is alluring to me!

Thanks,
Jason


* Re: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-06  7:54   ` Jason Xing
@ 2025-03-06  8:14     ` Eric Dumazet
  2025-03-06  8:19       ` Jason Xing
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2025-03-06  8:14 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet

On Thu, Mar 6, 2025 at 8:54 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > In order to speed up __inet_hash_connect(), we want to ensure hash values
> > for <source address, port X, destination address, destination port>
> > are not randomly spread, but monotonically increasing.
> >
> > Goal is to allow __inet_hash_connect() to derive the hash value
> > of a candidate 4-tuple with a single addition in the following
> > patch in the series.
> >
> > Given :
> >   hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
> >   hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
> >
> > Then (hash_sport == hash_0 + sport) for all sport values.
> >
> > As far as I know, there is no security implication with this change.
>
> Good to know this. The moment I read the first paragraph, I was
> thinking if it might bring potential risk.
>
> Sorry that I hesitate to bring up one question: could this new
> algorithm result in sockets concentrating into several buckets instead
> of being sufficiently dispersed like before.

As I said, I see no difference for servers, since their sport is a fixed value.

What matters for them is the hash contribution of the remote address and port,
because the server port is usually well known.

This change does not change the hash distribution, an attacker will not be able
to target a particular bucket.
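
For illustration only, a small user-space sketch of that point; mix3() and
the constants are made up, not the kernel hash. With a fixed local port,
adding it after the hash merely rotates the bucket indices by a constant,
so the spread of remote peers across buckets is unchanged.

#include <stdint.h>
#include <stdio.h>

#define BUCKETS 16

/* made-up stand-in for the keyed hash of (laddr, faddr, fport) */
static uint32_t mix3(uint32_t laddr, uint32_t faddr, uint16_t fport)
{
	uint32_t h = laddr * 2654435761u ^ faddr * 2246822519u ^ fport;

	return h ^ (h >> 16);
}

int main(void)
{
	unsigned int base[BUCKETS] = { 0 }, shifted[BUCKETS] = { 0 };
	uint16_t lport = 443;	/* fixed, well-known server port */
	uint32_t faddr;
	int i;

	for (faddr = 0; faddr < 100000; faddr++) {
		uint32_t h = mix3(0x0a000001, 0x0b000000 + faddr, 40000);

		base[h % BUCKETS]++;		  /* without the lport term */
		shifted[(h + lport) % BUCKETS]++; /* with the constant offset */
	}
	/* shifted[] is base[] rotated by lport % BUCKETS: same distribution */
	for (i = 0; i < BUCKETS; i++)
		printf("bucket %2d: base=%u shifted=%u\n", i, base[i], shifted[i]);
	return 0;
}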

> Well good news is that I
> tested other cases like TCP_CRR and saw no degradation in performance.
> But they didn't cover establishing from one client to many different
> servers cases.
>
> >
> > After this patch, when __inet_hash_connect() has to try XXXX candidates,
> > the hash table buckets are contiguous and packed, allowing
> > a better use of cpu caches and hardware prefetchers.
> >
> > Tested:
> >
> > Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> > Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> >
> > Before this patch:
> >
> >   utime_start=0.271607
> >   utime_end=3.847111
> >   stime_start=18.407684
> >   stime_end=1997.485557
> >   num_transactions=1350742
> >   latency_min=0.014131929
> >   latency_max=17.895073144
> >   latency_mean=0.505675853
> >   latency_stddev=2.125164772
> >   num_samples=307884
> >   throughput=139866.80
> >
> > perf top on client:
> >
> >  56.86%  [kernel]       [k] __inet6_check_established
> >  17.96%  [kernel]       [k] __inet_hash_connect
> >  13.88%  [kernel]       [k] inet6_ehashfn
> >   2.52%  [kernel]       [k] rcu_all_qs
> >   2.01%  [kernel]       [k] __cond_resched
> >   0.41%  [kernel]       [k] _raw_spin_lock
> >
> > After this patch:
> >
> >   utime_start=0.286131
> >   utime_end=4.378886
> >   stime_start=11.952556
> >   stime_end=1991.655533
> >   num_transactions=1446830
> >   latency_min=0.001061085
> >   latency_max=12.075275028
> >   latency_mean=0.376375302
> >   latency_stddev=1.361969596
> >   num_samples=306383
> >   throughput=151866.56
> >
> > perf top:
> >
> >  50.01%  [kernel]       [k] __inet6_check_established
> >  20.65%  [kernel]       [k] __inet_hash_connect
> >  15.81%  [kernel]       [k] inet6_ehashfn
> >   2.92%  [kernel]       [k] rcu_all_qs
> >   2.34%  [kernel]       [k] __cond_resched
> >   0.50%  [kernel]       [k] _raw_spin_lock
> >   0.34%  [kernel]       [k] sched_balance_trigger
> >   0.24%  [kernel]       [k] queued_spin_lock_slowpath
> >
> > There is indeed an increase of throughput and reduction of latency.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Throughput goes from 12829 to 26072. The percentage increase - 103% -
> is alluring to me!
>
> Thanks,
> Jason


* Re: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-06  8:14     ` Eric Dumazet
@ 2025-03-06  8:19       ` Jason Xing
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-03-06  8:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet

On Thu, Mar 6, 2025 at 4:14 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 8:54 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > In order to speed up __inet_hash_connect(), we want to ensure hash values
> > > for <source address, port X, destination address, destination port>
> > > are not randomly spread, but monotonically increasing.
> > >
> > > Goal is to allow __inet_hash_connect() to derive the hash value
> > > of a candidate 4-tuple with a single addition in the following
> > > patch in the series.
> > >
> > > Given :
> > >   hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
> > >   hash_sport = inet_ehashfn(saddr, sport, daddr, dport)
> > >
> > > Then (hash_sport == hash_0 + sport) for all sport values.
> > >
> > > As far as I know, there is no security implication with this change.
> >
> > Good to know this. The moment I read the first paragraph, I was
> > thinking if it might bring potential risk.
> >
> > Sorry that I hesitate to bring up one question: could this new
> > algorithm result in sockets concentrating into several buckets instead
> > of being sufficiently dispersed like before.
>
> As I said, I see no difference for servers, since their sport is a fixed value.
>
> What matters for them is the hash contribution of the remote address and port,
> because the server port is usually well known.
>
> This change does not change the hash distribution, an attacker will not be able
> to target a particular bucket.

Point taken. Thank you very much for the explanation.

Thanks,
Jason

>
> > Well good news is that I
> > tested other cases like TCP_CRR and saw no degradation in performance.
> > But they didn't cover establishing from one client to many different
> > servers cases.
> >
> > >
> > > After this patch, when __inet_hash_connect() has to try XXXX candidates,
> > > the hash table buckets are contiguous and packed, allowing
> > > a better use of cpu caches and hardware prefetchers.
> > >
> > > Tested:
> > >
> > > Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> > > Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
> > >
> > > Before this patch:
> > >
> > >   utime_start=0.271607
> > >   utime_end=3.847111
> > >   stime_start=18.407684
> > >   stime_end=1997.485557
> > >   num_transactions=1350742
> > >   latency_min=0.014131929
> > >   latency_max=17.895073144
> > >   latency_mean=0.505675853
> > >   latency_stddev=2.125164772
> > >   num_samples=307884
> > >   throughput=139866.80
> > >
> > > perf top on client:
> > >
> > >  56.86%  [kernel]       [k] __inet6_check_established
> > >  17.96%  [kernel]       [k] __inet_hash_connect
> > >  13.88%  [kernel]       [k] inet6_ehashfn
> > >   2.52%  [kernel]       [k] rcu_all_qs
> > >   2.01%  [kernel]       [k] __cond_resched
> > >   0.41%  [kernel]       [k] _raw_spin_lock
> > >
> > > After this patch:
> > >
> > >   utime_start=0.286131
> > >   utime_end=4.378886
> > >   stime_start=11.952556
> > >   stime_end=1991.655533
> > >   num_transactions=1446830
> > >   latency_min=0.001061085
> > >   latency_max=12.075275028
> > >   latency_mean=0.376375302
> > >   latency_stddev=1.361969596
> > >   num_samples=306383
> > >   throughput=151866.56
> > >
> > > perf top:
> > >
> > >  50.01%  [kernel]       [k] __inet6_check_established
> > >  20.65%  [kernel]       [k] __inet_hash_connect
> > >  15.81%  [kernel]       [k] inet6_ehashfn
> > >   2.92%  [kernel]       [k] rcu_all_qs
> > >   2.34%  [kernel]       [k] __cond_resched
> > >   0.50%  [kernel]       [k] _raw_spin_lock
> > >   0.34%  [kernel]       [k] sched_balance_trigger
> > >   0.24%  [kernel]       [k] queued_spin_lock_slowpath
> > >
> > > There is indeed an increase of throughput and reduction of latency.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > Throughput goes from 12829 to 26072. The percentage increase - 103% -
> > is alluring to me!
> >
> > Thanks,
> > Jason


* Re: [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect()
  2025-03-05  3:45 ` [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect() Eric Dumazet
  2025-03-06  4:26   ` Kuniyuki Iwashima
@ 2025-03-06  8:22   ` Jason Xing
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-03-06  8:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Kuniyuki Iwashima, Jason Xing, Simon Horman, netdev, eric.dumazet

On Wed, Mar 5, 2025 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> inet6_ehashfn() being called from __inet6_check_established()
> has a big impact on performance, as shown in the Tested section.
>
> After prior patch, we can compute the hash for port 0
> from inet6_hash_connect(), and derive each hash in
> __inet_hash_connect() from this initial hash:
>
> hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport
>
> Apply the same principle for __inet_check_established(),
> although inet_ehashfn() has a smaller cost.
>
> Tested:
>
> Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
> Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
>
> Before this patch:
>
>   utime_start=0.286131
>   utime_end=4.378886
>   stime_start=11.952556
>   stime_end=1991.655533
>   num_transactions=1446830
>   latency_min=0.001061085
>   latency_max=12.075275028
>   latency_mean=0.376375302
>   latency_stddev=1.361969596
>   num_samples=306383
>   throughput=151866.56
>
> perf top:
>
>  50.01%  [kernel]       [k] __inet6_check_established
>  20.65%  [kernel]       [k] __inet_hash_connect
>  15.81%  [kernel]       [k] inet6_ehashfn
>   2.92%  [kernel]       [k] rcu_all_qs
>   2.34%  [kernel]       [k] __cond_resched
>   0.50%  [kernel]       [k] _raw_spin_lock
>   0.34%  [kernel]       [k] sched_balance_trigger
>   0.24%  [kernel]       [k] queued_spin_lock_slowpath
>
> After this patch:
>
>   utime_start=0.315047
>   utime_end=9.257617
>   stime_start=7.041489
>   stime_end=1923.688387
>   num_transactions=3057968
>   latency_min=0.003041375
>   latency_max=7.056589232
>   latency_mean=0.141075048    # Better latency metrics
>   latency_stddev=0.526900516
>   num_samples=312996
>   throughput=320677.21        # 111 % increase, and 229 % for the series
>
> perf top: inet6_ehashfn is no longer seen.
>
>  39.67%  [kernel]       [k] __inet_hash_connect
>  37.06%  [kernel]       [k] __inet6_check_established
>   4.79%  [kernel]       [k] rcu_all_qs
>   3.82%  [kernel]       [k] __cond_resched
>   1.76%  [kernel]       [k] sched_balance_domains
>   0.82%  [kernel]       [k] _raw_spin_lock
>   0.81%  [kernel]       [k] sched_balance_rq
>   0.81%  [kernel]       [k] sched_balance_trigger
>   0.76%  [kernel]       [k] queued_spin_lock_slowpath
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thank you!

Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>


* Re: [PATCH net-next 0/2] tcp: even faster connect() under stress
  2025-03-05  3:45 [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-03-05  4:01 ` [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
@ 2025-03-06 23:40 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-03-06 23:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, ncardwell, kuniyu, kernelxing, horms, netdev,
	eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed,  5 Mar 2025 03:45:48 +0000 you wrote:
> This is a follow-up to the prior series, "tcp: scale connect() under pressure"
> 
> Now that spinlocks are no longer in the picture, we see a very high cost
> of the inet6_ehashfn() function.
> 
> In this series (of 2), I change how lport contributes to inet6_ehashfn()
> to ensure better cache locality and call inet6_ehashfn()
> only once per connect() system call.
> 
> [...]

Here is the summary with links:
  - [net-next,1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
    https://git.kernel.org/netdev/net-next/c/9544d60a2605
  - [net-next,2/2] inet: call inet6_ehashfn() once from inet6_hash_connect()
    https://git.kernel.org/netdev/net-next/c/d4438ce68bf1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
  2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
  2025-03-06  4:24   ` Kuniyuki Iwashima
  2025-03-06  7:54   ` Jason Xing
@ 2025-03-17 13:44   ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2025-03-17 13:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: oe-lkp, lkp, netdev, David S . Miller, Jakub Kicinski,
	Paolo Abeni, Neal Cardwell, Kuniyuki Iwashima, Jason Xing,
	Simon Horman, eric.dumazet, Eric Dumazet, oliver.sang



Hello,

kernel test robot noticed a 26.0% improvement of stress-ng.sockmany.ops_per_sec on:


commit: 265acc444f8a96246e9d42b54b6931d078034218 ("[PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/inet-change-lport-contribution-to-inet_ehashfn-and-inet6_ehashfn/20250305-114734
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git f252f23ab657cd224cb8334ba69966396f3f629b
patch link: https://lore.kernel.org/all/20250305034550.879255-2-edumazet@google.com/
patch subject: [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: sockmany
	cpufreq_governor: performance


In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sockmany.ops_per_sec 4.4% improvement                                  |
| test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory |
| test parameters  | cpufreq_governor=performance                                                                |
|                  | nr_threads=100%                                                                             |
|                  | test=sockmany                                                                               |
|                  | testtime=60s                                                                                |
+------------------+---------------------------------------------------------------------------------------------+




Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250317/202503171623.f2e16b60-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sockmany/stress-ng/60s

commit: 
  f252f23ab6 ("net: Prevent use after free in netif_napi_set_irq_locked()")
  265acc444f ("inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")

f252f23ab657cd22 265acc444f8a96246e9d42b54b6 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.60 ±  6%      +0.2        0.75 ±  6%  mpstat.cpu.all.soft%
    376850 ±  9%     +15.7%     436068 ±  9%  numa-numastat.node0.local_node
    376612 ±  9%     +15.8%     435968 ±  9%  numa-vmstat.node0.numa_local
     54708           +22.0%      66753 ±  2%  vmstat.system.cs
      2308         +1167.7%      29267 ± 26%  perf-c2c.HITM.local
      2499         +1078.3%      29447 ± 26%  perf-c2c.HITM.total
      1413 ±  8%     -13.8%       1218 ±  4%  sched_debug.cfs_rq:/.runnable_avg.max
     28302           +21.2%      34303 ±  2%  sched_debug.cpu.nr_switches.avg
     39625 ±  6%     +63.4%      64761 ±  6%  sched_debug.cpu.nr_switches.max
      4170 ±  9%    +126.1%       9429 ±  8%  sched_debug.cpu.nr_switches.stddev
   1606932           +25.9%    2023746 ±  3%  stress-ng.sockmany.ops
     26687           +26.0%      33624 ±  3%  stress-ng.sockmany.ops_per_sec
   1561801           +28.1%    2000939 ±  3%  stress-ng.time.involuntary_context_switches
   1731525           +22.3%    2118259 ±  2%  stress-ng.time.voluntary_context_switches
     84783            +2.6%      86953        proc-vmstat.nr_shmem
      5339 ±  6%     -26.4%       3931 ± 16%  proc-vmstat.numa_hint_faults_local
    878479            +6.8%     937819        proc-vmstat.numa_hit
    812262            +7.3%     871615        proc-vmstat.numa_local
   2550690           +12.5%    2870404        proc-vmstat.pgalloc_normal
   2407108           +13.2%    2724922        proc-vmstat.pgfree
     21.96           -17.2%      18.18 ±  2%  perf-stat.i.MPKI
 7.517e+09           +18.8%  8.933e+09        perf-stat.i.branch-instructions
      2.70            -0.7        1.96        perf-stat.i.branch-miss-rate%
  2.03e+08           -13.1%  1.765e+08        perf-stat.i.branch-misses
     60.22            -2.3       57.89 ±  2%  perf-stat.i.cache-miss-rate%
 1.472e+09            +4.7%  1.542e+09        perf-stat.i.cache-references
     56669           +22.3%      69301 ±  2%  perf-stat.i.context-switches
      5.56           -18.4%       4.53 ±  2%  perf-stat.i.cpi
  4.24e+10           +19.2%  5.054e+10        perf-stat.i.instructions
      0.20           +20.1%       0.24 ±  4%  perf-stat.i.ipc
      0.49           +21.0%       0.60 ±  8%  perf-stat.i.metric.K/sec
     21.03           -15.1%      17.85        perf-stat.overall.MPKI
      2.70            -0.7        1.98        perf-stat.overall.branch-miss-rate%
     60.56            -2.1       58.49        perf-stat.overall.cache-miss-rate%
      5.34           -16.6%       4.45        perf-stat.overall.cpi
    253.77            -1.7%     249.50        perf-stat.overall.cycles-between-cache-misses
      0.19           +19.9%       0.22        perf-stat.overall.ipc
 7.395e+09           +18.9%  8.789e+09        perf-stat.ps.branch-instructions
 1.997e+08           -13.0%  1.737e+08        perf-stat.ps.branch-misses
 1.448e+09            +4.7%  1.517e+09        perf-stat.ps.cache-references
     55820           +22.2%      68204 ±  2%  perf-stat.ps.context-switches
 4.172e+10           +19.2%  4.972e+10        perf-stat.ps.instructions
 2.556e+12           +20.2%  3.072e+12 ±  2%  perf-stat.total.instructions
      0.35 ±  9%     -14.9%       0.29 ±  6%  perf-sched.sch_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.06 ±  7%     -20.5%       0.04 ±  4%  perf-sched.sch_delay.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
      0.16 ±218%    +798.3%       1.44 ± 40%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.25 ±152%    +291.3%       0.99 ± 45%  perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.security_inode_alloc.inode_init_always_gfp.alloc_inode
      0.11 ±166%    +568.2%       0.75 ± 45%  perf-sched.sch_delay.avg.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
      0.84 ± 14%     +39.2%       1.17 ±  9%  perf-sched.sch_delay.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      0.11 ± 22%    +108.5%       0.23 ± 12%  perf-sched.sch_delay.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.08 ± 59%     -60.0%       0.03 ±  4%  perf-sched.sch_delay.max.ms.__cond_resched.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
      0.16 ±218%   +1286.4%       2.22 ± 25%  perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.13 ±153%    +910.1%       1.27 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
      9.23           -12.5%       8.08        perf-sched.total_wait_and_delay.average.ms
    139892           +15.3%     161338        perf-sched.total_wait_and_delay.count.ms
      9.18           -12.5%       8.03        perf-sched.total_wait_time.average.ms
      0.70 ±  8%     -14.5%       0.60 ±  6%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.11 ±  8%     -20.1%       0.09 ±  4%  perf-sched.wait_and_delay.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
    429.48 ± 44%     +63.6%     702.60 ± 11%  perf-sched.wait_and_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      4.97           -14.0%       4.28        perf-sched.wait_and_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      0.23 ± 21%    +104.2%       0.46 ± 12%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
     48576 ±  5%     +36.3%      66215 ±  2%  perf-sched.wait_and_delay.count.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
     81.83            +9.8%      89.83 ±  2%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     64098           +16.3%      74560        perf-sched.wait_and_delay.count.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     15531 ± 17%     -46.2%       8355 ±  6%  perf-sched.wait_and_delay.count.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.36 ±  8%     -14.2%       0.31 ±  6%  perf-sched.wait_time.avg.ms.__cond_resched.__inet_hash_connect.tcp_v4_connect.__inet_stream_connect.inet_stream_connect
      0.06 ±  7%     -20.2%       0.04 ±  4%  perf-sched.wait_time.avg.ms.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
      0.04 ±178%     -94.4%       0.00 ±130%  perf-sched.wait_time.avg.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
      0.16 ±218%    +798.5%       1.44 ± 40%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.11 ±166%    +568.6%       0.75 ± 45%  perf-sched.wait_time.avg.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect
    427.69 ± 45%     +63.1%     697.48 ± 10%  perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      4.95           -14.0%       4.26        perf-sched.wait_time.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      0.12 ± 20%     +99.9%       0.23 ± 12%  perf-sched.wait_time.avg.ms.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
      0.16 ±218%   +1286.4%       2.22 ± 25%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_empty_file.alloc_file_pseudo.sock_alloc_file
      0.13 ±153%    +911.4%       1.27 ± 34%  perf-sched.wait_time.max.ms.__cond_resched.lock_sock_nested.inet_stream_connect.__sys_connect.__x64_sys_connect


***************************************************************************************************
lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/sockmany/stress-ng/60s

commit: 
  f252f23ab6 ("net: Prevent use after free in netif_napi_set_irq_locked()")
  265acc444f ("inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()")

f252f23ab657cd22 265acc444f8a96246e9d42b54b6 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    205766            +3.2%     212279        vmstat.system.cs
    309724 ±  5%     +63.6%     506684 ±  9%  sched_debug.cfs_rq:/.avg_vruntime.stddev
    309724 ±  5%     +63.6%     506684 ±  9%  sched_debug.cfs_rq:/.min_vruntime.stddev
   1307371 ±  8%     -14.5%    1117523 ±  7%  sched_debug.cpu.avg_idle.max
   4333131            +4.4%    4525951        stress-ng.sockmany.ops
     71816            +4.4%      74988        stress-ng.sockmany.ops_per_sec
   7639150            +3.6%    7910527        stress-ng.time.voluntary_context_switches
    693603           -18.6%     564616 ±  3%  perf-c2c.DRAM.local
    611374           -16.8%     508688 ±  2%  perf-c2c.DRAM.remote
     19509          +994.2%     213470 ±  7%  perf-c2c.HITM.local
     20252          +957.6%     214187 ±  7%  perf-c2c.HITM.total
    204521            +3.1%     210765        proc-vmstat.nr_shmem
    938137            +2.9%     965493        proc-vmstat.nr_slab_reclaimable
   3102658            +3.0%    3196837        proc-vmstat.nr_slab_unreclaimable
   2113801            +1.8%    2151131        proc-vmstat.numa_hit
   1881174            +2.0%    1919223        proc-vmstat.numa_local
   6186586            +3.6%    6406837        proc-vmstat.pgalloc_normal
      0.76 ± 46%     -83.0%       0.13 ±144%  perf-sched.sch_delay.avg.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
      0.02 ±  2%      -6.3%       0.02 ±  2%  perf-sched.sch_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     15.43           -12.6%      13.48        perf-sched.total_wait_and_delay.average.ms
    234971           +15.6%     271684        perf-sched.total_wait_and_delay.count.ms
     15.37           -12.6%      13.43        perf-sched.total_wait_time.average.ms
    140.18 ±  5%     -37.2%      88.02 ± 11%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     10.17           -14.1%       8.74        perf-sched.wait_and_delay.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      4.02          -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    104089           +16.4%     121193        perf-sched.wait_and_delay.count.__cond_resched.__release_sock.release_sock.__inet_stream_connect.inet_stream_connect
     88.17 ±  6%     +68.1%     148.17 ± 13%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
    108724           +16.8%     127034        perf-sched.wait_and_delay.count.schedule_timeout.inet_csk_accept.inet_accept.do_accept
      1232          -100.0%       0.00        perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      4592 ± 12%     +26.1%       5792 ± 14%  perf-sched.wait_and_delay.count.schedule_timeout.wait_woken.sk_wait_data.tcp_recvmsg_locked
     11.29 ± 68%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      9.99           -13.3%       8.66        perf-sched.wait_time.avg.ms.__cond_resched.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
    139.53 ±  6%     -37.2%      87.60 ± 11%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
     10.15           -14.1%       8.72        perf-sched.wait_time.avg.ms.schedule_timeout.inet_csk_accept.inet_accept.do_accept
     41.10           -17.2%      34.03        perf-stat.i.MPKI
 1.424e+10           +14.6%  1.631e+10        perf-stat.i.branch-instructions
      2.28            -0.1        2.17        perf-stat.i.branch-miss-rate%
 3.193e+08            +9.4%  3.492e+08        perf-stat.i.branch-misses
     77.01            -9.5       67.48        perf-stat.i.cache-miss-rate%
 2.981e+09            -5.1%   2.83e+09        perf-stat.i.cache-misses
 3.806e+09            +8.4%  4.127e+09        perf-stat.i.cache-references
    217129            +3.2%     224056        perf-stat.i.context-switches
      8.68           -12.7%       7.58        perf-stat.i.cpi
    242.24            +4.0%     251.97        perf-stat.i.cycles-between-cache-misses
 7.608e+10           +14.1%  8.679e+10        perf-stat.i.instructions
      0.13           +13.3%       0.15        perf-stat.i.ipc
     39.15           -16.8%      32.58        perf-stat.overall.MPKI
      2.24            -0.1        2.14        perf-stat.overall.branch-miss-rate%
     78.30            -9.7       68.56        perf-stat.overall.cache-miss-rate%
      8.35           -12.4%       7.31        perf-stat.overall.cpi
    213.17            +5.3%     224.53        perf-stat.overall.cycles-between-cache-misses
      0.12           +14.1%       0.14        perf-stat.overall.ipc
 1.401e+10           +14.6%  1.604e+10        perf-stat.ps.branch-instructions
 3.139e+08            +9.4%  3.434e+08        perf-stat.ps.branch-misses
 2.931e+09            -5.1%  2.782e+09        perf-stat.ps.cache-misses
 3.743e+09            +8.4%  4.058e+09        perf-stat.ps.cache-references
    213541            +3.3%     220574        perf-stat.ps.context-switches
 7.485e+10           +14.1%  8.539e+10        perf-stat.ps.instructions
 4.597e+12           +13.9%  5.235e+12        perf-stat.total.instructions





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



end of thread

Thread overview: 12+ messages
2025-03-05  3:45 [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
2025-03-05  3:45 ` [PATCH net-next 1/2] inet: change lport contribution to inet_ehashfn() and inet6_ehashfn() Eric Dumazet
2025-03-06  4:24   ` Kuniyuki Iwashima
2025-03-06  7:54   ` Jason Xing
2025-03-06  8:14     ` Eric Dumazet
2025-03-06  8:19       ` Jason Xing
2025-03-17 13:44   ` kernel test robot
2025-03-05  3:45 ` [PATCH net-next 2/2] inet: call inet6_ehashfn() once from inet6_hash_connect() Eric Dumazet
2025-03-06  4:26   ` Kuniyuki Iwashima
2025-03-06  8:22   ` Jason Xing
2025-03-05  4:01 ` [PATCH net-next 0/2] tcp: even faster connect() under stress Eric Dumazet
2025-03-06 23:40 ` patchwork-bot+netdevbpf
