* [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash.
@ 2022-08-30 19:15 Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 1/5] tcp: Clean up some functions Kuniyuki Iwashima
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The more sockets we have in the hash table, the longer we spend looking
up a socket.  When a number of small workloads run on the same host,
they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes far more
resources than the others.  This often happens on a cloud service where
different workloads share the same computing resources.

On an EC2 c5.24xlarge instance (192 GiB memory and 524288 (1Mi / 2) ehash
entries), with iperf3 running in a different netns, creating 24Mi sockets
without data transfer in the root netns causes about a 10% performance
regression for the iperf3 connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

The regression is basically related to the length of each hash bucket's
list.  To see how performance drops as the list grows, I set
thash_entries to 131072 (1Mi / 8) for testing purposes, and here is the
result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0
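
The length column above is simply the average bucket list length:

  length = sockets / thash_entries   (e.g. 48Mi / 131072 = 384)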

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP.  It covers only ehash; the global bhash,
bhash2 and lhash2 are still shared.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  Also, we can reduce lock contention.

For details, please see the last patch.

  patch 1 - 3: prep for per-netns ehash
  patch     4: small optimisation for netns dismantle without TIME_WAIT sockets
  patch     5: add per-netns ehash


Changes:
  v3:
    * Patch 3
      * Drop mellanox and netronome driver changes (Eric Dumazet)
    * Patch 4
      * Add test results in the changelog
    * Patch 5
      * Move roundup_pow_of_two() into tcp_set_hashinfo() (Eric Dumazet)
      * Remove proc_tcp_child_ehash_entries() and use proc_douintvec_minmax()

  v2: https://lore.kernel.org/netdev/20220829161920.99409-1-kuniyu@amazon.com/
    * Drop flock() and UDP stuff
    * Patch 2
      * Rename inet_get_hashinfo() to tcp_or_dccp_get_hashinfo() (Eric Dumazet)
    * Patch 4
      * Remove unnecessary inet_twsk_purge() calls for unshare()
      * Factorise inet_twsk_purge() calls (Eric Dumazet)
    * Patch 5
      * Change the max bucket size to 16Mi
      * Use unsigned int for ehash size (Eric Dumazet)
      * Use GFP_KERNEL_ACCOUNT for the per-netns ehash allocation (Eric Dumazet)
      * Use current->nsproxy->net_ns for parent netns (Eric Dumazet)

  v1: https://lore.kernel.org/netdev/20220826000445.46552-1-kuniyu@amazon.com/


Kuniyuki Iwashima (5):
  tcp: Clean up some functions.
  tcp: Set NULL to sk->sk_prot->h.hashinfo.
  tcp: Access &tcp_hashinfo via net.
  tcp: Save unnecessary inet_twsk_purge() calls.
  tcp: Introduce optional per-netns ehash.

 Documentation/networking/ip-sysctl.rst        |  23 ++++
 .../chelsio/inline_crypto/chtls/chtls_cm.c    |   5 +-
 include/net/inet_hashtables.h                 |  16 +++
 include/net/netns/ipv4.h                      |   1 +
 include/net/tcp.h                             |   1 +
 net/core/filter.c                             |   5 +-
 net/dccp/proto.c                              |   2 +
 net/ipv4/af_inet.c                            |   2 +-
 net/ipv4/esp4.c                               |   3 +-
 net/ipv4/inet_connection_sock.c               |  22 ++--
 net/ipv4/inet_hashtables.c                    | 102 ++++++++++++----
 net/ipv4/inet_timewait_sock.c                 |   4 +-
 net/ipv4/netfilter/nf_socket_ipv4.c           |   2 +-
 net/ipv4/netfilter/nf_tproxy_ipv4.c           |  17 ++-
 net/ipv4/sysctl_net_ipv4.c                    |  39 ++++++
 net/ipv4/tcp.c                                |   1 +
 net/ipv4/tcp_diag.c                           |  18 ++-
 net/ipv4/tcp_ipv4.c                           | 114 ++++++++++++------
 net/ipv4/tcp_minisocks.c                      |  31 ++++-
 net/ipv6/esp6.c                               |   3 +-
 net/ipv6/inet6_hashtables.c                   |   4 +-
 net/ipv6/netfilter/nf_socket_ipv6.c           |   2 +-
 net/ipv6/netfilter/nf_tproxy_ipv6.c           |   5 +-
 net/ipv6/tcp_ipv6.c                           |  20 +--
 net/mptcp/mptcp_diag.c                        |   7 +-
 25 files changed, 339 insertions(+), 110 deletions(-)

-- 
2.30.2



* [PATCH v3 net-next 1/5] tcp: Clean up some functions.
  2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
@ 2022-08-30 19:15 ` Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 2/5] tcp: Set NULL to sk->sk_prot->h.hashinfo Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

This patch adds no functional change.  It cleans up some functions that
the following patches touch so that they are tidy and easy to
review/revert.  The changes are:

  - Keep reverse christmas tree order
  - Remove unnecessary init of port in inet_csk_find_open_port()
  - Use req_to_sk() once in reqsk_queue_unlink()

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inet_connection_sock.c | 21 ++++++++++-----------
 net/ipv4/inet_hashtables.c      | 29 +++++++++++++++--------------
 net/ipv4/tcp_ipv4.c             |  4 ++--
 3 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f0038043b661..8e71d65cfad4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -286,15 +286,13 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int port = 0;
+	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
-	bool relax = false;
-	int i, low, high, attempt_half;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
-	int l3mdev;
+	bool relax = false;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 ports_exhausted:
@@ -471,15 +469,14 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, port = snum;
-	struct net *net = sock_net(sk);
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
 	struct inet_bind2_bucket *tb2 = NULL;
 	struct inet_bind_bucket *tb = NULL;
 	bool head2_lock_acquired = false;
-	int l3mdev;
+	int ret = 1, port = snum, l3mdev;
+	struct net *net = sock_net(sk);
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
@@ -909,14 +906,16 @@ static void reqsk_migrate_reset(struct request_sock *req)
 /* return true if req was found in the ehash table */
 static bool reqsk_queue_unlink(struct request_sock *req)
 {
-	struct inet_hashinfo *hashinfo = req_to_sk(req)->sk_prot->h.hashinfo;
+	struct sock *sk = req_to_sk(req);
 	bool found = false;
 
-	if (sk_hashed(req_to_sk(req))) {
-		spinlock_t *lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
+	if (sk_hashed(sk)) {
+		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+		spinlock_t *lock;
 
+		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
 		spin_lock(lock);
-		found = __sk_nulls_del_node_init_rcu(req_to_sk(req));
+		found = __sk_nulls_del_node_init_rcu(sk);
 		spin_unlock(lock);
 	}
 	if (timer_pending(&req->rsk_timer) && del_timer_sync(&req->rsk_timer))
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 60d77e234a68..29dce78de179 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -169,13 +169,14 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 static void __inet_put_port(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
-			hashinfo->bhash_size);
-	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(hashinfo, sk, sock_net(sk),
-				      inet_sk(sk)->inet_num);
+	struct inet_bind_hashbucket *head, *head2;
+	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
+	int bhash;
+
+	bhash = inet_bhashfn(net, inet_sk(sk)->inet_num, hashinfo->bhash_size);
+	head = &hashinfo->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(hashinfo, sk, net, inet_sk(sk)->inet_num);
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -209,17 +210,17 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
 	unsigned short port = inet_sk(child)->inet_num;
-	const int bhash = inet_bhashfn(sock_net(sk), port,
-			table->bhash_size);
-	struct inet_bind_hashbucket *head = &table->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(table, child, sock_net(sk), port);
+	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
-	bool update_fastreuse = false;
 	struct net *net = sock_net(sk);
+	bool update_fastreuse = false;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
-	int l3mdev;
+	int bhash, l3mdev;
+
+	bhash = inet_bhashfn(net, port, table->bhash_size);
+	head = &table->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(table, child, net, port);
 
 	spin_lock(&head->lock);
 	spin_lock(&head2->lock);
@@ -629,8 +630,8 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	struct hlist_nulls_head *list;
 	struct inet_ehash_bucket *head;
+	struct hlist_nulls_head *list;
 	spinlock_t *lock;
 	bool ret = true;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2ad67f75be..61a9bf661814 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2406,9 +2406,9 @@ static void *established_get_first(struct seq_file *seq)
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
-	struct sock *sk = cur;
-	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
+	struct hlist_nulls_node *node;
+	struct sock *sk = cur;
 
 	++st->num;
 	++st->offset;
-- 
2.30.2



* [PATCH v3 net-next 2/5] tcp: Set NULL to sk->sk_prot->h.hashinfo.
  2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 1/5] tcp: Clean up some functions Kuniyuki Iwashima
@ 2022-08-30 19:15 ` Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will soon introduce an optional per-netns ehash.

This means we cannot use the global sk->sk_prot->h.hashinfo
to fetch a TCP hashinfo.

Instead, set sk->sk_prot->h.hashinfo to NULL for TCP and get the
proper hashinfo from net->ipv4.tcp_death_row->hashinfo.

Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
disabled.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_hashtables.h   | 10 ++++++++++
 net/ipv4/af_inet.c              |  2 +-
 net/ipv4/inet_connection_sock.c |  9 ++++-----
 net/ipv4/inet_hashtables.c      | 14 +++++++-------
 net/ipv4/tcp_ipv4.c             |  2 +-
 net/ipv6/tcp_ipv6.c             |  2 +-
 6 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 44a419b9e3d5..1ff5603cddff 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -170,6 +170,16 @@ struct inet_hashinfo {
 	struct inet_listen_hashbucket	*lhash2;
 };
 
+static inline struct inet_hashinfo *tcp_or_dccp_get_hashinfo(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IP_DCCP)
+	return sk->sk_prot->h.hashinfo ? :
+		sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#else
+	return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#endif
+}
+
 static inline struct inet_listen_hashbucket *
 inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d3ab1ae32ef5..e2c219382345 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1250,7 +1250,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	}
 
 	prev_addr_hashbucket =
-		inet_bhashfn_portaddr(sk->sk_prot->h.hashinfo, sk,
+		inet_bhashfn_portaddr(tcp_or_dccp_get_hashinfo(sk), sk,
 				      sock_net(sk), inet->inet_num);
 
 	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8e71d65cfad4..ebca860e113f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -285,7 +285,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			struct inet_bind2_bucket **tb2_ret,
 			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
@@ -467,8 +467,8 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
  */
 int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
+	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
@@ -910,10 +910,9 @@ static bool reqsk_queue_unlink(struct request_sock *req)
 	bool found = false;
 
 	if (sk_hashed(sk)) {
-		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-		spinlock_t *lock;
+		struct inet_hashinfo *hashinfo = tcp_or_dccp_get_hashinfo(sk);
+		spinlock_t *lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
 
-		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
 		spin_lock(lock);
 		found = __sk_nulls_del_node_init_rcu(sk);
 		spin_unlock(lock);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 29dce78de179..bdb5427a7a3d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -168,7 +168,7 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
  */
 static void __inet_put_port(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
@@ -208,7 +208,7 @@ EXPORT_SYMBOL(inet_put_port);
 
 int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
-	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *table = tcp_or_dccp_get_hashinfo(sk);
 	unsigned short port = inet_sk(child)->inet_num;
 	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
@@ -629,7 +629,7 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
  */
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_ehash_bucket *head;
 	struct hlist_nulls_head *list;
 	spinlock_t *lock;
@@ -701,7 +701,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 
 int __inet_hash(struct sock *sk, struct sock *osk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_listen_hashbucket *ilb2;
 	int err = 0;
 
@@ -747,7 +747,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 
 void inet_unhash(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = tcp_or_dccp_get_hashinfo(sk);
 
 	if (sk_unhashed(sk))
 		return;
@@ -834,7 +834,7 @@ inet_bind2_bucket_find(const struct inet_bind_hashbucket *head, const struct net
 struct inet_bind_hashbucket *
 inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, int port)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	u32 hash;
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr addr_any = {};
@@ -850,7 +850,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
 
 int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = tcp_or_dccp_get_hashinfo(sk);
 	struct inet_bind2_bucket *tb2, *new_tb2;
 	int l3mdev = inet_sk_bound_l3mdev(sk);
 	struct inet_bind_hashbucket *head2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 61a9bf661814..7c3b3ce85a5e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3083,7 +3083,7 @@ struct proto tcp_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp_timewait_sock_ops,
 	.rsk_prot		= &tcp_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ff5c4fc135fc..791f24da9212 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2193,7 +2193,7 @@ struct proto tcpv6_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp6_timewait_sock_ops,
 	.rsk_prot		= &tcp6_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };
-- 
2.30.2



* [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 1/5] tcp: Clean up some functions Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 2/5] tcp: Set NULL to sk->sk_prot->h.hashinfo Kuniyuki Iwashima
@ 2022-08-30 19:15 ` Kuniyuki Iwashima
  2022-09-01 10:57   ` Paolo Abeni
  2022-08-30 19:15 ` [PATCH v3 net-next 4/5] tcp: Save unnecessary inet_twsk_purge() calls Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 5/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
  4 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will soon introduce an optional per-netns ehash.

This means we cannot use tcp_hashinfo directly in most places.

Instead, access it via net->ipv4.tcp_death_row->hashinfo.

Direct access to tcp_hashinfo remains valid only while initialising
tcp_hashinfo itself and while creating/destroying each netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
 net/core/filter.c                             |  5 +-
 net/ipv4/esp4.c                               |  3 +-
 net/ipv4/inet_hashtables.c                    |  2 +-
 net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
 net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
 net/ipv4/tcp_diag.c                           | 18 +++++-
 net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
 net/ipv4/tcp_minisocks.c                      |  2 +-
 net/ipv6/esp6.c                               |  3 +-
 net/ipv6/inet6_hashtables.c                   |  4 +-
 net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
 net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
 net/ipv6/tcp_ipv6.c                           | 16 ++---
 net/mptcp/mptcp_diag.c                        |  7 ++-
 15 files changed, 92 insertions(+), 62 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
index ddfe9208529a..f90bfba4b303 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
@@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
 	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
 }
 
-static void inet_inherit_port(struct inet_hashinfo *hash_info,
-			      struct sock *lsk, struct sock *newsk)
+static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
 {
 	local_bh_disable();
 	__inet_inherit_port(lsk, newsk);
@@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
 						     ipv4.sysctl_tcp_window_scaling),
 					   tp->window_clamp);
 	neigh_release(n);
-	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
+	inet_inherit_port(lsk, newsk);
 	csk_set_flag(csk, CSK_CONN_INLINE);
 	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
 
diff --git a/net/core/filter.c b/net/core/filter.c
index c191db80ce93..f4ec4ba1e8cc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6469,6 +6469,7 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	bool refcounted = false;
 	struct sock *sk = NULL;
 
@@ -6477,7 +6478,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		__be32 dst4 = tuple->ipv4.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet_lookup(net, hinfo, NULL, 0,
 					   src4, tuple->ipv4.sport,
 					   dst4, tuple->ipv4.dport,
 					   dif, sdif, &refcounted);
@@ -6491,7 +6492,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet6_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet6_lookup(net, hinfo, NULL, 0,
 					    src6, tuple->ipv6.sport,
 					    dst6, ntohs(tuple->ipv6.dport),
 					    dif, sdif, &refcounted);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 5c03eba787e5..951c965d9322 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -134,6 +134,7 @@ static void esp_free_tcp_sk(struct rcu_head *head)
 static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 {
 	struct xfrm_encap_tmpl *encap = x->encap;
+	struct net *net = xs_net(x);
 	struct esp_tcp_sk *esk;
 	__be16 sport, dport;
 	struct sock *nsk;
@@ -160,7 +161,7 @@ static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 	}
 	spin_unlock_bh(&x->lock);
 
-	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+	sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, x->id.daddr.a4,
 				     dport, x->props.saddr.a4, sport, 0);
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bdb5427a7a3d..643d3a7031eb 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -386,7 +386,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (hashinfo != &tcp_hashinfo)
+	if (hashinfo != net->ipv4.tcp_death_row->hashinfo)
 		return NULL; /* only TCP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP, saddr, sport,
diff --git a/net/ipv4/netfilter/nf_socket_ipv4.c b/net/ipv4/netfilter/nf_socket_ipv4.c
index 2d42e4c35a20..9ba496d061cc 100644
--- a/net/ipv4/netfilter/nf_socket_ipv4.c
+++ b/net/ipv4/netfilter/nf_socket_ipv4.c
@@ -71,7 +71,7 @@ nf_socket_get_sock_v4(struct net *net, struct sk_buff *skb, const int doff,
 {
 	switch (protocol) {
 	case IPPROTO_TCP:
-		return inet_lookup(net, &tcp_hashinfo, skb, doff,
+		return inet_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff,
 				   saddr, sport, daddr, dport,
 				   in->ifindex);
 	case IPPROTO_UDP:
diff --git a/net/ipv4/netfilter/nf_tproxy_ipv4.c b/net/ipv4/netfilter/nf_tproxy_ipv4.c
index b2bae0b0e42a..ce4e0d50a55c 100644
--- a/net/ipv4/netfilter/nf_tproxy_ipv4.c
+++ b/net/ipv4/netfilter/nf_tproxy_ipv4.c
@@ -79,6 +79,7 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 		      const struct net_device *in,
 		      const enum nf_tproxy_lookup_t lookup_type)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	struct sock *sk;
 
 	switch (protocol) {
@@ -92,12 +93,10 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 
 		switch (lookup_type) {
 		case NF_TPROXY_LOOKUP_LISTENER:
-			sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
-						    ip_hdrlen(skb) +
-						      __tcp_hdrlen(hp),
-						    saddr, sport,
-						    daddr, dport,
-						    in->ifindex, 0);
+			sk = inet_lookup_listener(net, hinfo, skb,
+						  ip_hdrlen(skb) + __tcp_hdrlen(hp),
+						  saddr, sport, daddr, dport,
+						  in->ifindex, 0);
 
 			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 				sk = NULL;
@@ -108,9 +107,9 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 			 */
 			break;
 		case NF_TPROXY_LOOKUP_ESTABLISHED:
-			sk = inet_lookup_established(net, &tcp_hashinfo,
-						    saddr, sport, daddr, dport,
-						    in->ifindex);
+			sk = inet_lookup_established(net, hinfo,
+						     saddr, sport, daddr, dport,
+						     in->ifindex);
 			break;
 		default:
 			BUG();
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index 75a1c985f49a..958c923020c0 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -181,13 +181,21 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r)
 {
-	inet_diag_dump_icsk(&tcp_hashinfo, skb, cb, r);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	inet_diag_dump_icsk(hinfo, skb, cb, r);
 }
 
 static int tcp_diag_dump_one(struct netlink_callback *cb,
 			     const struct inet_diag_req_v2 *req)
 {
-	return inet_diag_dump_one_icsk(&tcp_hashinfo, cb, req);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	return inet_diag_dump_one_icsk(hinfo, cb, req);
 }
 
 #ifdef CONFIG_INET_DIAG_DESTROY
@@ -195,9 +203,13 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 			    const struct inet_diag_req_v2 *req)
 {
 	struct net *net = sock_net(in_skb->sk);
-	struct sock *sk = inet_diag_find_one_icsk(net, &tcp_hashinfo, req);
+	struct inet_hashinfo *hinfo;
+	struct sock *sk;
 	int err;
 
+	hinfo = net->ipv4.tcp_death_row->hashinfo;
+	sk = inet_diag_find_one_icsk(net, hinfo, req);
+
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7c3b3ce85a5e..b07930643b11 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -249,7 +249,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	if (!inet->inet_saddr) {
 		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
 								     sk, sock_net(sk),
 								     inet->inet_num);
 			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
@@ -494,7 +494,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 	int err;
 	struct net *net = dev_net(skb->dev);
 
-	sk = __inet_lookup_established(net, &tcp_hashinfo, iph->daddr,
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, iph->daddr,
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(skb), 0);
 	if (!sk) {
@@ -759,7 +759,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 		 * Incoming packet is checked with md5 hash with finding key,
 		 * no RST generated if md5 hash doesn't match.
 		 */
-		sk1 = __inet_lookup_listener(net, &tcp_hashinfo, NULL, 0,
+		sk1 = __inet_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo, NULL, 0,
 					     ip_hdr(skb)->saddr,
 					     th->source, ip_hdr(skb)->daddr,
 					     ntohs(th->source), dif, sdif);
@@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
 
 int tcp_v4_early_demux(struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph;
 	const struct tcphdr *th;
 	struct sock *sk;
@@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	if (th->doff < sizeof(struct tcphdr) / 4)
 		return 0;
 
-	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 				       iph->saddr, th->source,
 				       iph->daddr, ntohs(th->dest),
 				       skb->skb_iif, inet_sdif(skb));
@@ -1970,7 +1971,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	th = (const struct tcphdr *)skb->data;
 	iph = ip_hdr(skb);
 lookup:
-	sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
+	sk = __inet_lookup_skb(net->ipv4.tcp_death_row->hashinfo,
+			       skb, __tcp_hdrlen(th), th->source,
 			       th->dest, sdif, &refcounted);
 	if (!sk)
 		goto no_tcp_socket;
@@ -2152,9 +2154,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	}
 	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
 	case TCP_TW_SYN: {
-		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
-							&tcp_hashinfo, skb,
-							__tcp_hdrlen(th),
+		struct sock *sk2 = inet_lookup_listener(net,
+							net->ipv4.tcp_death_row->hashinfo,
+							skb, __tcp_hdrlen(th),
 							iph->saddr, th->source,
 							iph->daddr, th->dest,
 							inet_iif(skb),
@@ -2304,15 +2306,16 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
  */
 static void *listening_get_first(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	st->offset = 0;
-	for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) {
+	for (; st->bucket <= hinfo->lhash2_mask; st->bucket++) {
 		struct inet_listen_hashbucket *ilb2;
 		struct hlist_nulls_node *node;
 		struct sock *sk;
 
-		ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+		ilb2 = &hinfo->lhash2[st->bucket];
 		if (hlist_nulls_empty(&ilb2->nulls_head))
 			continue;
 
@@ -2337,6 +2340,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct tcp_iter_state *st = seq->private;
 	struct inet_listen_hashbucket *ilb2;
 	struct hlist_nulls_node *node;
+	struct inet_hashinfo *hinfo;
 	struct sock *sk = cur;
 
 	++st->num;
@@ -2348,7 +2352,8 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 			return sk;
 	}
 
-	ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+	hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
+	ilb2 = &hinfo->lhash2[st->bucket];
 	spin_unlock(&ilb2->lock);
 	++st->bucket;
 	return listening_get_first(seq);
@@ -2370,9 +2375,10 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 	return rc;
 }
 
-static inline bool empty_bucket(const struct tcp_iter_state *st)
+static inline bool empty_bucket(struct inet_hashinfo *hinfo,
+				const struct tcp_iter_state *st)
 {
-	return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain);
+	return hlist_nulls_empty(&hinfo->ehash[st->bucket].chain);
 }
 
 /*
@@ -2381,20 +2387,21 @@ static inline bool empty_bucket(const struct tcp_iter_state *st)
  */
 static void *established_get_first(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	st->offset = 0;
-	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
+	for (; st->bucket <= hinfo->ehash_mask; ++st->bucket) {
 		struct sock *sk;
 		struct hlist_nulls_node *node;
-		spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket);
+		spinlock_t *lock = inet_ehash_lockp(hinfo, st->bucket);
 
 		/* Lockless fast path for the common case of empty buckets */
-		if (empty_bucket(st))
+		if (empty_bucket(hinfo, st))
 			continue;
 
 		spin_lock_bh(lock);
-		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
+		sk_nulls_for_each(sk, node, &hinfo->ehash[st->bucket].chain) {
 			if (seq_sk_match(seq, sk))
 				return sk;
 		}
@@ -2406,6 +2413,7 @@ static void *established_get_first(struct seq_file *seq)
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;
@@ -2420,7 +2428,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 			return sk;
 	}
 
-	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+	spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 	++st->bucket;
 	return established_get_first(seq);
 }
@@ -2458,6 +2466,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 
 static void *tcp_seek_last_pos(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 	int bucket = st->bucket;
 	int offset = st->offset;
@@ -2466,7 +2475,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
-		if (st->bucket > tcp_hashinfo.lhash2_mask)
+		if (st->bucket > hinfo->lhash2_mask)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_first(seq);
@@ -2478,7 +2487,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 		fallthrough;
 	case TCP_SEQ_STATE_ESTABLISHED:
-		if (st->bucket > tcp_hashinfo.ehash_mask)
+		if (st->bucket > hinfo->ehash_mask)
 			break;
 		rc = established_get_first(seq);
 		while (offset-- && rc && bucket == st->bucket)
@@ -2546,16 +2555,17 @@ EXPORT_SYMBOL(tcp_seq_next);
 
 void tcp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
 		if (v != SEQ_START_TOKEN)
-			spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
+			spin_unlock(&hinfo->lhash2[st->bucket].lock);
 		break;
 	case TCP_SEQ_STATE_ESTABLISHED:
 		if (v)
-			spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+			spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 		break;
 	}
 }
@@ -2750,6 +2760,7 @@ static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
 static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 						 struct sock *start_sk)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	struct hlist_nulls_node *node;
@@ -2769,7 +2780,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 			expected++;
 		}
 	}
-	spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
+	spin_unlock(&hinfo->lhash2[st->bucket].lock);
 
 	return expected;
 }
@@ -2777,6 +2788,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 						   struct sock *start_sk)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	struct hlist_nulls_node *node;
@@ -2796,13 +2808,14 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 			expected++;
 		}
 	}
-	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+	spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 
 	return expected;
 }
 
 static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	unsigned int expected;
@@ -2818,7 +2831,7 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 		st->offset = 0;
 		st->bucket++;
 		if (st->state == TCP_SEQ_STATE_LISTENING &&
-		    st->bucket > tcp_hashinfo.lhash2_mask) {
+		    st->bucket > hinfo->lhash2_mask) {
 			st->state = TCP_SEQ_STATE_ESTABLISHED;
 			st->bucket = 0;
 		}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cb95d88497ae..361aad67c6d6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -319,7 +319,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, &tcp_hashinfo);
+		inet_twsk_hashdance(tw, sk, tcp_death_row->hashinfo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 8220923a12f7..3cdeed3d2e1a 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -151,6 +151,7 @@ static void esp_free_tcp_sk(struct rcu_head *head)
 static struct sock *esp6_find_tcp_sk(struct xfrm_state *x)
 {
 	struct xfrm_encap_tmpl *encap = x->encap;
+	struct net *net = xs_net(x);
 	struct esp_tcp_sk *esk;
 	__be16 sport, dport;
 	struct sock *nsk;
@@ -177,7 +178,7 @@ static struct sock *esp6_find_tcp_sk(struct xfrm_state *x)
 	}
 	spin_unlock_bh(&x->lock);
 
-	sk = __inet6_lookup_established(xs_net(x), &tcp_hashinfo, &x->id.daddr.in6,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, &x->id.daddr.in6,
 					dport, &x->props.saddr.in6, ntohs(sport), 0, 0);
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 7d53d62783b1..52513a6a54fd 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -21,8 +21,6 @@
 #include <net/ip.h>
 #include <net/sock_reuseport.h>
 
-extern struct inet_hashinfo tcp_hashinfo;
-
 u32 inet6_ehashfn(const struct net *net,
 		  const struct in6_addr *laddr, const u16 lport,
 		  const struct in6_addr *faddr, const __be16 fport)
@@ -169,7 +167,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (hashinfo != &tcp_hashinfo)
+	if (hashinfo != net->ipv4.tcp_death_row->hashinfo)
 		return NULL; /* only TCP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP, saddr, sport,
diff --git a/net/ipv6/netfilter/nf_socket_ipv6.c b/net/ipv6/netfilter/nf_socket_ipv6.c
index aa5bb8789ba0..6356407065c7 100644
--- a/net/ipv6/netfilter/nf_socket_ipv6.c
+++ b/net/ipv6/netfilter/nf_socket_ipv6.c
@@ -83,7 +83,7 @@ nf_socket_get_sock_v6(struct net *net, struct sk_buff *skb, int doff,
 {
 	switch (protocol) {
 	case IPPROTO_TCP:
-		return inet6_lookup(net, &tcp_hashinfo, skb, doff,
+		return inet6_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff,
 				    saddr, sport, daddr, dport,
 				    in->ifindex);
 	case IPPROTO_UDP:
diff --git a/net/ipv6/netfilter/nf_tproxy_ipv6.c b/net/ipv6/netfilter/nf_tproxy_ipv6.c
index 6bac68fb27a3..b0e5fb3e4d95 100644
--- a/net/ipv6/netfilter/nf_tproxy_ipv6.c
+++ b/net/ipv6/netfilter/nf_tproxy_ipv6.c
@@ -80,6 +80,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 		      const struct net_device *in,
 		      const enum nf_tproxy_lookup_t lookup_type)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	struct sock *sk;
 
 	switch (protocol) {
@@ -93,7 +94,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 
 		switch (lookup_type) {
 		case NF_TPROXY_LOOKUP_LISTENER:
-			sk = inet6_lookup_listener(net, &tcp_hashinfo, skb,
+			sk = inet6_lookup_listener(net, hinfo, skb,
 						   thoff + __tcp_hdrlen(hp),
 						   saddr, sport,
 						   daddr, ntohs(dport),
@@ -108,7 +109,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 			 */
 			break;
 		case NF_TPROXY_LOOKUP_ESTABLISHED:
-			sk = __inet6_lookup_established(net, &tcp_hashinfo,
+			sk = __inet6_lookup_established(net, hinfo,
 							saddr, sport, daddr, ntohs(dport),
 							in->ifindex, 0);
 			break;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 791f24da9212..27b2fd98a2c4 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -286,12 +286,14 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		goto failure;
 	}
 
+	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
+
 	if (!saddr) {
 		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
 		struct in6_addr prev_v6_rcv_saddr;
 
 		if (icsk->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
 								     sk, sock_net(sk),
 								     inet->inet_num);
 			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
@@ -325,7 +327,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	inet->inet_dport = usin->sin6_port;
 
 	tcp_set_state(sk, TCP_SYN_SENT);
-	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
 	err = inet6_hash_connect(tcp_death_row, sk);
 	if (err)
 		goto late_failure;
@@ -403,7 +404,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	bool fatal;
 	int err;
 
-	sk = __inet6_lookup_established(net, &tcp_hashinfo,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					&hdr->daddr, th->dest,
 					&hdr->saddr, ntohs(th->source),
 					skb->dev->ifindex, inet6_sdif(skb));
@@ -1037,7 +1038,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 		 * no RST generated if md5 hash doesn't match.
 		 */
 		sk1 = inet6_lookup_listener(net,
-					   &tcp_hashinfo, NULL, 0,
+					   net->ipv4.tcp_death_row->hashinfo, NULL, 0,
 					   &ipv6h->saddr,
 					   th->source, &ipv6h->daddr,
 					   ntohs(th->source), dif, sdif);
@@ -1636,7 +1637,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	hdr = ipv6_hdr(skb);
 
 lookup:
-	sk = __inet6_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th),
+	sk = __inet6_lookup_skb(net->ipv4.tcp_death_row->hashinfo, skb, __tcp_hdrlen(th),
 				th->source, th->dest, inet6_iif(skb), sdif,
 				&refcounted);
 	if (!sk)
@@ -1811,7 +1812,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	{
 		struct sock *sk2;
 
-		sk2 = inet6_lookup_listener(dev_net(skb->dev), &tcp_hashinfo,
+		sk2 = inet6_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo,
 					    skb, __tcp_hdrlen(th),
 					    &ipv6_hdr(skb)->saddr, th->source,
 					    &ipv6_hdr(skb)->daddr,
@@ -1844,6 +1845,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 void tcp_v6_early_demux(struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb->dev);
 	const struct ipv6hdr *hdr;
 	const struct tcphdr *th;
 	struct sock *sk;
@@ -1861,7 +1863,7 @@ void tcp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	/* Note : We use inet6_iif() here, not tcp_v6_iif() */
-	sk = __inet6_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					&hdr->saddr, th->source,
 					&hdr->daddr, ntohs(th->dest),
 					inet6_iif(skb), inet6_sdif(skb));
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index 7f9a71780437..12b0aac25a39 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -81,15 +81,18 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba
 	struct mptcp_diag_ctx *diag_ctx = (void *)cb->ctx;
 	struct nlattr *bc = cb_data->inet_diag_nla_bc;
 	struct net *net = sock_net(skb->sk);
+	struct inet_hashinfo *hinfo;
 	int i;
 
-	for (i = diag_ctx->l_slot; i <= tcp_hashinfo.lhash2_mask; i++) {
+	hinfo = net->ipv4.tcp_death_row->hashinfo;
+
+	for (i = diag_ctx->l_slot; i <= hinfo->lhash2_mask; i++) {
 		struct inet_listen_hashbucket *ilb;
 		struct hlist_nulls_node *node;
 		struct sock *sk;
 		int num = 0;
 
-		ilb = &tcp_hashinfo.lhash2[i];
+		ilb = &hinfo->lhash2[i];
 
 		rcu_read_lock();
 		spin_lock(&ilb->lock);
-- 
2.30.2



* [PATCH v3 net-next 4/5] tcp: Save unnecessary inet_twsk_purge() calls.
  2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2022-08-30 19:15 ` [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
@ 2022-08-30 19:15 ` Kuniyuki Iwashima
  2022-08-30 19:15 ` [PATCH v3 net-next 5/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
  4 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

While destroying a netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  Thus, commands like
the following make the kernel walk through the potentially big ehash twice
even though the netns has no TIME_WAIT sockets:

  # ip netns add test
  # ip netns del test

  or

  # unshare -n /bin/true >/dev/null

AF_INET6 uses module_init() and is therefore initialised after AF_INET,
which uses fs_initcall(), so tcpv6_net_ops is always registered after
tcp_sk_ops.  Also, netns are cleaned up in the reverse order, so
tcpv6_net_exit_batch() is always called before tcp_sk_exit_batch().

This characteristic lets us skip such unneeded iterations when no netns
in net_exit_list has TIME_WAIT sockets.  It also eliminates the tax
imposed by the additional unshare() described in the next patch to
guarantee the per-netns ehash size.

Tested:

  # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
  # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
  # echo function > /sys/kernel/debug/tracing/current_tracer
  # cat ./add_del_unshare.sh
  for i in `seq 1 40`
  do
      (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
  done
  wait;
  # ./add_del_unshare.sh

Before the patch:

  # cat /sys/kernel/debug/tracing/trace_pipe
    kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
    kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
    kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
    kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
    kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
    kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
    kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
    kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
    kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch

After:

  # cat /sys/kernel/debug/tracing/trace_pipe
    kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
    kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
    kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/tcp.h        |  1 +
 net/ipv4/tcp_ipv4.c      |  6 ++++--
 net/ipv4/tcp_minisocks.c | 24 ++++++++++++++++++++++++
 net/ipv6/tcp_ipv6.c      |  2 +-
 4 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d10962b9f0d0..f60996c1d7b3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -346,6 +346,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_space_adjust(struct sock *sk);
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
 void tcp_twsk_destructor(struct sock *sk);
+void tcp_twsk_purge(struct list_head *net_exit_list, int family);
 ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
 			struct pipe_inode_info *pipe, size_t len,
 			unsigned int flags);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b07930643b11..f4a502d57d45 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3109,8 +3109,10 @@ static void __net_exit tcp_sk_exit(struct net *net)
 	if (net->ipv4.tcp_congestion_control)
 		bpf_module_put(net->ipv4.tcp_congestion_control,
 			       net->ipv4.tcp_congestion_control->owner);
-	if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
+	if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
 		kfree(tcp_death_row);
+		net->ipv4.tcp_death_row = NULL;
+	}
 }
 
 static int __net_init tcp_sk_init(struct net *net)
@@ -3210,7 +3212,7 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 {
 	struct net *net;
 
-	inet_twsk_purge(&tcp_hashinfo, AF_INET);
+	tcp_twsk_purge(net_exit_list, AF_INET);
 
 	list_for_each_entry(net, net_exit_list, exit_list)
 		tcp_fastopen_ctx_destroy(net);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 361aad67c6d6..9168c5a33344 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -347,6 +347,30 @@ void tcp_twsk_destructor(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_destructor);
 
+void tcp_twsk_purge(struct list_head *net_exit_list, int family)
+{
+	struct net *net;
+
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		if (!net->ipv4.tcp_death_row)
+			continue;
+
+		/* AF_INET6 using module_init() is always registered after
+		 * AF_INET using fs_initcall() and cleaned up in the reverse
+		 * order.
+		 *
+		 * The last refcount is decremented later in tcp_sk_exit().
+		 */
+		if (IS_ENABLED(CONFIG_IPV6) && family == AF_INET6 &&
+		    refcount_read(&net->ipv4.tcp_death_row->tw_refcount) == 1)
+			continue;
+
+		inet_twsk_purge(&tcp_hashinfo, family);
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_twsk_purge);
+
 /* Warning : This function is called without sk_listener being locked.
  * Be sure to read socket fields once, as their value could change under us.
  */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 27b2fd98a2c4..9cbc7f0d7149 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2229,7 +2229,7 @@ static void __net_exit tcpv6_net_exit(struct net *net)
 
 static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
 {
-	inet_twsk_purge(&tcp_hashinfo, AF_INET6);
+	tcp_twsk_purge(net_exit_list, AF_INET6);
 }
 
 static struct pernet_operations tcpv6_net_ops = {
-- 
2.30.2



* [PATCH v3 net-next 5/5] tcp: Introduce optional per-netns ehash.
  2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2022-08-30 19:15 ` [PATCH v3 net-next 4/5] tcp: Save unnecessary inet_twsk_purge() calls Kuniyuki Iwashima
@ 2022-08-30 19:15 ` Kuniyuki Iwashima
  4 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-30 19:15 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The more sockets we have in the hash table, the longer we spend looking
up a socket.  When a number of small workloads run on the same host,
they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes far more
resources than the others.  This often happens on a cloud service where
different workloads share the same computing resources.

On an EC2 c5.24xlarge instance (192 GiB memory and 524288 (1Mi / 2) ehash
entries), with iperf3 running in a different netns, creating 24Mi sockets
without data transfer in the root netns causes about a 10% performance
regression for the iperf3 connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

The regression is basically related to the length of each hash bucket's
list.  To see how performance drops as the list grows, I set
thash_entries to 131072 (1Mi / 8) for testing purposes, and here is the
result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP.  It covers only ehash; the global bhash,
bhash2 and lhash2 are still shared.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  Also, we can reduce lock contention.

We can control the ehash size with a new sysctl knob.  However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover, we
fall back to using the global ehash in case we fail to allocate enough
memory for a new ehash.  The maximum size is 16Mi buckets, which is large
enough that even with 48Mi sockets the average list length is only 3
(48Mi / 16Mi), and the regression would be less than 1%.

We can check the current ehash size via another read-only sysctl knob,
net.ipv4.tcp_ehash_entries.  A negative value means the netns shares the
global ehash (the per-netns ehash is disabled or its allocation failed).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
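
The sign convention above can also be checked programmatically.  Below is
a minimal userspace sketch (not part of this series; the output format is
only illustrative) that reads the knob and reports whether the current
netns owns its ehash:

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          FILE *f = fopen("/proc/sys/net/ipv4/tcp_ehash_entries", "r");
          long entries;

          if (!f || fscanf(f, "%ld", &entries) != 1)
                  return 1;
          fclose(f);

          /* negative: sharing the initial netns's (global) ehash */
          printf("%ld buckets, %s\n", labs(entries),
                 entries < 0 ? "shared global ehash" : "own ehash");
          return 0;
  }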

When two or more processes in the same netns concurrently create
per-netns ehash tables with different sizes, we need to guarantee the
size in one of the following ways (see the sketches after this list):

  1) Share the global ehash and then create a per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0.  This gives the
  process dedicated netns sysctl knobs, where it can safely change
  tcp_child_ehash_entries and then clone()/unshare() again to create a
  per-netns ehash.

  2) Control writes to the sysctl knob with BPF

  We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny reads/writes on
  sysctl knobs.
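
For 1), a minimal userspace sketch (assuming root/CAP_SYS_ADMIN; the
bucket count 16384 and the lack of error reporting are only
illustrative) could look like:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          FILE *f;

          /* 1st unshare(): private sysctl knobs, still sharing the
           * parent's (global) ehash because the knob is 0.
           */
          if (unshare(CLONE_NEWNET))
                  return 1;

          f = fopen("/proc/sys/net/ipv4/tcp_child_ehash_entries", "w");
          if (!f || fprintf(f, "16384\n") < 0 || fclose(f))
                  return 1;

          /* 2nd unshare(): this netns gets its own 16384-bucket ehash,
           * regardless of what other processes do with the initial
           * netns's knob.
           */
          return unshare(CLONE_NEWNET) ? 1 : 0;
  }

For 2), a rough sketch of a BPF_PROG_TYPE_CGROUP_SYSCTL program (again
only illustrative, not part of this series) that rejects writes to
tcp_child_ehash_entries could be:

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("cgroup/sysctl")
  int deny_child_ehash_write(struct bpf_sysctl *ctx)
  {
          const char knob[] = "tcp_child_ehash_entries";
          char name[32] = {};
          int i, len;

          if (!ctx->write)
                  return 1;       /* always allow reads */

          len = bpf_sysctl_get_name(ctx, name, sizeof(name),
                                    BPF_F_SYSCTL_BASE_NAME);
          if (len != sizeof(knob) - 1)
                  return 1;       /* a knob with a different name */

          for (i = 0; i < sizeof(knob) - 1; i++)
                  if (name[i] != knob[i])
                          return 1;

          return 0;               /* reject the write with -EPERM */
  }

  char _license[] SEC("license") = "GPL";

Once such a program is attached to the container's cgroup (e.g. with
bpftool cgroup attach), processes in that cgroup can no longer change
the knob, so the child ehash size stays fixed.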

Note that the default values of the following two sysctl knobs depend on
the ehash size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

Also, we could further optimise ehash lookup/iteration in the future by
removing the netns comparison for a per-netns ehash.

As a bonus, we can dismantle a netns faster.  Currently, while destroying
a netns, we call inet_twsk_purge(), which walks through the global ehash.
That walk can be expensive because the global ehash can hold many sockets
other than TIME_WAIT from all netns.  With a split ehash, inet_twsk_purge()
only has to clean up the TIME_WAIT sockets in each netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 23 +++++++++++
 include/net/inet_hashtables.h          |  6 +++
 include/net/netns/ipv4.h               |  1 +
 net/dccp/proto.c                       |  2 +
 net/ipv4/inet_hashtables.c             | 57 ++++++++++++++++++++++++++
 net/ipv4/inet_timewait_sock.c          |  4 +-
 net/ipv4/sysctl_net_ipv4.c             | 39 ++++++++++++++++++
 net/ipv4/tcp.c                         |  1 +
 net/ipv4/tcp_ipv4.c                    | 39 +++++++++++++++---
 net/ipv4/tcp_minisocks.c               |  9 +++-
 10 files changed, 172 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 56cd4ea059b2..72547df430a1 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1037,6 +1037,29 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 1000
 
+tcp_ehash_entries - INTEGER
+	Read-only number of hash buckets for TCP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for TCP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	If the value is not 0, the kernel uses a value rounded up to 2^n
+	as the actual hash bucket size.  0 is a special value, meaning
+	the child networking namespace will share the initial networking
+	namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.
+
+	Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+	Default: 0
+
 UDP variables
 =============
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 1ff5603cddff..8800fa2f9401 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -168,6 +168,8 @@ struct inet_hashinfo {
 	/* The 2nd listener table hashed by local port and address */
 	unsigned int			lhash2_mask;
 	struct inet_listen_hashbucket	*lhash2;
+
+	bool				pernet;
 };
 
 static inline struct inet_hashinfo *tcp_or_dccp_get_hashinfo(const struct sock *sk)
@@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries);
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c7320ef356d9..6d9c01879027 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -170,6 +170,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	unsigned int sysctl_tcp_child_ehash_entries;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	unsigned long sysctl_tcp_comp_sack_slack_ns;
 	int sysctl_max_syn_backlog;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7cd4a6cc99fc..c548ca3e9b0e 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}
 
+	dccp_hashinfo.pernet = false;
+
 	rc = dccp_mib_init();
 	if (rc)
 		goto out_free_dccp_bhash2;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 643d3a7031eb..706abe7a3021 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
+
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries)
+{
+	struct inet_hashinfo *new_hashinfo;
+	int i;
+
+	new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
+	if (!new_hashinfo)
+		goto err;
+
+	new_hashinfo->ehash = kvmalloc_array(ehash_entries,
+					     sizeof(struct inet_ehash_bucket),
+					     GFP_KERNEL_ACCOUNT);
+	if (!new_hashinfo->ehash)
+		goto free_hashinfo;
+
+	new_hashinfo->ehash_mask = ehash_entries - 1;
+
+	if (inet_ehash_locks_alloc(new_hashinfo))
+		goto free_ehash;
+
+	for (i = 0; i < ehash_entries; i++)
+		INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
+
+	new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
+	new_hashinfo->bhash = hashinfo->bhash;
+	new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
+	new_hashinfo->bhash2 = hashinfo->bhash2;
+	new_hashinfo->bhash_size = hashinfo->bhash_size;
+
+	new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
+	new_hashinfo->lhash2 = hashinfo->lhash2;
+
+	new_hashinfo->pernet = true;
+
+	return new_hashinfo;
+
+free_ehash:
+	kvfree(new_hashinfo->ehash);
+free_hashinfo:
+	kfree(new_hashinfo);
+err:
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
+
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
+{
+	if (!hashinfo->pernet)
+		return;
+
+	inet_ehash_locks_free(hashinfo);
+	kvfree(hashinfo->ehash);
+	kfree(hashinfo);
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 47ccc343c9fb..a5d40acde9d6 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	inet_twsk_bind_unhash(tw, hashinfo);
 	spin_unlock(&bhead->lock);
 
-	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
+	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) {
+		inet_pernet_hashinfo_free(hashinfo);
 		kfree(tw->tw_dr);
+	}
 
 	inet_twsk_put(tw);
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5490c285668b..5ce6945d4802 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -39,6 +39,7 @@ static u32 u32_max_div_HZ = UINT_MAX / HZ;
 static int one_day_secs = 24 * 3600;
 static u32 fib_multipath_hash_fields_all_mask __maybe_unused =
 	FIB_MULTIPATH_HASH_FIELD_ALL_MASK;
+static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -382,6 +383,29 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
 	return ret;
 }
 
+static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_tcp_child_ehash_entries);
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
+	int tcp_ehash_entries;
+	struct ctl_table tbl;
+
+	tcp_ehash_entries = hinfo->ehash_mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global ehash.
+	 */
+	if (!net_eq(net, &init_net) && !hinfo->pernet)
+		tcp_ehash_entries *= -1;
+
+	tbl.data = &tcp_ehash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1321,6 +1345,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1         = SYSCTL_ZERO,
 		.extra2         = SYSCTL_ONE,
 	},
+	{
+		.procname	= "tcp_ehash_entries",
+		.data		= &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.mode		= 0444,
+		.proc_handler	= proc_tcp_ehash_entries,
+	},
+	{
+		.procname	= "tcp_child_ehash_entries",
+		.data		= &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &tcp_child_ehash_entries_max,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 306b94dedc8d..08baf7f7ca96 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4788,6 +4788,7 @@ void __init tcp_init(void)
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}
 
+	tcp_hashinfo.pernet = false;
 
 	cnt = tcp_hashinfo.ehash_mask + 1;
 	sysctl_tcp_max_orphans = cnt / 2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4a502d57d45..fed5ac256b18 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3110,15 +3110,44 @@ static void __net_exit tcp_sk_exit(struct net *net)
 		bpf_module_put(net->ipv4.tcp_congestion_control,
 			       net->ipv4.tcp_congestion_control->owner);
 	if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
+		inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
 		kfree(tcp_death_row);
 		net->ipv4.tcp_death_row = NULL;
 	}
 }
 
-static int __net_init tcp_sk_init(struct net *net)
+static void __net_init tcp_set_hashinfo(struct net *net)
 {
-	int cnt;
+	struct inet_hashinfo *hinfo;
+	unsigned int ehash_entries;
+	struct net *old_net;
+
+	if (net_eq(net, &init_net))
+		goto fallback;
+
+	old_net = current->nsproxy->net_ns;
+	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
+	if (!ehash_entries)
+		goto fallback;
+
+	ehash_entries = roundup_pow_of_two(ehash_entries);
+	hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
+	if (!hinfo) {
+		pr_warn("Failed to allocate TCP ehash (entries: %u) "
+			"for a netns, fallback to the global one\n",
+			ehash_entries);
+fallback:
+		hinfo = &tcp_hashinfo;
+		ehash_entries = tcp_hashinfo.ehash_mask + 1;
+	}
 
+	net->ipv4.tcp_death_row->hashinfo = hinfo;
+	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
+	net->ipv4.sysctl_max_syn_backlog = max(128U, ehash_entries / 128);
+}
+
+static int __net_init tcp_sk_init(struct net *net)
+{
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
@@ -3147,12 +3176,10 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
 	if (!net->ipv4.tcp_death_row)
 		return -ENOMEM;
+
 	refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
-	cnt = tcp_hashinfo.ehash_mask + 1;
-	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
-	net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
+	tcp_set_hashinfo(net);
 
-	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9168c5a33344..9b35752478a2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -349,6 +349,7 @@ EXPORT_SYMBOL_GPL(tcp_twsk_destructor);
 
 void tcp_twsk_purge(struct list_head *net_exit_list, int family)
 {
+	bool purged_once = false;
 	struct net *net;
 
 	list_for_each_entry(net, net_exit_list, exit_list) {
@@ -365,8 +366,12 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
 		    refcount_read(&net->ipv4.tcp_death_row->tw_refcount) == 1)
 			continue;
 
-		inet_twsk_purge(&tcp_hashinfo, family);
-		break;
+		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, family);
+		} else if (!purged_once) {
+			inet_twsk_purge(&tcp_hashinfo, family);
+			purged_once = true;
+		}
 	}
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_purge);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-08-30 19:15 ` [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
@ 2022-09-01 10:57   ` Paolo Abeni
  2022-09-01 21:25     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2022-09-01 10:57 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski
  Cc: Kuniyuki Iwashima, netdev

On Tue, 2022-08-30 at 12:15 -0700, Kuniyuki Iwashima wrote:
> We will soon introduce an optional per-netns ehash.
> 
> This means we cannot use tcp_hashinfo directly in most places.
> 
> Instead, access it via net->ipv4.tcp_death_row->hashinfo.
> 
> The access will be valid only while initialising tcp_hashinfo
> itself and creating/destroying each netns.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
>  net/core/filter.c                             |  5 +-
>  net/ipv4/esp4.c                               |  3 +-
>  net/ipv4/inet_hashtables.c                    |  2 +-
>  net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
>  net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
>  net/ipv4/tcp_diag.c                           | 18 +++++-
>  net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
>  net/ipv4/tcp_minisocks.c                      |  2 +-
>  net/ipv6/esp6.c                               |  3 +-
>  net/ipv6/inet6_hashtables.c                   |  4 +-
>  net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
>  net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
>  net/ipv6/tcp_ipv6.c                           | 16 ++---
>  net/mptcp/mptcp_diag.c                        |  7 ++-
>  15 files changed, 92 insertions(+), 62 deletions(-)
> 
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> index ddfe9208529a..f90bfba4b303 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> @@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
>  	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
>  }
>  
> -static void inet_inherit_port(struct inet_hashinfo *hash_info,
> -			      struct sock *lsk, struct sock *newsk)
> +static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
>  {
>  	local_bh_disable();
>  	__inet_inherit_port(lsk, newsk);
> @@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
>  						     ipv4.sysctl_tcp_window_scaling),
>  					   tp->window_clamp);
>  	neigh_release(n);
> -	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
> +	inet_inherit_port(lsk, newsk);
>  	csk_set_flag(csk, CSK_CONN_INLINE);
>  	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */

It looks to me that the above chunks are functionally a no-op, and I
think that omitting the 2 drivers from the v2:

https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/

should break mlx5/nfp inside a netns.  I don't understand why the above
is included and the latter are skipped?!?  I guess this is a question
mostly for Eric :)

> @@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
>  
>  int tcp_v4_early_demux(struct sk_buff *skb)
>  {
> +	struct net *net = dev_net(skb->dev);
>  	const struct iphdr *iph;
>  	const struct tcphdr *th;
>  	struct sock *sk;
> @@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
>  	if (th->doff < sizeof(struct tcphdr) / 4)
>  		return 0;
>  
> -	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
> +	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
>  				       iph->saddr, th->source,
>  				       iph->daddr, ntohs(th->dest),
>  				       skb->skb_iif, inet_sdif(skb));

/Me is thinking aloud...

I'm wondering if the above has some measurable negative effect for
large deployments using only the main netns?

Specifically, are net->ipv4.tcp_death_row and
net->ipv4.tcp_death_row->hashinfo already in the working set data for
established sockets?
Would the above increase the WSS by 2 cache-lines?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 10:57   ` Paolo Abeni
@ 2022-09-01 21:25     ` Kuniyuki Iwashima
  2022-09-01 21:30       ` Eric Dumazet
  2022-09-01 21:49       ` Jakub Kicinski
  0 siblings, 2 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-01 21:25 UTC (permalink / raw)
  To: pabeni, edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev

From:   Paolo Abeni <pabeni@redhat.com>
Date:   Thu, 01 Sep 2022 12:57:33 +0200
> On Tue, 2022-08-30 at 12:15 -0700, Kuniyuki Iwashima wrote:
> > We will soon introduce an optional per-netns ehash.
> > 
> > This means we cannot use tcp_hashinfo directly in most places.
> > 
> > Instead, access it via net->ipv4.tcp_death_row->hashinfo.
> > 
> > The access will be valid only while initialising tcp_hashinfo
> > itself and creating/destroying each netns.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
> >  net/core/filter.c                             |  5 +-
> >  net/ipv4/esp4.c                               |  3 +-
> >  net/ipv4/inet_hashtables.c                    |  2 +-
> >  net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
> >  net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
> >  net/ipv4/tcp_diag.c                           | 18 +++++-
> >  net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
> >  net/ipv4/tcp_minisocks.c                      |  2 +-
> >  net/ipv6/esp6.c                               |  3 +-
> >  net/ipv6/inet6_hashtables.c                   |  4 +-
> >  net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
> >  net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
> >  net/ipv6/tcp_ipv6.c                           | 16 ++---
> >  net/mptcp/mptcp_diag.c                        |  7 ++-
> >  15 files changed, 92 insertions(+), 62 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > index ddfe9208529a..f90bfba4b303 100644
> > --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > @@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
> >  	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
> >  }
> >  
> > -static void inet_inherit_port(struct inet_hashinfo *hash_info,
> > -			      struct sock *lsk, struct sock *newsk)
> > +static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
> >  {
> >  	local_bh_disable();
> >  	__inet_inherit_port(lsk, newsk);
> > @@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
> >  						     ipv4.sysctl_tcp_window_scaling),
> >  					   tp->window_clamp);
> >  	neigh_release(n);
> > -	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
> > +	inet_inherit_port(lsk, newsk);
> >  	csk_set_flag(csk, CSK_CONN_INLINE);
> >  	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
> 
> I looks to me that the above chunks are functionally a no-op and I
> think that omitting the 2 drivers from the v2:
> 
> https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/
> 
> should break mlx5/nfp inside a netns. I don't understand why including
> the above and skipping the latters?!? I guess is a question mostly for
> Eric :)

My best guess is that it's ok as long as the driver does not touch the
TCP stack deeply, and if it does, the driver developer must catch up
with the core changes so as not to burden maintainers...?

If so, I understand that take.  OTOH, I also don't want to break
anything when we know the change would do so.

So, I'm fine either staying as is or adding the changes back in v4.


> 
> > @@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
> >  
> >  int tcp_v4_early_demux(struct sk_buff *skb)
> >  {
> > +	struct net *net = dev_net(skb->dev);
> >  	const struct iphdr *iph;
> >  	const struct tcphdr *th;
> >  	struct sock *sk;
> > @@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
> >  	if (th->doff < sizeof(struct tcphdr) / 4)
> >  		return 0;
> >  
> > -	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
> > +	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
> >  				       iph->saddr, th->source,
> >  				       iph->daddr, ntohs(th->dest),
> >  				       skb->skb_iif, inet_sdif(skb));
> 
> /Me is thinking aloud...
> 
> I'm wondering if the above has some measurable negative effect for
> large deployments using only the main netns?
> 
> Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> >hashinfo already into the working set data for established socket?
> Would the above increase the WSS by 2 cache-lines?

Currently, the death_row and hashinfo are touched around tw sockets or
connect().  If connections on the deployment are short-lived or
frequently initiated by the host itself, they would be hot and included
in the WSS.

If the workload is a server with no active-close() sockets, or the
connections are long-lived, then they might not be included in the WSS.
But I think that's less likely than the former if the deployment is
large enough.

If this change had a large impact, then we could revert fbb8295248e1,
which converted net->ipv4.tcp_death_row into a pointer for 0dad4087a86a,
which tried to fire a TW timer after the netns is freed, but
0dad4087a86a has already been reverted.


> 
> Thanks!
> 
> Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 21:25     ` Kuniyuki Iwashima
@ 2022-09-01 21:30       ` Eric Dumazet
  2022-09-01 22:12         ` Kuniyuki Iwashima
  2022-09-01 21:49       ` Jakub Kicinski
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2022-09-01 21:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Paolo Abeni, David Miller, Jakub Kicinski, Kuniyuki Iwashima,
	netdev

On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Paolo Abeni <pabeni@redhat.com>

> > /Me is thinking aloud...
> >
> > I'm wondering if the above has some measurable negative effect for
> > large deployments using only the main netns?
> >
> > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > >hashinfo already into the working set data for established socket?
> > Would the above increase the WSS by 2 cache-lines?
>
> Currently, the death_row and hashinfo are touched around tw sockets or
> connect().  If connections on the deployment are short-lived or frequently
> initiated by itself, that would be host and included in WSS.
>
> If the workload is server and there's no active-close() socket or
> connections are long-lived, then it might not be included in WSS.
> But I think it's not likely than the former if the deployment is
> large enough.
>
> If this change had large impact, then we could revert fbb8295248e1
> which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> has already reverted.


Concern was fast path.

Each incoming packet does a socket lookup.

Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
field in 'struct net' might incur a new cache line miss.

Previously, the first cache line of tcp_hashinfo was enough to bring a
lot of fields into the cpu cache.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 21:25     ` Kuniyuki Iwashima
  2022-09-01 21:30       ` Eric Dumazet
@ 2022-09-01 21:49       ` Jakub Kicinski
  2022-09-01 22:19         ` Kuniyuki Iwashima
  1 sibling, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2022-09-01 21:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: pabeni, edumazet, davem, kuni1840, netdev

On Thu, 1 Sep 2022 14:25:20 -0700 Kuniyuki Iwashima wrote:
> > I looks to me that the above chunks are functionally a no-op and I
> > think that omitting the 2 drivers from the v2:
> > 
> > https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/
> > 
> > should break mlx5/nfp inside a netns. I don't understand why including
> > the above and skipping the latters?!? I guess is a question mostly for
> > Eric :)  
> 
> My best guess is that it's ok unless it does not touch TCP stack deeply
> and if it does, the driver developer must catch up with the core changes
> not to burden maintainers...?
> 
> If so, I understand that take.  OTOH, I also don't want to break anything
> when we know the change would do.
> 
> So, I'm fine to either stay as is or add the change in v4 again.

FWIW I share Paolo's concern. If we don't want the drivers to be
twiddling with the hash tables we should factor out that code to
a common helper in net/tls/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 21:30       ` Eric Dumazet
@ 2022-09-01 22:12         ` Kuniyuki Iwashima
  2022-09-03  0:44           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-01 22:12 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Thu, 1 Sep 2022 14:30:43 -0700
> On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Paolo Abeni <pabeni@redhat.com>
> 
> > > /Me is thinking aloud...
> > >
> > > I'm wondering if the above has some measurable negative effect for
> > > large deployments using only the main netns?
> > >
> > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > >hashinfo already into the working set data for established socket?
> > > Would the above increase the WSS by 2 cache-lines?
> >
> > Currently, the death_row and hashinfo are touched around tw sockets or
> > connect().  If connections on the deployment are short-lived or frequently
> > initiated by itself, that would be host and included in WSS.
> >
> > If the workload is server and there's no active-close() socket or
> > connections are long-lived, then it might not be included in WSS.
> > But I think it's not likely than the former if the deployment is
> > large enough.
> >
> > If this change had large impact, then we could revert fbb8295248e1
> > which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> > that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> > has already reverted.
> 
> 
> Concern was fast path.
> 
> Each incoming packet does a socket lookup.
> 
> Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> field in 'struct net' might inccurr a new cache line miss.
> 
> Previously, first cache line of tcp_info was enough to bring a lot of
> fields in cpu cache.

Ok, let me test on that if there could be regressions.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 21:49       ` Jakub Kicinski
@ 2022-09-01 22:19         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-01 22:19 UTC (permalink / raw)
  To: kuba; +Cc: davem, edumazet, kuni1840, kuniyu, netdev, pabeni

From:   Jakub Kicinski <kuba@kernel.org>
Date:   Thu, 1 Sep 2022 14:49:36 -0700
> On Thu, 1 Sep 2022 14:25:20 -0700 Kuniyuki Iwashima wrote:
> > > I looks to me that the above chunks are functionally a no-op and I
> > > think that omitting the 2 drivers from the v2:
> > > 
> > > https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/
> > > 
> > > should break mlx5/nfp inside a netns. I don't understand why including
> > > the above and skipping the latters?!? I guess is a question mostly for
> > > Eric :)  
> > 
> > My best guess is that it's ok unless it does not touch TCP stack deeply
> > and if it does, the driver developer must catch up with the core changes
> > not to burden maintainers...?
> > 
> > If so, I understand that take.  OTOH, I also don't want to break anything
> > when we know the change would do.
> > 
> > So, I'm fine to either stay as is or add the change in v4 again.
> 
> FWIW I share Paolo's concern. If we don't want the drivers to be
> twiddling with the hash tables we should factor out that code to
> a common helper in net/tls/

That makes sense.
For the moment, I'll add the changes back in v4.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-01 22:12         ` Kuniyuki Iwashima
@ 2022-09-03  0:44           ` Kuniyuki Iwashima
  2022-09-03  0:53             ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-03  0:44 UTC (permalink / raw)
  To: kuniyu; +Cc: davem, edumazet, kuba, kuni1840, netdev, pabeni

From:   Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Thu, 1 Sep 2022 15:12:16 -0700
> From:   Eric Dumazet <edumazet@google.com>
> Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From:   Paolo Abeni <pabeni@redhat.com>
> > 
> > > > /Me is thinking aloud...
> > > >
> > > > I'm wondering if the above has some measurable negative effect for
> > > > large deployments using only the main netns?
> > > >
> > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > >hashinfo already into the working set data for established socket?
> > > > Would the above increase the WSS by 2 cache-lines?
> > >
> > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > connect().  If connections on the deployment are short-lived or frequently
> > > initiated by itself, that would be host and included in WSS.
> > >
> > > If the workload is server and there's no active-close() socket or
> > > connections are long-lived, then it might not be included in WSS.
> > > But I think it's not likely than the former if the deployment is
> > > large enough.
> > >
> > > If this change had large impact, then we could revert fbb8295248e1
> > > which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> > > that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> > > has already reverted.
> > 
> > 
> > Concern was fast path.
> > 
> > Each incoming packet does a socket lookup.
> > 
> > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > field in 'struct net' might inccurr a new cache line miss.
> > 
> > Previously, first cache line of tcp_info was enough to bring a lot of
> > fields in cpu cache.
> 
> Ok, let me test on that if there could be regressions.

I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
and collected HW cache-related metrics with perf.

After the patch, the number of L1 misses seems to increase, but the
instructions per cycle also increase, and the cache miss rate did not
change.  Also, there was no performance regression for netperf.


Tested:

# cat perf_super_netperf
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/sys/vm/drop_caches

perf stat -a \
     -e cycles,instructions,cache-references,cache-misses,bus-cycles \
     -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
     -e dTLB-loads,dTLB-load-misses \
     -e LLC-loads,LLC-load-misses,LLC-stores \
     ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM

echo 1 > /proc/sys/kernel/nmi_watchdog


Before:

# ./perf_super_netperf
2929.81

 Performance counter stats for 'system wide':

   494,002,600,338      cycles                                                        (23.07%)
   241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
     6,303,603,008      cache-references                                              (38.45%)
     1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
     4,861,179,308      bus-cycles                                                    (46.15%)
    65,410,735,599      L1-dcache-loads                                               (46.15%)
    12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
    32,912,656,369      L1-dcache-stores                                              (30.77%)
    66,015,779,361      dTLB-loads                                                    (30.77%)
        81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
     2,946,386,949      LLC-loads                                                     (30.77%)
       257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
     1,183,820,461      LLC-stores                                                    (15.38%)

      62.132250590 seconds time elapsed


After:

# ./perf_super_netperf
2930.17

 Performance counter stats for 'system wide':

   479,595,776,631      cycles                                                        (23.07%)
   243,318,957,230      instructions              #    0.51  insn per cycle           (30.76%)
     6,169,892,840      cache-references                                              (38.46%)
     1,381,992,694      cache-misses              #   22.399 % of all cache refs      (46.15%)
     4,534,304,190      bus-cycles                                                    (46.16%)
    66,059,178,377      L1-dcache-loads                                               (46.17%)
    12,759,529,139      L1-dcache-load-misses     #   19.32% of all L1-dcache accesses  (30.78%)
    33,292,513,002      L1-dcache-stores                                              (30.78%)
    66,482,176,008      dTLB-loads                                                    (30.77%)
        72,877,970      dTLB-load-misses          #    0.11% of all dTLB cache accesses  (30.76%)
     2,984,881,101      LLC-loads                                                     (30.76%)
       234,747,930      LLC-load-misses           #    7.86% of all LL-cache accesses  (30.76%)
     1,165,606,022      LLC-stores                                                    (15.38%)

      62.110708964 seconds time elapsed


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  0:44           ` Kuniyuki Iwashima
@ 2022-09-03  0:53             ` Eric Dumazet
  2022-09-03  1:12               ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2022-09-03  0:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David Miller, Jakub Kicinski, Kuniyuki Iwashima, netdev,
	Paolo Abeni

On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> Date:   Thu, 1 Sep 2022 15:12:16 -0700
> > From:   Eric Dumazet <edumazet@google.com>
> > Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From:   Paolo Abeni <pabeni@redhat.com>
> > >
> > > > > /Me is thinking aloud...
> > > > >
> > > > > I'm wondering if the above has some measurable negative effect for
> > > > > large deployments using only the main netns?
> > > > >
> > > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > > >hashinfo already into the working set data for established socket?
> > > > > Would the above increase the WSS by 2 cache-lines?
> > > >
> > > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > > connect().  If connections on the deployment are short-lived or frequently
> > > > initiated by itself, that would be host and included in WSS.
> > > >
> > > > If the workload is server and there's no active-close() socket or
> > > > connections are long-lived, then it might not be included in WSS.
> > > > But I think it's not likely than the former if the deployment is
> > > > large enough.
> > > >
> > > > If this change had large impact, then we could revert fbb8295248e1
> > > > which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> > > > that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> > > > has already reverted.
> > >
> > >
> > > Concern was fast path.
> > >
> > > Each incoming packet does a socket lookup.
> > >
> > > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > > field in 'struct net' might inccurr a new cache line miss.
> > >
> > > Previously, first cache line of tcp_info was enough to bring a lot of
> > > fields in cpu cache.
> >
> > Ok, let me test on that if there could be regressions.
>
> I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
> and collected HW cache-related metrics with perf.
>
> After the patch the number of L1 miss seems to increase, but the
> instructions per cycle also increases, and cache miss rate did not
> change.  Also, there was not performance regression for netperf.
>
>
> Tested:
>
> # cat perf_super_netperf
> echo 0 > /proc/sys/kernel/nmi_watchdog
> echo 3 > /proc/sys/vm/drop_caches
>
> perf stat -a \
>      -e cycles,instructions,cache-references,cache-misses,bus-cycles \
>      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
>      -e dTLB-loads,dTLB-load-misses \
>      -e LLC-loads,LLC-load-misses,LLC-stores \
>      ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM
>
> echo 1 > /proc/sys/kernel/nmi_watchdog
>
>
> Before:
>
> # ./perf_super_netperf
> 2929.81
>
>  Performance counter stats for 'system wide':
>
>    494,002,600,338      cycles                                                        (23.07%)
>    241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
>      6,303,603,008      cache-references                                              (38.45%)
>      1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
>      4,861,179,308      bus-cycles                                                    (46.15%)
>     65,410,735,599      L1-dcache-loads                                               (46.15%)
>     12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
>     32,912,656,369      L1-dcache-stores                                              (30.77%)
>     66,015,779,361      dTLB-loads                                                    (30.77%)
>         81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
>      2,946,386,949      LLC-loads                                                     (30.77%)
>        257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
>      1,183,820,461      LLC-stores                                                    (15.38%)
>
>       62.132250590 seconds time elapsed
>

This test will not be able to see a difference really...

What is needed is to measure the latency when nothing at all is in the caches.

The vast majority of real-world TCP traffic is light or moderate.
Packets are received, and the cpu has to bring X cache lines into L1 in
order to process one packet.

We are slowly increasing X over time :/

pahole is your friend, more than a stress-test.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  0:53             ` Eric Dumazet
@ 2022-09-03  1:12               ` Kuniyuki Iwashima
  2022-09-03  1:44                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-03  1:12 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 2 Sep 2022 17:53:18 -0700
> On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date:   Thu, 1 Sep 2022 15:12:16 -0700
> > > From:   Eric Dumazet <edumazet@google.com>
> > > Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > > > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From:   Paolo Abeni <pabeni@redhat.com>
> > > >
> > > > > > /Me is thinking aloud...
> > > > > >
> > > > > > I'm wondering if the above has some measurable negative effect for
> > > > > > large deployments using only the main netns?
> > > > > >
> > > > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > > > >hashinfo already into the working set data for established socket?
> > > > > > Would the above increase the WSS by 2 cache-lines?
> > > > >
> > > > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > > > connect().  If connections on the deployment are short-lived or frequently
> > > > > initiated by itself, that would be host and included in WSS.
> > > > >
> > > > > If the workload is server and there's no active-close() socket or
> > > > > connections are long-lived, then it might not be included in WSS.
> > > > > But I think it's not likely than the former if the deployment is
> > > > > large enough.
> > > > >
> > > > > If this change had large impact, then we could revert fbb8295248e1
> > > > > which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> > > > > that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> > > > > has already reverted.
> > > >
> > > >
> > > > Concern was fast path.
> > > >
> > > > Each incoming packet does a socket lookup.
> > > >
> > > > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > > > field in 'struct net' might inccurr a new cache line miss.
> > > >
> > > > Previously, first cache line of tcp_info was enough to bring a lot of
> > > > fields in cpu cache.
> > >
> > > Ok, let me test on that if there could be regressions.
> >
> > I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
> > and collected HW cache-related metrics with perf.
> >
> > After the patch the number of L1 miss seems to increase, but the
> > instructions per cycle also increases, and cache miss rate did not
> > change.  Also, there was not performance regression for netperf.
> >
> >
> > Tested:
> >
> > # cat perf_super_netperf
> > echo 0 > /proc/sys/kernel/nmi_watchdog
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > perf stat -a \
> >      -e cycles,instructions,cache-references,cache-misses,bus-cycles \
> >      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
> >      -e dTLB-loads,dTLB-load-misses \
> >      -e LLC-loads,LLC-load-misses,LLC-stores \
> >      ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM
> >
> > echo 1 > /proc/sys/kernel/nmi_watchdog
> >
> >
> > Before:
> >
> > # ./perf_super_netperf
> > 2929.81
> >
> >  Performance counter stats for 'system wide':
> >
> >    494,002,600,338      cycles                                                        (23.07%)
> >    241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
> >      6,303,603,008      cache-references                                              (38.45%)
> >      1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
> >      4,861,179,308      bus-cycles                                                    (46.15%)
> >     65,410,735,599      L1-dcache-loads                                               (46.15%)
> >     12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
> >     32,912,656,369      L1-dcache-stores                                              (30.77%)
> >     66,015,779,361      dTLB-loads                                                    (30.77%)
> >         81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
> >      2,946,386,949      LLC-loads                                                     (30.77%)
> >        257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
> >      1,183,820,461      LLC-stores                                                    (15.38%)
> >
> >       62.132250590 seconds time elapsed
> >
> 
> This test will not be able to see a difference really...
> 
> What is needed is to measure the latency when nothing at all is in the caches.
> 
> Vast majority of real world TCP traffic is light or moderate.
> Packets are received and cpu has to bring X cache lines into L1 in
> order to process one packet.
> 
> We slowly are increasing X over time :/
> 
> pahole is your friend, more than a stress-test.

Here's the pahole result on my local build.  As Paolo said, we
need 2 cache lines for tcp_death_row and the hashinfo?

How about moving hashinfo to be the first member of struct
inet_timewait_death_row and converting net->ipv4.tcp_death_row to an
embedded struct instead of a pointer, so that we need only 1 cache
line to read hashinfo?

$ pahole -C netns_ipv4,inet_timewait_death_row vmlinux
struct netns_ipv4 {
	struct inet_timewait_death_row * tcp_death_row;  /*     0     8 */
	struct ctl_table_header *  forw_hdr;             /*     8     8 */
	struct ctl_table_header *  frags_hdr;            /*    16     8 */
	struct ctl_table_header *  ipv4_hdr;             /*    24     8 */
	struct ctl_table_header *  route_hdr;            /*    32     8 */
	struct ctl_table_header *  xfrm4_hdr;            /*    40     8 */
	struct ipv4_devconf *      devconf_all;          /*    48     8 */
	struct ipv4_devconf *      devconf_dflt;         /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	...
};
struct inet_timewait_death_row {
	refcount_t                 tw_refcount;          /*     0     4 */

	/* XXX 60 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	struct inet_hashinfo *     hashinfo __attribute__((__aligned__(64))); /*    64     8 */
	int                        sysctl_max_tw_buckets; /*    72     4 */

	/* size: 128, cachelines: 2, members: 3 */
	/* sum members: 16, holes: 1, sum holes: 60 */
	/* padding: 52 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 60 */
} __attribute__((__aligned__(64)));

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  1:12               ` Kuniyuki Iwashima
@ 2022-09-03  1:44                 ` Kuniyuki Iwashima
  2022-09-03  2:30                   ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-03  1:44 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuniyu, kuba, kuni1840, netdev, pabeni

From:   Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Fri, 2 Sep 2022 18:12:43 -0700
> From:   Eric Dumazet <edumazet@google.com>
> Date:   Fri, 2 Sep 2022 17:53:18 -0700
> > On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> > > Date:   Thu, 1 Sep 2022 15:12:16 -0700
> > > > From:   Eric Dumazet <edumazet@google.com>
> > > > Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > > > > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > From:   Paolo Abeni <pabeni@redhat.com>
> > > > >
> > > > > > > /Me is thinking aloud...
> > > > > > >
> > > > > > > I'm wondering if the above has some measurable negative effect for
> > > > > > > large deployments using only the main netns?
> > > > > > >
> > > > > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > > > > >hashinfo already into the working set data for established socket?
> > > > > > > Would the above increase the WSS by 2 cache-lines?
> > > > > >
> > > > > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > > > > connect().  If connections on the deployment are short-lived or frequently
> > > > > > initiated by itself, that would be host and included in WSS.
> > > > > >
> > > > > > If the workload is server and there's no active-close() socket or
> > > > > > connections are long-lived, then it might not be included in WSS.
> > > > > > But I think it's not likely than the former if the deployment is
> > > > > > large enough.
> > > > > >
> > > > > > If this change had large impact, then we could revert fbb8295248e1
> > > > > > which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
> > > > > > that tried to fire a TW timer after netns is freed, but 0dad4087a86a
> > > > > > has already reverted.
> > > > >
> > > > >
> > > > > Concern was fast path.
> > > > >
> > > > > Each incoming packet does a socket lookup.
> > > > >
> > > > > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > > > > field in 'struct net' might inccurr a new cache line miss.
> > > > >
> > > > > Previously, first cache line of tcp_info was enough to bring a lot of
> > > > > fields in cpu cache.
> > > >
> > > > Ok, let me test on that if there could be regressions.
> > >
> > > I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
> > > and collected HW cache-related metrics with perf.
> > >
> > > After the patch the number of L1 miss seems to increase, but the
> > > instructions per cycle also increases, and cache miss rate did not
> > > change.  Also, there was not performance regression for netperf.
> > >
> > >
> > > Tested:
> > >
> > > # cat perf_super_netperf
> > > echo 0 > /proc/sys/kernel/nmi_watchdog
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > perf stat -a \
> > >      -e cycles,instructions,cache-references,cache-misses,bus-cycles \
> > >      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
> > >      -e dTLB-loads,dTLB-load-misses \
> > >      -e LLC-loads,LLC-load-misses,LLC-stores \
> > >      ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM
> > >
> > > echo 1 > /proc/sys/kernel/nmi_watchdog
> > >
> > >
> > > Before:
> > >
> > > # ./perf_super_netperf
> > > 2929.81
> > >
> > >  Performance counter stats for 'system wide':
> > >
> > >    494,002,600,338      cycles                                                        (23.07%)
> > >    241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
> > >      6,303,603,008      cache-references                                              (38.45%)
> > >      1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
> > >      4,861,179,308      bus-cycles                                                    (46.15%)
> > >     65,410,735,599      L1-dcache-loads                                               (46.15%)
> > >     12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
> > >     32,912,656,369      L1-dcache-stores                                              (30.77%)
> > >     66,015,779,361      dTLB-loads                                                    (30.77%)
> > >         81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
> > >      2,946,386,949      LLC-loads                                                     (30.77%)
> > >        257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
> > >      1,183,820,461      LLC-stores                                                    (15.38%)
> > >
> > >       62.132250590 seconds time elapsed
> > >
> > 
> > This test will not be able to see a difference really...
> > 
> > What is needed is to measure the latency when nothing at all is in the caches.
> > 
> > Vast majority of real world TCP traffic is light or moderate.
> > Packets are received and cpu has to bring X cache lines into L1 in
> > order to process one packet.
> > 
> > We slowly are increasing X over time :/
> > 
> > pahole is your friend, more than a stress-test.
> 
> Here's pahole result on my local build.  As Paolo said, we
> need 2 cachelines for tcp_death_row and the hashinfo?
> 
> How about moving hashinfo as the first member of struct
> inet_timewait_death_row and convert it to just struct
> instead of pointer so that we need 1 cache line to read
> hashinfo?

Like this.

$ pahole -EC netns_ipv4 vmlinux
struct netns_ipv4 {
	struct inet_timewait_death_row {
		struct inet_hashinfo * hashinfo __attribute__((__aligned__(64)));        /*     0     8 */
		/* typedef refcount_t */ struct refcount_struct {
			/* typedef atomic_t */ struct {
				int counter;                                             /*     8     4 */
			} refs; /*     8     4 */
		} tw_refcount; /*     8     4 */
		int                sysctl_max_tw_buckets;                                /*    12     4 */
	} tcp_death_row __attribute__((__aligned__(64))) __attribute__((__aligned__(64))); /*     0    64 */

	/* XXX last struct has 48 bytes of padding */

	/* --- cacheline 1 boundary (64 bytes) --- */
...
} __attribute__((__aligned__(64)));


---8<---
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 6320a76cefdc..dee53193d258 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -32,16 +32,15 @@ struct ping_group_range {
 struct inet_hashinfo;
 
 struct inet_timewait_death_row {
-	refcount_t		tw_refcount;
-
 	struct inet_hashinfo 	*hashinfo ____cacheline_aligned_in_smp;
+	refcount_t		tw_refcount;
 	int			sysctl_max_tw_buckets;
 };
 
 struct tcp_fastopen_context;
 
 struct netns_ipv4 {
-	struct inet_timewait_death_row *tcp_death_row;
+	struct inet_timewait_death_row tcp_death_row;
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*forw_hdr;
---8<---


> 
> $ pahole -C netns_ipv4,inet_timewait_death_row vmlinux
> struct netns_ipv4 {
> 	struct inet_timewait_death_row * tcp_death_row;  /*     0     8 */
> 	struct ctl_table_header *  forw_hdr;             /*     8     8 */
> 	struct ctl_table_header *  frags_hdr;            /*    16     8 */
> 	struct ctl_table_header *  ipv4_hdr;             /*    24     8 */
> 	struct ctl_table_header *  route_hdr;            /*    32     8 */
> 	struct ctl_table_header *  xfrm4_hdr;            /*    40     8 */
> 	struct ipv4_devconf *      devconf_all;          /*    48     8 */
> 	struct ipv4_devconf *      devconf_dflt;         /*    56     8 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	...
> };
> struct inet_timewait_death_row {
> 	refcount_t                 tw_refcount;          /*     0     4 */
> 
> 	/* XXX 60 bytes hole, try to pack */
> 
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct inet_hashinfo *     hashinfo __attribute__((__aligned__(64))); /*    64     8 */
> 	int                        sysctl_max_tw_buckets; /*    72     4 */
> 
> 	/* size: 128, cachelines: 2, members: 3 */
> 	/* sum members: 16, holes: 1, sum holes: 60 */
> 	/* padding: 52 */
> 	/* forced alignments: 1, forced holes: 1, sum forced holes: 60 */
> } __attribute__((__aligned__(64)));

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  1:44                 ` Kuniyuki Iwashima
@ 2022-09-03  2:30                   ` Eric Dumazet
  2022-09-03  2:50                     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2022-09-03  2:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David Miller, Jakub Kicinski, Kuniyuki Iwashima, netdev,
	Paolo Abeni

On Fri, Sep 2, 2022 at 6:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> Date:   Fri, 2 Sep 2022 18:12:43 -0700
> > From:   Eric Dumazet <edumazet@google.com>
> > Date:   Fri, 2 Sep 2022 17:53:18 -0700
> > > On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > Date:   Thu, 1 Sep 2022 15:12:16 -0700
> > > > > From:   Eric Dumazet <edumazet@google.com>
> > > > > Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > > > > > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > From:   Paolo Abeni <pabeni@redhat.com>
> > > > > >
> > > > > > > > /Me is thinking aloud...
> > > > > > > >
> > > > > > > > I'm wondering if the above has some measurable negative effect for
> > > > > > > > large deployments using only the main netns?
> > > > > > > >
> > > > > > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > > > > > >hashinfo already into the working set data for established socket?
> > > > > > > > Would the above increase the WSS by 2 cache-lines?
> > > > > > >
> > > > > > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > > > > > connect().  If connections on the deployment are short-lived or frequently
> > > > > > > initiated by itself, that would be host and included in WSS.
> > > > > > >
> > > > > > > If the workload is a server and there's no active-close() socket or
> > > > > > > connections are long-lived, then it might not be included in the WSS.
> > > > > > > But I think that's less likely than the former if the deployment is
> > > > > > > large enough.
> > > > > > >
> > > > > > > If this change had a large impact, then we could revert fbb8295248e1,
> > > > > > > which converted net->ipv4.tcp_death_row into a pointer for 0dad4087a86a,
> > > > > > > which tried to fire a TW timer after the netns is freed, but 0dad4087a86a
> > > > > > > has already been reverted.
> > > > > >
> > > > > >
> > > > > > Concern was fast path.
> > > > > >
> > > > > > Each incoming packet does a socket lookup.
> > > > > >
> > > > > > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > > > > > field in 'struct net' might incur a new cache line miss.
> > > > > >
> > > > > > Previously, the first cache line of tcp_hashinfo was enough to bring a lot
> > > > > > of fields into the cpu cache.
> > > > >
> > > > > OK, let me test whether there could be regressions.
> > > >
> > > > I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
> > > > and collected HW cache-related metrics with perf.
> > > >
> > > > After the patch, the number of L1 misses seems to increase, but the
> > > > instructions per cycle also increase, and the cache miss rate did not
> > > > change.  Also, there was no performance regression for netperf.
> > > >
> > > >
> > > > Tested:
> > > >
> > > > # cat perf_super_netperf
> > > > echo 0 > /proc/sys/kernel/nmi_watchdog
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > >
> > > > perf stat -a \
> > > >      -e cycles,instructions,cache-references,cache-misses,bus-cycles \
> > > >      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
> > > >      -e dTLB-loads,dTLB-load-misses \
> > > >      -e LLC-loads,LLC-load-misses,LLC-stores \
> > > >      ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM
> > > >
> > > > echo 1 > /proc/sys/kernel/nmi_watchdog
> > > >
> > > >
> > > > Before:
> > > >
> > > > # ./perf_super_netperf
> > > > 2929.81
> > > >
> > > >  Performance counter stats for 'system wide':
> > > >
> > > >    494,002,600,338      cycles                                                        (23.07%)
> > > >    241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
> > > >      6,303,603,008      cache-references                                              (38.45%)
> > > >      1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
> > > >      4,861,179,308      bus-cycles                                                    (46.15%)
> > > >     65,410,735,599      L1-dcache-loads                                               (46.15%)
> > > >     12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
> > > >     32,912,656,369      L1-dcache-stores                                              (30.77%)
> > > >     66,015,779,361      dTLB-loads                                                    (30.77%)
> > > >         81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
> > > >      2,946,386,949      LLC-loads                                                     (30.77%)
> > > >        257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
> > > >      1,183,820,461      LLC-stores                                                    (15.38%)
> > > >
> > > >       62.132250590 seconds time elapsed
> > > >
> > >
> > > This test will not really be able to show a difference...
> > >
> > > What is needed is to measure the latency when nothing at all is in the caches.
> > >
> > > The vast majority of real-world TCP traffic is light or moderate.
> > > Packets are received and the cpu has to bring X cache lines into L1 in
> > > order to process one packet.
> > >
> > > We are slowly increasing X over time :/
> > >
> > > pahole is your friend, more than a stress-test.
> >
> > Here's the pahole result on my local build.  As Paolo said, we
> > need 2 cache lines for tcp_death_row and the hashinfo?
> >
> > How about moving hashinfo to be the first member of struct
> > inet_timewait_death_row and converting it to a plain struct
> > instead of a pointer, so that we need only 1 cache line to read
> > hashinfo?
>
> Like this.
>
> $ pahole -EC netns_ipv4 vmlinux
> struct netns_ipv4 {
>         struct inet_timewait_death_row {
>                 struct inet_hashinfo * hashinfo __attribute__((__aligned__(64)));        /*     0     8 */
>                 /* typedef refcount_t */ struct refcount_struct {
>                         /* typedef atomic_t */ struct {
>                                 int counter;                                             /*     8     4 */
>                         } refs; /*     8     4 */
>                 } tw_refcount; /*     8     4 */
>                 int                sysctl_max_tw_buckets;                                /*    12     4 */
>         } tcp_death_row __attribute__((__aligned__(64))) __attribute__((__aligned__(64))); /*     0    64 */
>
>         /* XXX last struct has 48 bytes of padding */
>
>         /* --- cacheline 1 boundary (64 bytes) --- */
> ...
> } __attribute__((__aligned__(64)));
>
>
> ---8<---
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 6320a76cefdc..dee53193d258 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -32,16 +32,15 @@ struct ping_group_range {
>  struct inet_hashinfo;
>
>  struct inet_timewait_death_row {
> -       refcount_t              tw_refcount;
> -
>         struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
> +       refcount_t              tw_refcount;

This would be very bad. tw_refcount would share a cache line with hashinfo.

false sharing is more problematic than a cache line miss in read mode.

>         int                     sysctl_max_tw_buckets;
>  };
>
>  struct tcp_fastopen_context;
>
>  struct netns_ipv4 {
> -       struct inet_timewait_death_row *tcp_death_row;
> +       struct inet_timewait_death_row tcp_death_row;
>
>  #ifdef CONFIG_SYSCTL
>         struct ctl_table_header *forw_hdr;
> ---8<---
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  2:30                   ` Eric Dumazet
@ 2022-09-03  2:50                     ` Kuniyuki Iwashima
  2022-09-03  3:16                       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-03  2:50 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 2 Sep 2022 19:30:31 -0700
> On Fri, Sep 2, 2022 at 6:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date:   Fri, 2 Sep 2022 18:12:43 -0700
> > > From:   Eric Dumazet <edumazet@google.com>
> > > Date:   Fri, 2 Sep 2022 17:53:18 -0700
> > > > On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From:   Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > > Date:   Thu, 1 Sep 2022 15:12:16 -0700
> > > > > > From:   Eric Dumazet <edumazet@google.com>
> > > > > > Date:   Thu, 1 Sep 2022 14:30:43 -0700
> > > > > > > On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > >
> > > > > > > > From:   Paolo Abeni <pabeni@redhat.com>
> > > > > > >
> > > > > > > > > /Me is thinking aloud...
> > > > > > > > >
> > > > > > > > > I'm wondering if the above has some measurable negative effect for
> > > > > > > > > large deployments using only the main netns?
> > > > > > > > >
> > > > > > > > > Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> > > > > > > > > >hashinfo already in the working set data for established sockets?
> > > > > > > > > Would the above increase the WSS by 2 cache-lines?
> > > > > > > >
> > > > > > > > Currently, the death_row and hashinfo are touched around tw sockets or
> > > > > > > > connect().  If connections on the deployment are short-lived or frequently
> > > > > > > > initiated by the host itself, they would be hot and included in the WSS.
> > > > > > > >
> > > > > > > > If the workload is a server and there's no active-close() socket or
> > > > > > > > connections are long-lived, then it might not be included in the WSS.
> > > > > > > > But I think that's less likely than the former if the deployment is
> > > > > > > > large enough.
> > > > > > > >
> > > > > > > > If this change had a large impact, then we could revert fbb8295248e1,
> > > > > > > > which converted net->ipv4.tcp_death_row into a pointer for 0dad4087a86a,
> > > > > > > > which tried to fire a TW timer after the netns is freed, but 0dad4087a86a
> > > > > > > > has already been reverted.
> > > > > > >
> > > > > > >
> > > > > > > Concern was fast path.
> > > > > > >
> > > > > > > Each incoming packet does a socket lookup.
> > > > > > >
> > > > > > > Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
> > > > > > > field in 'struct net' might incur a new cache line miss.
> > > > > > >
> > > > > > > Previously, the first cache line of tcp_hashinfo was enough to bring a lot
> > > > > > > of fields into the cpu cache.
> > > > > >
> > > > > > OK, let me test whether there could be regressions.
> > > > >
> > > > > I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
> > > > > and collected HW cache-related metrics with perf.
> > > > >
> > > > > After the patch, the number of L1 misses seems to increase, but the
> > > > > instructions per cycle also increase, and the cache miss rate did not
> > > > > change.  Also, there was no performance regression for netperf.
> > > > >
> > > > >
> > > > > Tested:
> > > > >
> > > > > # cat perf_super_netperf
> > > > > echo 0 > /proc/sys/kernel/nmi_watchdog
> > > > > echo 3 > /proc/sys/vm/drop_caches
> > > > >
> > > > > perf stat -a \
> > > > >      -e cycles,instructions,cache-references,cache-misses,bus-cycles \
> > > > >      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
> > > > >      -e dTLB-loads,dTLB-load-misses \
> > > > >      -e LLC-loads,LLC-load-misses,LLC-stores \
> > > > >      ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM
> > > > >
> > > > > echo 1 > /proc/sys/kernel/nmi_watchdog
> > > > >
> > > > >
> > > > > Before:
> > > > >
> > > > > # ./perf_super_netperf
> > > > > 2929.81
> > > > >
> > > > >  Performance counter stats for 'system wide':
> > > > >
> > > > >    494,002,600,338      cycles                                                        (23.07%)
> > > > >    241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
> > > > >      6,303,603,008      cache-references                                              (38.45%)
> > > > >      1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
> > > > >      4,861,179,308      bus-cycles                                                    (46.15%)
> > > > >     65,410,735,599      L1-dcache-loads                                               (46.15%)
> > > > >     12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
> > > > >     32,912,656,369      L1-dcache-stores                                              (30.77%)
> > > > >     66,015,779,361      dTLB-loads                                                    (30.77%)
> > > > >         81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
> > > > >      2,946,386,949      LLC-loads                                                     (30.77%)
> > > > >        257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
> > > > >      1,183,820,461      LLC-stores                                                    (15.38%)
> > > > >
> > > > >       62.132250590 seconds time elapsed
> > > > >
> > > >
> > > > This test will not really be able to show a difference...
> > > >
> > > > What is needed is to measure the latency when nothing at all is in the caches.
> > > >
> > > > The vast majority of real-world TCP traffic is light or moderate.
> > > > Packets are received and the cpu has to bring X cache lines into L1 in
> > > > order to process one packet.
> > > >
> > > > We are slowly increasing X over time :/
> > > >
> > > > pahole is your friend, more than a stress-test.
> > >
> > > Here's the pahole result on my local build.  As Paolo said, we
> > > need 2 cache lines for tcp_death_row and the hashinfo?
> > >
> > > How about moving hashinfo to be the first member of struct
> > > inet_timewait_death_row and converting it to a plain struct
> > > instead of a pointer, so that we need only 1 cache line to read
> > > hashinfo?
> >
> > Like this.
> >
> > $ pahole -EC netns_ipv4 vmlinux
> > struct netns_ipv4 {
> >         struct inet_timewait_death_row {
> >                 struct inet_hashinfo * hashinfo __attribute__((__aligned__(64)));        /*     0     8 */
> >                 /* typedef refcount_t */ struct refcount_struct {
> >                         /* typedef atomic_t */ struct {
> >                                 int counter;                                             /*     8     4 */
> >                         } refs; /*     8     4 */
> >                 } tw_refcount; /*     8     4 */
> >                 int                sysctl_max_tw_buckets;                                /*    12     4 */
> >         } tcp_death_row __attribute__((__aligned__(64))) __attribute__((__aligned__(64))); /*     0    64 */
> >
> >         /* XXX last struct has 48 bytes of padding */
> >
> >         /* --- cacheline 1 boundary (64 bytes) --- */
> > ...
> > } __attribute__((__aligned__(64)));
> >
> >
> > ---8<---
> > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> > index 6320a76cefdc..dee53193d258 100644
> > --- a/include/net/netns/ipv4.h
> > +++ b/include/net/netns/ipv4.h
> > @@ -32,16 +32,15 @@ struct ping_group_range {
> >  struct inet_hashinfo;
> >
> >  struct inet_timewait_death_row {
> > -       refcount_t              tw_refcount;
> > -
> >         struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
> > +       refcount_t              tw_refcount;
> 
> This would be very bad. tw_refcount would share a cache line with hashinfo.
> 
> false sharing is more problematic than a cache line miss in read mode.

Ah, exactly.
Then I will add ____cacheline_aligned_in_smp to tw_refcount, instead of
keeping the original order, to avoid cache line invalidation when
sysctl_max_tw_buckets is changed, though that wouldn't happen so often.

 struct inet_timewait_death_row {
-       refcount_t              tw_refcount;
-
-       struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
+       struct inet_hashinfo    *hashinfo;
+       refcount_t              tw_refcount ____cacheline_aligned_in_smp;
        int                     sysctl_max_tw_buckets;
 };


> 
> >         int                     sysctl_max_tw_buckets;
> >  };
> >
> >  struct tcp_fastopen_context;
> >
> >  struct netns_ipv4 {
> > -       struct inet_timewait_death_row *tcp_death_row;
> > +       struct inet_timewait_death_row tcp_death_row;
> >
> >  #ifdef CONFIG_SYSCTL
> >         struct ctl_table_header *forw_hdr;
> > ---8<---
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  2:50                     ` Kuniyuki Iwashima
@ 2022-09-03  3:16                       ` Eric Dumazet
  2022-09-03  3:25                         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2022-09-03  3:16 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David Miller, Jakub Kicinski, Kuniyuki Iwashima, netdev,
	Paolo Abeni

On Fri, Sep 2, 2022 at 7:50 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:

> Ah, exactly.
> Then I will add ____cacheline_aligned_in_smp to tw_refcount, instead of
> keeping the original order, to avoid cache line invalidation when
> sysctl_max_tw_buckets is changed, though that wouldn't happen so often.
>
>  struct inet_timewait_death_row {
> -       refcount_t              tw_refcount;
> -
> -       struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
> +       struct inet_hashinfo    *hashinfo;
> +       refcount_t              tw_refcount ____cacheline_aligned_in_smp;
>         int                     sysctl_max_tw_buckets;

This would move sysctl_max_tw_buckets into a separate/dedicated cache line :/

-->

{
   refcount_t              tw_refcount;
   /* Padding to avoid false sharing, tw_refcount can be often written */
    struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
   int                     sysctl_max_tw_buckets;

  .. other read only fields could fit here.

>  };

Explicit alignment of the structure or of its first field is not needed;
they will already be cache line aligned.

^ permalink raw reply	[flat|nested] 20+ messages in thread
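
To make the cache-line placement concrete, here is a minimal sketch under
the same assumptions as above (64-byte cache line, __attribute__((aligned(64)))
standing in for ____cacheline_aligned_in_smp, hypothetical struct and field
names), comparing the ordering Eric objected to with the ordering he
suggests:

#include <stdio.h>
#include <stddef.h>

struct fake_hashinfo;                            /* opaque stand-in */

/* Ordering from the earlier diff: tw_refcount right next to hashinfo. */
struct layout_proposed {
        struct fake_hashinfo *hashinfo __attribute__((aligned(64)));
        int tw_refcount;                         /* written often */
        int sysctl_max_tw_buckets;
};

/* Ordering suggested above: tw_refcount alone, read-mostly fields together. */
struct layout_suggested {
        int tw_refcount;
        struct fake_hashinfo *hashinfo __attribute__((aligned(64)));
        int sysctl_max_tw_buckets;
};

/* Which 64-byte cache line a field falls into. */
#define CLINE(type, field)      (offsetof(struct type, field) / 64)

int main(void)
{
        printf("proposed : hashinfo %zu  tw_refcount %zu  max_tw_buckets %zu\n",
               CLINE(layout_proposed, hashinfo),
               CLINE(layout_proposed, tw_refcount),
               CLINE(layout_proposed, sysctl_max_tw_buckets));
        printf("suggested: hashinfo %zu  tw_refcount %zu  max_tw_buckets %zu\n",
               CLINE(layout_suggested, hashinfo),
               CLINE(layout_suggested, tw_refcount),
               CLINE(layout_suggested, sysctl_max_tw_buckets));
        return 0;
}

The first layout prints 0, 0, 0: hashinfo and the frequently written
tw_refcount share a cache line, so every timewait refcount update
invalidates the line the socket-lookup fast path reads.  The second prints
1, 0, 1: tw_refcount gets cache line 0 to itself, while the read-mostly
hashinfo and sysctl_max_tw_buckets sit together on cache line 1.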

* Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
  2022-09-03  3:16                       ` Eric Dumazet
@ 2022-09-03  3:25                         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 20+ messages in thread
From: Kuniyuki Iwashima @ 2022-09-03  3:25 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 2 Sep 2022 20:16:11 -0700
> On Fri, Sep 2, 2022 at 7:50 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> 
> > Ah, exactly.
> > Then I will add ____cacheline_aligned_in_smp to tw_refcount, instead of
> > keeping the original order, to avoid cache line invalidation when
> > sysctl_max_tw_buckets is changed, though that wouldn't happen so often.
> >
> >  struct inet_timewait_death_row {
> > -       refcount_t              tw_refcount;
> > -
> > -       struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
> > +       struct inet_hashinfo    *hashinfo;
> > +       refcount_t              tw_refcount ____cacheline_aligned_in_smp;
> >         int                     sysctl_max_tw_buckets;
> 
> This would move sysctl_max_tw_buckets into a separate/dedicated cache line :/
> 
> -->
> 
> {
>    refcount_t              tw_refcount;
>    /* Padding to avoid false sharing, tw_refcount can be often written */
>     struct inet_hashinfo    *hashinfo ____cacheline_aligned_in_smp;
>    int                     sysctl_max_tw_buckets;
> 
>   .. other read only fields could fit here.
> 
> >  };
> 
> Explicit alignment of the structure or of its first field is not needed;
> they will already be cache line aligned.

I got it.  I'll add that change.
Thank you so much!

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-09-03  3:25 UTC | newest]

Thread overview: 20+ messages
2022-08-30 19:15 [PATCH v3 net-next 0/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
2022-08-30 19:15 ` [PATCH v3 net-next 1/5] tcp: Clean up some functions Kuniyuki Iwashima
2022-08-30 19:15 ` [PATCH v3 net-next 2/5] tcp: Set NULL to sk->sk_prot->h.hashinfo Kuniyuki Iwashima
2022-08-30 19:15 ` [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
2022-09-01 10:57   ` Paolo Abeni
2022-09-01 21:25     ` Kuniyuki Iwashima
2022-09-01 21:30       ` Eric Dumazet
2022-09-01 22:12         ` Kuniyuki Iwashima
2022-09-03  0:44           ` Kuniyuki Iwashima
2022-09-03  0:53             ` Eric Dumazet
2022-09-03  1:12               ` Kuniyuki Iwashima
2022-09-03  1:44                 ` Kuniyuki Iwashima
2022-09-03  2:30                   ` Eric Dumazet
2022-09-03  2:50                     ` Kuniyuki Iwashima
2022-09-03  3:16                       ` Eric Dumazet
2022-09-03  3:25                         ` Kuniyuki Iwashima
2022-09-01 21:49       ` Jakub Kicinski
2022-09-01 22:19         ` Kuniyuki Iwashima
2022-08-30 19:15 ` [PATCH v3 net-next 4/5] tcp: Save unnecessary inet_twsk_purge() calls Kuniyuki Iwashima
2022-08-30 19:15 ` [PATCH v3 net-next 5/5] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
