netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v4 0/2] tcp: Update bind bucket state on port release
@ 2025-09-13 10:09 Jakub Sitnicki
  2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
  2025-09-13 10:09 ` [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
  0 siblings, 2 replies; 8+ messages in thread
From: Jakub Sitnicki @ 2025-09-13 10:09 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
	Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine

TL;DR
-----

This is another take on addressing the issue we already raised earlier [1].

This time around, instead of trying to relax the bind-conflict checks in
connect(), we make an attempt to fix the tcp bind bucket state accounting.

The goal of this patch set is to make the bind buckets return to "port
reusable by ephemeral connections" state when all sockets blocking the port
from reuse get unhashed.

Changelog
---------
Changes in v4:
- Drop redundant sk_is_connect_bind helper doc comment
- Link to v3: https://lore.kernel.org/r/20250910-update-bind-bucket-state-on-unhash-v3-0-023caaf4ae3c@cloudflare.com

Changes in v3:
- Move the flag from inet_flags to sk_userlocks (Kuniyuki)
- Rename the flag from AUTOBIND to CONNECT_BIND to avoid a name clash (Kuniyuki)
- Drop unreachable code for sk_state == TCP_NEW_SYN_RECV (Kuniyuki)
- Move the helper to inet_hashtables where it's used
- Reword patch 1 description for conciseness
- Link to v2: https://lore.kernel.org/r/20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com

Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com

Situation
---------

We observe the following scenario in production:

                                                  inet_bind_bucket
                                                state for port 54321
                                                --------------------

                                                (bucket doesn't exist)

// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
                                                tb->fastreuse = -1
                                                tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()

                                                // bucket remains in this
                                                // state even though port
                                                // was released by s2
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)

Problem
-------

We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.

The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.

Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, the reality is that such implementation is not
actually uncommon. Trying to fix each and every such program is like
playing whack-a-mole.

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

        dialer := &net.Dialer{
                LocalAddr: &net.TCPAddr{IP: srcIP},
        }
        conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

        $ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.

Solution
--------

Please see the description in patch 1.

[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com

Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Jakub Sitnicki (2):
      tcp: Update bind bucket state on port release
      selftests/net: Test tcp port reuse after unbinding a socket

 include/net/inet_connection_sock.h           |   5 +-
 include/net/inet_hashtables.h                |   2 +
 include/net/inet_timewait_sock.h             |   3 +-
 include/net/sock.h                           |   4 +
 net/ipv4/inet_connection_sock.c              |  12 +-
 net/ipv4/inet_hashtables.c                   |  40 ++++-
 net/ipv4/inet_timewait_sock.c                |   1 +
 tools/testing/selftests/net/Makefile         |   1 +
 tools/testing/selftests/net/tcp_port_share.c | 258 +++++++++++++++++++++++++++
 9 files changed, 318 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
  2025-09-13 10:09 [PATCH net-next v4 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
@ 2025-09-13 10:09 ` Jakub Sitnicki
  2025-09-16  5:45   ` Kuniyuki Iwashima
  2025-09-16 10:14   ` Paolo Abeni
  2025-09-13 10:09 ` [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
  1 sibling, 2 replies; 8+ messages in thread
From: Jakub Sitnicki @ 2025-09-13 10:09 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
	Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine

Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.

In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.

The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.

Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_connection_sock.h |  5 +++--
 include/net/inet_hashtables.h      |  2 ++
 include/net/inet_timewait_sock.h   |  3 ++-
 include/net/sock.h                 |  4 ++++
 net/ipv4/inet_connection_sock.c    | 12 ++++++++----
 net/ipv4/inet_hashtables.c         | 40 +++++++++++++++++++++++++++++++++++++-
 net/ipv4/inet_timewait_sock.c      |  1 +
 7 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 0737d8e178dd..b4b886647607 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -316,8 +316,9 @@ int inet_csk_listen_start(struct sock *sk);
 void inet_csk_listen_stop(struct sock *sk);
 
 /* update the fast reuse flag when adding a socket */
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
-			       struct sock *sk);
+void inet_csk_update_fastreuse(const struct sock *sk,
+			       struct inet_bind_bucket *tb,
+			       struct inet_bind2_bucket *tb2);
 
 struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a3b32241c2f2..1875853e97a4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -108,6 +108,8 @@ struct inet_bind2_bucket {
 	struct hlist_node	bhash_node;
 	/* List of sockets hashed to this bucket */
 	struct hlist_head	owners;
+	signed char		fastreuse;
+	signed char		fastreuseport;
 };
 
 static inline struct net *ib_net(const struct inet_bind_bucket *ib)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 67a313575780..baafef24318e 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,7 +70,8 @@ struct inet_timewait_sock {
 	unsigned int		tw_transparent  : 1,
 				tw_flowlabel	: 20,
 				tw_usec_ts	: 1,
-				tw_pad		: 2,	/* 2 bits hole */
+				tw_connect_bind	: 1,
+				tw_pad		: 1,	/* 1 bit hole */
 				tw_tos		: 8;
 	u32			tw_txhash;
 	u32			tw_priority;
diff --git a/include/net/sock.h b/include/net/sock.h
index 896bec2d2176..15d9fa438337 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1495,6 +1495,10 @@ static inline int __sk_prot_rehash(struct sock *sk)
 
 #define SOCK_BINDADDR_LOCK	4
 #define SOCK_BINDPORT_LOCK	8
+/**
+ * define SOCK_CONNECT_BIND - &sock->sk_userlocks flag for auto-bind at connect() time
+ */
+#define SOCK_CONNECT_BIND	16
 
 struct socket_alloc {
 	struct socket socket;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 142ff8d86fc2..cdd1e12aac8c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -423,7 +423,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 }
 
 static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
-				     struct sock *sk)
+				     const struct sock *sk)
 {
 	if (tb->fastreuseport <= 0)
 		return 0;
@@ -453,8 +453,9 @@ static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
 				    ipv6_only_sock(sk), true, false);
 }
 
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
-			       struct sock *sk)
+void inet_csk_update_fastreuse(const struct sock *sk,
+			       struct inet_bind_bucket *tb,
+			       struct inet_bind2_bucket *tb2)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 
@@ -501,6 +502,9 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 			tb->fastreuseport = 0;
 		}
 	}
+
+	tb2->fastreuse = tb->fastreuse;
+	tb2->fastreuseport = tb->fastreuseport;
 }
 
 /* Obtain a reference to a local port for the given sock,
@@ -582,7 +586,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	}
 
 success:
-	inet_csk_update_fastreuse(tb, sk);
+	inet_csk_update_fastreuse(sk, tb, tb2);
 
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, tb2, port);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ef4ccfd46ff6..b9979508d5da 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -58,6 +58,14 @@ static u32 sk_ehashfn(const struct sock *sk)
 			    sk->sk_daddr, sk->sk_dport);
 }
 
+static bool sk_is_connect_bind(const struct sock *sk)
+{
+	if (sk->sk_state == TCP_TIME_WAIT)
+		return inet_twsk(sk)->tw_connect_bind;
+	else
+		return sk->sk_userlocks & SOCK_CONNECT_BIND;
+}
+
 /*
  * Allocate and initialize a new local port bind bucket.
  * The bindhash mutex for snum's hash chain must be held here.
@@ -87,10 +95,22 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
  */
 void inet_bind_bucket_destroy(struct inet_bind_bucket *tb)
 {
+	const struct inet_bind2_bucket *tb2;
+
 	if (hlist_empty(&tb->bhash2)) {
 		hlist_del_rcu(&tb->node);
 		kfree_rcu(tb, rcu);
+		return;
 	}
+
+	if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+		return;
+	hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
+		if (tb2->fastreuse != -1 || tb2->fastreuseport != -1)
+			return;
+	}
+	tb->fastreuse = -1;
+	tb->fastreuseport = -1;
 }
 
 bool inet_bind_bucket_match(const struct inet_bind_bucket *tb, const struct net *net,
@@ -121,6 +141,8 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb2,
 #else
 	tb2->rcv_saddr = sk->sk_rcv_saddr;
 #endif
+	tb2->fastreuse = 0;
+	tb2->fastreuseport = 0;
 	INIT_HLIST_HEAD(&tb2->owners);
 	hlist_add_head(&tb2->node, &head->chain);
 	hlist_add_head(&tb2->bhash_node, &tb->bhash2);
@@ -143,11 +165,23 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
 /* Caller must hold hashbucket lock for this tb with local BH disabled */
 void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
 {
+	const struct sock *sk;
+
 	if (hlist_empty(&tb->owners)) {
 		__hlist_del(&tb->node);
 		__hlist_del(&tb->bhash_node);
 		kmem_cache_free(cachep, tb);
+		return;
+	}
+
+	if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+		return;
+	sk_for_each_bound(sk, &tb->owners) {
+		if (!sk_is_connect_bind(sk))
+			return;
 	}
+	tb->fastreuse = -1;
+	tb->fastreuseport = -1;
 }
 
 static bool inet_bind2_bucket_addr_match(const struct inet_bind2_bucket *tb2,
@@ -191,6 +225,7 @@ static void __inet_put_port(struct sock *sk)
 	tb = inet_csk(sk)->icsk_bind_hash;
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
+	sk->sk_userlocks &= ~SOCK_CONNECT_BIND;
 
 	spin_lock(&head2->lock);
 	if (inet_csk(sk)->icsk_bind2_hash) {
@@ -277,7 +312,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		}
 	}
 	if (update_fastreuse)
-		inet_csk_update_fastreuse(tb, child);
+		inet_csk_update_fastreuse(child, tb, tb2);
 	inet_bind_hash(child, tb, tb2, port);
 	spin_unlock(&head2->lock);
 	spin_unlock(&head->lock);
@@ -1136,6 +1171,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 					       head2, tb, sk);
 		if (!tb2)
 			goto error;
+		tb2->fastreuse = -1;
+		tb2->fastreuseport = -1;
 	}
 
 	/* Here we want to add a little bit of randomness to the next source
@@ -1148,6 +1185,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	/* Head lock still held and bh's disabled */
 	inet_bind_hash(sk, tb, tb2, port);
+	sk->sk_userlocks |= SOCK_CONNECT_BIND;
 
 	if (sk_unhashed(sk)) {
 		inet_sk(sk)->inet_sport = htons(port);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b5426b8ee92..3d51f751876d 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -207,6 +207,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_hash	    = sk->sk_hash;
 		tw->tw_ipv6only	    = 0;
 		tw->tw_transparent  = inet_test_bit(TRANSPARENT, sk);
+		tw->tw_connect_bind = !!(sk->sk_userlocks & SOCK_CONNECT_BIND);
 		tw->tw_prot	    = sk->sk_prot_creator;
 		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
 		twsk_net_set(tw, sock_net(sk));

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket
  2025-09-13 10:09 [PATCH net-next v4 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
  2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
@ 2025-09-13 10:09 ` Jakub Sitnicki
  2025-09-16  5:50   ` Kuniyuki Iwashima
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Sitnicki @ 2025-09-13 10:09 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
	Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine

Exercise the scenario described in detail in the cover letter:

  1) socket A: connect() from ephemeral port X
  2) socket B: explicitly bind() to port X
  3) check that port X is now excluded from ephemeral ports
  4) close socket B to release the port bind
  5) socket C: connect() from ephemeral port X

As well as a corner case to test that the connect-bind flag is cleared:

  1) connect() from ephemeral port X
  2) disconnect the socket with connect(AF_UNSPEC)
  3) bind() it explicitly to port X
  4) check that port X is now excluded from ephemeral ports

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/net/Makefile         |   1 +
 tools/testing/selftests/net/tcp_port_share.c | 258 +++++++++++++++++++++++++++
 2 files changed, 259 insertions(+)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index ae1afe75bc86..5d9d96515c4a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -122,6 +122,7 @@ TEST_PROGS += broadcast_pmtu.sh
 TEST_PROGS += ipv6_force_forwarding.sh
 TEST_GEN_PROGS += ipv6_fragmentation
 TEST_PROGS += route_hint.sh
+TEST_GEN_PROGS += tcp_port_share
 
 # YNL files, must be before "include ..lib.mk"
 YNL_GEN_FILES := busy_poller
diff --git a/tools/testing/selftests/net/tcp_port_share.c b/tools/testing/selftests/net/tcp_port_share.c
new file mode 100644
index 000000000000..4c39d599dfce
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_port_share.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2025 Cloudflare, Inc.
+
+/* Tests for TCP port sharing (bind bucket reuse). */
+
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdlib.h>
+
+#include "../kselftest_harness.h"
+
+#define DST_PORT 30000
+#define SRC_PORT 40000
+
+struct sockaddr_inet {
+	union {
+		struct sockaddr_storage ss;
+		struct sockaddr_in6 v6;
+		struct sockaddr_in v4;
+		struct sockaddr sa;
+	};
+	socklen_t len;
+	char str[INET6_ADDRSTRLEN + __builtin_strlen("[]:65535") + 1];
+};
+
+const int one = 1;
+
+static int disconnect(int fd)
+{
+	return connect(fd, &(struct sockaddr){ AF_UNSPEC }, sizeof(struct sockaddr));
+}
+
+static int getsockname_port(int fd)
+{
+	struct sockaddr_inet addr = {};
+	int err;
+
+	addr.len = sizeof(addr);
+	err = getsockname(fd, &addr.sa, &addr.len);
+	if (err)
+		return -1;
+
+	switch (addr.sa.sa_family) {
+	case AF_INET:
+		return ntohs(addr.v4.sin_port);
+	case AF_INET6:
+		return ntohs(addr.v6.sin6_port);
+	default:
+		errno = EAFNOSUPPORT;
+		return -1;
+	}
+}
+
+static void make_inet_addr(int af, const char *ip, __u16 port,
+			   struct sockaddr_inet *addr)
+{
+	const char *fmt = "";
+
+	memset(addr, 0, sizeof(*addr));
+
+	switch (af) {
+	case AF_INET:
+		addr->len = sizeof(addr->v4);
+		addr->v4.sin_family = af;
+		addr->v4.sin_port = htons(port);
+		inet_pton(af, ip, &addr->v4.sin_addr);
+		fmt = "%s:%hu";
+		break;
+	case AF_INET6:
+		addr->len = sizeof(addr->v6);
+		addr->v6.sin6_family = af;
+		addr->v6.sin6_port = htons(port);
+		inet_pton(af, ip, &addr->v6.sin6_addr);
+		fmt = "[%s]:%hu";
+		break;
+	}
+
+	snprintf(addr->str, sizeof(addr->str), fmt, ip, port);
+}
+
+FIXTURE(tcp_port_share) {};
+
+FIXTURE_VARIANT(tcp_port_share) {
+	int domain;
+	/* IP to listen on and connect to */
+	const char *dst_ip;
+	/* Primary IP to connect from */
+	const char *src1_ip;
+	/* Secondary IP to connect from */
+	const char *src2_ip;
+	/* IP to bind to in order to block the source port */
+	const char *bind_ip;
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv4) {
+	.domain = AF_INET,
+	.dst_ip = "127.0.0.1",
+	.src1_ip = "127.1.1.1",
+	.src2_ip = "127.2.2.2",
+	.bind_ip = "127.3.3.3",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv6) {
+	.domain = AF_INET6,
+	.dst_ip = "::1",
+	.src1_ip = "2001:db8::1",
+	.src2_ip = "2001:db8::2",
+	.bind_ip = "2001:db8::3",
+};
+
+FIXTURE_SETUP(tcp_port_share)
+{
+	int sc;
+
+	ASSERT_EQ(unshare(CLONE_NEWNET), 0);
+	ASSERT_EQ(system("ip link set dev lo up"), 0);
+	ASSERT_EQ(system("ip addr add dev lo 2001:db8::1/32 nodad"), 0);
+	ASSERT_EQ(system("ip addr add dev lo 2001:db8::2/32 nodad"), 0);
+	ASSERT_EQ(system("ip addr add dev lo 2001:db8::3/32 nodad"), 0);
+
+	sc = open("/proc/sys/net/ipv4/ip_local_port_range", O_WRONLY);
+	ASSERT_GE(sc, 0);
+	ASSERT_GT(dprintf(sc, "%hu %hu\n", SRC_PORT, SRC_PORT), 0);
+	ASSERT_EQ(close(sc), 0);
+}
+
+FIXTURE_TEARDOWN(tcp_port_share) {}
+
+/* Verify that an ephemeral port becomes available again after the socket
+ * bound to it and blocking it from reuse is closed.
+ */
+TEST_F(tcp_port_share, can_reuse_port_after_bind_and_close)
+{
+	const typeof(variant) v = variant;
+	struct sockaddr_inet addr;
+	int c1, c2, ln, pb;
+
+	/* Listen on <dst_ip>:<DST_PORT> */
+	ln = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(ln, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	ASSERT_EQ(bind(ln, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+	ASSERT_EQ(listen(ln, 2), 0);
+
+	/* Connect from <src1_ip>:<SRC_PORT> */
+	c1 = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(c1, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(c1, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->src1_ip, 0, &addr);
+	ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	ASSERT_EQ(connect(c1, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+	ASSERT_EQ(getsockname_port(c1), SRC_PORT);
+
+	/* Bind to <bind_ip>:<SRC_PORT>. Block the port from reuse. */
+	pb = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(pb, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(pb, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+	ASSERT_EQ(bind(pb, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	/* Try to connect from <src2_ip>:<SRC_PORT>. Expect failure. */
+	c2 = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(c2, 0) TH_LOG("socket");
+	ASSERT_EQ(setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->src2_ip, 0, &addr);
+	ASSERT_EQ(bind(c2, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	ASSERT_EQ(connect(c2, &addr.sa, addr.len), -1) TH_LOG("connect(%s)", addr.str);
+	ASSERT_EQ(errno, EADDRNOTAVAIL) TH_LOG("%m");
+
+	/* Unbind from <bind_ip>:<SRC_PORT>. Unblock the port for reuse. */
+	ASSERT_EQ(close(pb), 0);
+
+	/* Connect again from <src2_ip>:<SRC_PORT> */
+	EXPECT_EQ(connect(c2, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+	EXPECT_EQ(getsockname_port(c2), SRC_PORT);
+
+	ASSERT_EQ(close(c2), 0);
+	ASSERT_EQ(close(c1), 0);
+	ASSERT_EQ(close(ln), 0);
+}
+
+/* Verify that a socket auto-bound during connect() blocks port reuse after
+ * disconnect (connect(AF_UNSPEC)) followed by an explicit port bind().
+ */
+TEST_F(tcp_port_share, port_block_after_disconnect)
+{
+	const typeof(variant) v = variant;
+	struct sockaddr_inet addr;
+	int c1, c2, ln, pb;
+
+	/* Listen on <dst_ip>:<DST_PORT> */
+	ln = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(ln, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	ASSERT_EQ(bind(ln, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+	ASSERT_EQ(listen(ln, 2), 0);
+
+	/* Connect from <src1_ip>:<SRC_PORT> */
+	c1 = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(c1, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(c1, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->src1_ip, 0, &addr);
+	ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	ASSERT_EQ(connect(c1, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+	ASSERT_EQ(getsockname_port(c1), SRC_PORT);
+
+	/* Disconnect the socket and bind it to <bind_ip>:<SRC_PORT> to block the port */
+	ASSERT_EQ(disconnect(c1), 0) TH_LOG("disconnect: %m");
+	ASSERT_EQ(setsockopt(c1, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+	ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	/* Trigger port-addr bucket state update with another bind() and close() */
+	pb = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(pb, 0) TH_LOG("socket(): %m");
+	ASSERT_EQ(setsockopt(pb, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+	ASSERT_EQ(bind(pb, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	ASSERT_EQ(close(pb), 0);
+
+	/* Connect from <src2_ip>:<SRC_PORT>. Expect failure. */
+	c2 = socket(v->domain, SOCK_STREAM, 0);
+	ASSERT_GE(c2, 0) TH_LOG("socket: %m");
+	ASSERT_EQ(setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+	make_inet_addr(v->domain, v->src2_ip, 0, &addr);
+	ASSERT_EQ(bind(c2, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+	make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+	EXPECT_EQ(connect(c2, &addr.sa, addr.len), -1) TH_LOG("connect(%s)", addr.str);
+	EXPECT_EQ(errno, EADDRNOTAVAIL) TH_LOG("%m");
+
+	ASSERT_EQ(close(c2), 0);
+	ASSERT_EQ(close(c1), 0);
+	ASSERT_EQ(close(ln), 0);
+}
+
+TEST_HARNESS_MAIN

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
  2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
@ 2025-09-16  5:45   ` Kuniyuki Iwashima
  2025-09-16 10:14   ` Paolo Abeni
  1 sibling, 0 replies; 8+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-16  5:45 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine

On Sat, Sep 13, 2025 at 3:09 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
> in that state until all sockets are removed and the bucket is destroyed.
>
> In this state, the bucket is skipped during ephemeral port selection in
> connect(). For applications using a reduced ephemeral port
> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
> exhaustion since blocked buckets are excluded from reuse.
>
> The reason the bucket state isn't updated on port release is unclear.
> Possibly a performance trade-off to avoid scanning bucket owners, or just
> an oversight.
>
> Fix it by recalculating the bucket state when a socket releases a port. To
> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
> fastreuseport) state. On port release, only the relevant port-addr bucket
> is scanned, and the overall state is derived from these.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket
  2025-09-13 10:09 ` [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
@ 2025-09-16  5:50   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 8+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-16  5:50 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine

On Sat, Sep 13, 2025 at 3:09 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Exercise the scenario described in detail in the cover letter:
>
>   1) socket A: connect() from ephemeral port X
>   2) socket B: explicitly bind() to port X
>   3) check that port X is now excluded from ephemeral ports
>   4) close socket B to release the port bind
>   5) socket C: connect() from ephemeral port X
>
> As well as a corner case to test that the connect-bind flag is cleared:
>
>   1) connect() from ephemeral port X
>   2) disconnect the socket with connect(AF_UNSPEC)
>   3) bind() it explicitly to port X
>   4) check that port X is now excluded from ephemeral ports
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
  2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
  2025-09-16  5:45   ` Kuniyuki Iwashima
@ 2025-09-16 10:14   ` Paolo Abeni
  2025-09-16 13:14     ` Jakub Sitnicki
  1 sibling, 1 reply; 8+ messages in thread
From: Paolo Abeni @ 2025-09-16 10:14 UTC (permalink / raw)
  To: Jakub Sitnicki, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
	Neal Cardwell, kernel-team, Lee Valentine

On 9/13/25 12:09 PM, Jakub Sitnicki wrote:
> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
> in that state until all sockets are removed and the bucket is destroyed.
> 
> In this state, the bucket is skipped during ephemeral port selection in
> connect(). For applications using a reduced ephemeral port
> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
> exhaustion since blocked buckets are excluded from reuse.
> 
> The reason the bucket state isn't updated on port release is unclear.
> Possibly a performance trade-off to avoid scanning bucket owners, or just
> an oversight.
> 
> Fix it by recalculating the bucket state when a socket releases a port. To
> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
> fastreuseport) state. On port release, only the relevant port-addr bucket
> is scanned, and the overall state is derived from these.

I'm possibly likely lost, but I think that the bucket state could change
even after inet_bhash2_update_saddr(), but AFAICS it's not updated there.

What am I missing?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
  2025-09-16 10:14   ` Paolo Abeni
@ 2025-09-16 13:14     ` Jakub Sitnicki
  2025-09-23  7:45       ` Paolo Abeni
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Sitnicki @ 2025-09-16 13:14 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kuniyuki Iwashima, Neal Cardwell, kernel-team, Lee Valentine

On Tue, Sep 16, 2025 at 12:14 PM +02, Paolo Abeni wrote:
> On 9/13/25 12:09 PM, Jakub Sitnicki wrote:
>> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
>> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
>> in that state until all sockets are removed and the bucket is destroyed.
>> 
>> In this state, the bucket is skipped during ephemeral port selection in
>> connect(). For applications using a reduced ephemeral port
>> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
>> exhaustion since blocked buckets are excluded from reuse.
>> 
>> The reason the bucket state isn't updated on port release is unclear.
>> Possibly a performance trade-off to avoid scanning bucket owners, or just
>> an oversight.
>> 
>> Fix it by recalculating the bucket state when a socket releases a port. To
>> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
>> fastreuseport) state. On port release, only the relevant port-addr bucket
>> is scanned, and the overall state is derived from these.
>
> I'm possibly likely lost, but I think that the bucket state could change
> even after inet_bhash2_update_saddr(), but AFAICS it's not updated there.

Let me double check if I understand what you have in mind because now I
also feel a bit lost :-)

We already update the bucket state in inet_bhash2_update_saddr(). I
assume we are talking about the main body, not the early bailout path
when the socket is not bound yet [1].

This code gets called only in the obscure (?) case when ip_dynaddr [2]
sysctl is set, and we have a routing failure during connection setup
phase (SYN-SENT).

In such case, on source address update, call to
inet_bind2_bucket_destroy() will recalculate port-addr bucket state,
potentially "downgrading" it to (fastreuse=-1, fastreuseport=-1).

But if the "downgrade" happens, it changes nothing for the port bucket
state, as we are about to re-add the socket into another port-addr
bucket.

Now, adding a CONNECT_BIND socket to an existing port-addr bucket, that
also has no side effects. We can't "upgrade" the bucket to the shareable
state (fastreuse=-1, fastreuseport=-1).

That said, I do see an unaddressed corner case now that I audit this
code again. If we end up _creating_ a new inet_bind2_bucket
(__inet_bhash2_update_saddr->inet_bind2_bucket_init), then the bucket
state should be initialized to (fastreuse=-1, fastreuseport=-1) when the
socket has the CONNECT_BIND flag set.

The call chain I'm referring to:

tcp_connect / __tcp_retransmit_skb
  ->rebuild_header
    inet_sk_rebuild_header
      inet_sk_reselect_saddr IFF sysctl_ip_dynaddr != 0
        inet_bhash2_update_saddr
          __inet_bhash2_update_saddr
            inet_bind2_bucket_init

I propose to handle that by checking if the socket has CONNECT_BIND flag
set and overwriting the port-addr bucket state similiar to like I did in
__inet_hash_connect:

	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
	if (!tb2) {
		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net,
					       head2, tb, sk);
		if (!tb2)
			goto error;
		tb2->fastreuse = -1;
		tb2->fastreuseport = -1;
	}

So the obscure ip_dynadr path does need a fixup. Other than that I'm not
able to poke any other holes in how we manage the bucket state.

Was that your concern or you had something else in mind?

Thanks,
-jkbs

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/inet_hashtables.c?h=v6.17-rc6#n914
[2] https://docs.kernel.org/networking/ip_dynaddr.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
  2025-09-16 13:14     ` Jakub Sitnicki
@ 2025-09-23  7:45       ` Paolo Abeni
  0 siblings, 0 replies; 8+ messages in thread
From: Paolo Abeni @ 2025-09-23  7:45 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kuniyuki Iwashima, Neal Cardwell, kernel-team, Lee Valentine

Hi,

I'm sorry for the latency, I got lost in pending threads.

On 9/16/25 3:14 PM, Jakub Sitnicki wrote:
> On Tue, Sep 16, 2025 at 12:14 PM +02, Paolo Abeni wrote:
>> On 9/13/25 12:09 PM, Jakub Sitnicki wrote:
>>> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
>>> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
>>> in that state until all sockets are removed and the bucket is destroyed.
>>>
>>> In this state, the bucket is skipped during ephemeral port selection in
>>> connect(). For applications using a reduced ephemeral port
>>> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
>>> exhaustion since blocked buckets are excluded from reuse.
>>>
>>> The reason the bucket state isn't updated on port release is unclear.
>>> Possibly a performance trade-off to avoid scanning bucket owners, or just
>>> an oversight.
>>>
>>> Fix it by recalculating the bucket state when a socket releases a port. To
>>> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
>>> fastreuseport) state. On port release, only the relevant port-addr bucket
>>> is scanned, and the overall state is derived from these.
>>
>> I'm possibly likely lost, but I think that the bucket state could change
>> even after inet_bhash2_update_saddr(), but AFAICS it's not updated there.
> 
> Let me double check if I understand what you have in mind because now I
> also feel a bit lost :-)
> 
> We already update the bucket state in inet_bhash2_update_saddr(). I
> assume we are talking about the main body, not the early bailout path
> when the socket is not bound yet [1].
> 
> This code gets called only in the obscure (?) case when ip_dynaddr [2]
> sysctl is set, and we have a routing failure during connection setup
> phase (SYN-SENT).
> 
> In such case, on source address update, call to
> inet_bind2_bucket_destroy() will recalculate port-addr bucket state,
> potentially "downgrading" it to (fastreuse=-1, fastreuseport=-1).
> 
> But if the "downgrade" happens, it changes nothing for the port bucket
> state, as we are about to re-add the socket into another port-addr
> bucket.

This was indeed the path I was looking for. I lost track of the fact
that the port bucket affected by the removed and add is the same, so
it's state does not change.

It clear now that you pointed that out, thanks!

Paolo


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-09-23  7:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-13 10:09 [PATCH net-next v4 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
2025-09-16  5:45   ` Kuniyuki Iwashima
2025-09-16 10:14   ` Paolo Abeni
2025-09-16 13:14     ` Jakub Sitnicki
2025-09-23  7:45       ` Paolo Abeni
2025-09-13 10:09 ` [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
2025-09-16  5:50   ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).