* [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release
@ 2025-09-17 13:22 Jakub Sitnicki
2025-09-17 13:22 ` [PATCH net-next v5 1/2] " Jakub Sitnicki
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Jakub Sitnicki @ 2025-09-17 13:22 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
TL;DR
-----
This is another take on addressing the issue we already raised earlier [1].
This time around, instead of trying to relax the bind-conflict checks in
connect(), we make an attempt to fix the tcp bind bucket state accounting.
The goal of this patch set is to make the bind buckets return to "port
reusable by ephemeral connections" state when all sockets blocking the port
from reuse get unhashed.
Changelog
---------
Changes in v5:
- Fix initial port-addr bucket state on saddr update with ip_dynaddr=1
- Add Kuniyuki's tag for tests
- Link to v4: https://lore.kernel.org/r/20250913-update-bind-bucket-state-on-unhash-v4-0-33a567594df7@cloudflare.com
Changes in v4:
- Drop redundant sk_is_connect_bind helper doc comment
- Link to v3: https://lore.kernel.org/r/20250910-update-bind-bucket-state-on-unhash-v3-0-023caaf4ae3c@cloudflare.com
Changes in v3:
- Move the flag from inet_flags to sk_userlocks (Kuniyuki)
- Rename the flag from AUTOBIND to CONNECT_BIND to avoid a name clash (Kuniyuki)
- Drop unreachable code for sk_state == TCP_NEW_SYN_RECV (Kuniyuki)
- Move the helper to inet_hashtables where it's used
- Reword patch 1 description for conciseness
- Link to v2: https://lore.kernel.org/r/20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com
Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com
Situation
---------
We observe the following scenario in production:
inet_bind_bucket
state for port 54321
--------------------
(bucket doesn't exist)
// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
tb->fastreuse = -1
tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.
// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
tb->fastreuse = 0
tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()
// bucket remains in this
// state even though port
// was released by s2
tb->fastreuse = 0
tb->fastreuseport = 0
// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.
s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)
Problem
-------
We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.
The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.
Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, the reality is that such implementation is not
actually uncommon. Trying to fix each and every such program is like
playing whack-a-mole.
For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:
dialer := &net.Dialer{
LocalAddr: &net.TCPAddr{IP: srcIP},
}
conn, err := dialer.Dial("tcp4", dialTarget)
Or even a ubiquitous tool like dig when using a specific local address:
$ dig -b 127.1.1.1 +tcp +short example.com
Hence, we are proposing a systematic fix in the network stack itself.
Solution
--------
Please see the description in patch 1.
[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com
Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Jakub Sitnicki (2):
tcp: Update bind bucket state on port release
selftests/net: Test tcp port reuse after unbinding a socket
include/net/inet_connection_sock.h | 5 +-
include/net/inet_hashtables.h | 2 +
include/net/inet_timewait_sock.h | 3 +-
include/net/sock.h | 4 +
net/ipv4/inet_connection_sock.c | 12 +-
net/ipv4/inet_hashtables.c | 44 ++++-
net/ipv4/inet_timewait_sock.c | 1 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/tcp_port_share.c | 258 +++++++++++++++++++++++++++
9 files changed, 322 insertions(+), 8 deletions(-)
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH net-next v5 1/2] tcp: Update bind bucket state on port release
2025-09-17 13:22 [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
@ 2025-09-17 13:22 ` Jakub Sitnicki
2025-09-19 21:15 ` Kuniyuki Iwashima
2025-09-17 13:22 ` [PATCH net-next v5 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
2025-09-23 8:30 ` [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release patchwork-bot+netdevbpf
2 siblings, 1 reply; 5+ messages in thread
From: Jakub Sitnicki @ 2025-09-17 13:22 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.
In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.
The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.
Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/net/inet_connection_sock.h | 5 +++--
include/net/inet_hashtables.h | 2 ++
include/net/inet_timewait_sock.h | 3 ++-
include/net/sock.h | 4 ++++
net/ipv4/inet_connection_sock.c | 12 +++++++----
net/ipv4/inet_hashtables.c | 44 +++++++++++++++++++++++++++++++++++++-
net/ipv4/inet_timewait_sock.c | 1 +
7 files changed, 63 insertions(+), 8 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 0737d8e178dd..b4b886647607 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -316,8 +316,9 @@ int inet_csk_listen_start(struct sock *sk);
void inet_csk_listen_stop(struct sock *sk);
/* update the fast reuse flag when adding a socket */
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
- struct sock *sk);
+void inet_csk_update_fastreuse(const struct sock *sk,
+ struct inet_bind_bucket *tb,
+ struct inet_bind2_bucket *tb2);
struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a3b32241c2f2..1875853e97a4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -108,6 +108,8 @@ struct inet_bind2_bucket {
struct hlist_node bhash_node;
/* List of sockets hashed to this bucket */
struct hlist_head owners;
+ signed char fastreuse;
+ signed char fastreuseport;
};
static inline struct net *ib_net(const struct inet_bind_bucket *ib)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 67a313575780..baafef24318e 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,7 +70,8 @@ struct inet_timewait_sock {
unsigned int tw_transparent : 1,
tw_flowlabel : 20,
tw_usec_ts : 1,
- tw_pad : 2, /* 2 bits hole */
+ tw_connect_bind : 1,
+ tw_pad : 1, /* 1 bit hole */
tw_tos : 8;
u32 tw_txhash;
u32 tw_priority;
diff --git a/include/net/sock.h b/include/net/sock.h
index 0fd465935334..2e398ac33308 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1490,6 +1490,10 @@ static inline int __sk_prot_rehash(struct sock *sk)
#define SOCK_BINDADDR_LOCK 4
#define SOCK_BINDPORT_LOCK 8
+/**
+ * define SOCK_CONNECT_BIND - &sock->sk_userlocks flag for auto-bind at connect() time
+ */
+#define SOCK_CONNECT_BIND 16
struct socket_alloc {
struct socket socket;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 142ff8d86fc2..cdd1e12aac8c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -423,7 +423,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
}
static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
- struct sock *sk)
+ const struct sock *sk)
{
if (tb->fastreuseport <= 0)
return 0;
@@ -453,8 +453,9 @@ static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
ipv6_only_sock(sk), true, false);
}
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
- struct sock *sk)
+void inet_csk_update_fastreuse(const struct sock *sk,
+ struct inet_bind_bucket *tb,
+ struct inet_bind2_bucket *tb2)
{
bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
@@ -501,6 +502,9 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
tb->fastreuseport = 0;
}
}
+
+ tb2->fastreuse = tb->fastreuse;
+ tb2->fastreuseport = tb->fastreuseport;
}
/* Obtain a reference to a local port for the given sock,
@@ -582,7 +586,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
}
success:
- inet_csk_update_fastreuse(tb, sk);
+ inet_csk_update_fastreuse(sk, tb, tb2);
if (!inet_csk(sk)->icsk_bind_hash)
inet_bind_hash(sk, tb, tb2, port);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ef4ccfd46ff6..0e0b3d24c3f5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -58,6 +58,14 @@ static u32 sk_ehashfn(const struct sock *sk)
sk->sk_daddr, sk->sk_dport);
}
+static bool sk_is_connect_bind(const struct sock *sk)
+{
+ if (sk->sk_state == TCP_TIME_WAIT)
+ return inet_twsk(sk)->tw_connect_bind;
+ else
+ return sk->sk_userlocks & SOCK_CONNECT_BIND;
+}
+
/*
* Allocate and initialize a new local port bind bucket.
* The bindhash mutex for snum's hash chain must be held here.
@@ -87,10 +95,22 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
*/
void inet_bind_bucket_destroy(struct inet_bind_bucket *tb)
{
+ const struct inet_bind2_bucket *tb2;
+
if (hlist_empty(&tb->bhash2)) {
hlist_del_rcu(&tb->node);
kfree_rcu(tb, rcu);
+ return;
+ }
+
+ if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+ return;
+ hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
+ if (tb2->fastreuse != -1 || tb2->fastreuseport != -1)
+ return;
}
+ tb->fastreuse = -1;
+ tb->fastreuseport = -1;
}
bool inet_bind_bucket_match(const struct inet_bind_bucket *tb, const struct net *net,
@@ -121,6 +141,8 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb2,
#else
tb2->rcv_saddr = sk->sk_rcv_saddr;
#endif
+ tb2->fastreuse = 0;
+ tb2->fastreuseport = 0;
INIT_HLIST_HEAD(&tb2->owners);
hlist_add_head(&tb2->node, &head->chain);
hlist_add_head(&tb2->bhash_node, &tb->bhash2);
@@ -143,11 +165,23 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
/* Caller must hold hashbucket lock for this tb with local BH disabled */
void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
{
+ const struct sock *sk;
+
if (hlist_empty(&tb->owners)) {
__hlist_del(&tb->node);
__hlist_del(&tb->bhash_node);
kmem_cache_free(cachep, tb);
+ return;
}
+
+ if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+ return;
+ sk_for_each_bound(sk, &tb->owners) {
+ if (!sk_is_connect_bind(sk))
+ return;
+ }
+ tb->fastreuse = -1;
+ tb->fastreuseport = -1;
}
static bool inet_bind2_bucket_addr_match(const struct inet_bind2_bucket *tb2,
@@ -191,6 +225,7 @@ static void __inet_put_port(struct sock *sk)
tb = inet_csk(sk)->icsk_bind_hash;
inet_csk(sk)->icsk_bind_hash = NULL;
inet_sk(sk)->inet_num = 0;
+ sk->sk_userlocks &= ~SOCK_CONNECT_BIND;
spin_lock(&head2->lock);
if (inet_csk(sk)->icsk_bind2_hash) {
@@ -277,7 +312,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
}
}
if (update_fastreuse)
- inet_csk_update_fastreuse(tb, child);
+ inet_csk_update_fastreuse(child, tb, tb2);
inet_bind_hash(child, tb, tb2, port);
spin_unlock(&head2->lock);
spin_unlock(&head->lock);
@@ -966,6 +1001,10 @@ static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family,
if (!tb2) {
tb2 = new_tb2;
inet_bind2_bucket_init(tb2, net, head2, inet_csk(sk)->icsk_bind_hash, sk);
+ if (sk_is_connect_bind(sk)) {
+ tb2->fastreuse = -1;
+ tb2->fastreuseport = -1;
+ }
}
inet_csk(sk)->icsk_bind2_hash = tb2;
sk_add_bind_node(sk, &tb2->owners);
@@ -1136,6 +1175,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
head2, tb, sk);
if (!tb2)
goto error;
+ tb2->fastreuse = -1;
+ tb2->fastreuseport = -1;
}
/* Here we want to add a little bit of randomness to the next source
@@ -1148,6 +1189,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, tb2, port);
+ sk->sk_userlocks |= SOCK_CONNECT_BIND;
if (sk_unhashed(sk)) {
inet_sk(sk)->inet_sport = htons(port);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b5426b8ee92..3d51f751876d 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -207,6 +207,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
tw->tw_hash = sk->sk_hash;
tw->tw_ipv6only = 0;
tw->tw_transparent = inet_test_bit(TRANSPARENT, sk);
+ tw->tw_connect_bind = !!(sk->sk_userlocks & SOCK_CONNECT_BIND);
tw->tw_prot = sk->sk_prot_creator;
atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
twsk_net_set(tw, sock_net(sk));
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH net-next v5 2/2] selftests/net: Test tcp port reuse after unbinding a socket
2025-09-17 13:22 [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
2025-09-17 13:22 ` [PATCH net-next v5 1/2] " Jakub Sitnicki
@ 2025-09-17 13:22 ` Jakub Sitnicki
2025-09-23 8:30 ` [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release patchwork-bot+netdevbpf
2 siblings, 0 replies; 5+ messages in thread
From: Jakub Sitnicki @ 2025-09-17 13:22 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
Exercise the scenario described in detail in the cover letter:
1) socket A: connect() from ephemeral port X
2) socket B: explicitly bind() to port X
3) check that port X is now excluded from ephemeral ports
4) close socket B to release the port bind
5) socket C: connect() from ephemeral port X
As well as a corner case to test that the connect-bind flag is cleared:
1) connect() from ephemeral port X
2) disconnect the socket with connect(AF_UNSPEC)
3) bind() it explicitly to port X
4) check that port X is now excluded from ephemeral ports
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/tcp_port_share.c | 258 +++++++++++++++++++++++++++
2 files changed, 259 insertions(+)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index ae1afe75bc86..5d9d96515c4a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -122,6 +122,7 @@ TEST_PROGS += broadcast_pmtu.sh
TEST_PROGS += ipv6_force_forwarding.sh
TEST_GEN_PROGS += ipv6_fragmentation
TEST_PROGS += route_hint.sh
+TEST_GEN_PROGS += tcp_port_share
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller
diff --git a/tools/testing/selftests/net/tcp_port_share.c b/tools/testing/selftests/net/tcp_port_share.c
new file mode 100644
index 000000000000..4c39d599dfce
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_port_share.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2025 Cloudflare, Inc.
+
+/* Tests for TCP port sharing (bind bucket reuse). */
+
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdlib.h>
+
+#include "../kselftest_harness.h"
+
+#define DST_PORT 30000
+#define SRC_PORT 40000
+
+struct sockaddr_inet {
+ union {
+ struct sockaddr_storage ss;
+ struct sockaddr_in6 v6;
+ struct sockaddr_in v4;
+ struct sockaddr sa;
+ };
+ socklen_t len;
+ char str[INET6_ADDRSTRLEN + __builtin_strlen("[]:65535") + 1];
+};
+
+const int one = 1;
+
+static int disconnect(int fd)
+{
+ return connect(fd, &(struct sockaddr){ AF_UNSPEC }, sizeof(struct sockaddr));
+}
+
+static int getsockname_port(int fd)
+{
+ struct sockaddr_inet addr = {};
+ int err;
+
+ addr.len = sizeof(addr);
+ err = getsockname(fd, &addr.sa, &addr.len);
+ if (err)
+ return -1;
+
+ switch (addr.sa.sa_family) {
+ case AF_INET:
+ return ntohs(addr.v4.sin_port);
+ case AF_INET6:
+ return ntohs(addr.v6.sin6_port);
+ default:
+ errno = EAFNOSUPPORT;
+ return -1;
+ }
+}
+
+static void make_inet_addr(int af, const char *ip, __u16 port,
+ struct sockaddr_inet *addr)
+{
+ const char *fmt = "";
+
+ memset(addr, 0, sizeof(*addr));
+
+ switch (af) {
+ case AF_INET:
+ addr->len = sizeof(addr->v4);
+ addr->v4.sin_family = af;
+ addr->v4.sin_port = htons(port);
+ inet_pton(af, ip, &addr->v4.sin_addr);
+ fmt = "%s:%hu";
+ break;
+ case AF_INET6:
+ addr->len = sizeof(addr->v6);
+ addr->v6.sin6_family = af;
+ addr->v6.sin6_port = htons(port);
+ inet_pton(af, ip, &addr->v6.sin6_addr);
+ fmt = "[%s]:%hu";
+ break;
+ }
+
+ snprintf(addr->str, sizeof(addr->str), fmt, ip, port);
+}
+
+FIXTURE(tcp_port_share) {};
+
+FIXTURE_VARIANT(tcp_port_share) {
+ int domain;
+ /* IP to listen on and connect to */
+ const char *dst_ip;
+ /* Primary IP to connect from */
+ const char *src1_ip;
+ /* Secondary IP to connect from */
+ const char *src2_ip;
+ /* IP to bind to in order to block the source port */
+ const char *bind_ip;
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv4) {
+ .domain = AF_INET,
+ .dst_ip = "127.0.0.1",
+ .src1_ip = "127.1.1.1",
+ .src2_ip = "127.2.2.2",
+ .bind_ip = "127.3.3.3",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv6) {
+ .domain = AF_INET6,
+ .dst_ip = "::1",
+ .src1_ip = "2001:db8::1",
+ .src2_ip = "2001:db8::2",
+ .bind_ip = "2001:db8::3",
+};
+
+FIXTURE_SETUP(tcp_port_share)
+{
+ int sc;
+
+ ASSERT_EQ(unshare(CLONE_NEWNET), 0);
+ ASSERT_EQ(system("ip link set dev lo up"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::1/32 nodad"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::2/32 nodad"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::3/32 nodad"), 0);
+
+ sc = open("/proc/sys/net/ipv4/ip_local_port_range", O_WRONLY);
+ ASSERT_GE(sc, 0);
+ ASSERT_GT(dprintf(sc, "%hu %hu\n", SRC_PORT, SRC_PORT), 0);
+ ASSERT_EQ(close(sc), 0);
+}
+
+FIXTURE_TEARDOWN(tcp_port_share) {}
+
+/* Verify that an ephemeral port becomes available again after the socket
+ * bound to it and blocking it from reuse is closed.
+ */
+TEST_F(tcp_port_share, can_reuse_port_after_bind_and_close)
+{
+ const typeof(variant) v = variant;
+ struct sockaddr_inet addr;
+ int c1, c2, ln, pb;
+
+ /* Listen on <dst_ip>:<DST_PORT> */
+ ln = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(bind(ln, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+ ASSERT_EQ(listen(ln, 2), 0);
+
+ /* Connect from <src1_ip>:<SRC_PORT> */
+ c1 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(c1, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src1_ip, 0, &addr);
+ ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(connect(c1, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+ ASSERT_EQ(getsockname_port(c1), SRC_PORT);
+
+ /* Bind to <bind_ip>:<SRC_PORT>. Block the port from reuse. */
+ pb = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(pb, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(pb, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+ ASSERT_EQ(bind(pb, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ /* Try to connect from <src2_ip>:<SRC_PORT>. Expect failure. */
+ c2 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket");
+ ASSERT_EQ(setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src2_ip, 0, &addr);
+ ASSERT_EQ(bind(c2, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(connect(c2, &addr.sa, addr.len), -1) TH_LOG("connect(%s)", addr.str);
+ ASSERT_EQ(errno, EADDRNOTAVAIL) TH_LOG("%m");
+
+ /* Unbind from <bind_ip>:<SRC_PORT>. Unblock the port for reuse. */
+ ASSERT_EQ(close(pb), 0);
+
+ /* Connect again from <src2_ip>:<SRC_PORT> */
+ EXPECT_EQ(connect(c2, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+ EXPECT_EQ(getsockname_port(c2), SRC_PORT);
+
+ ASSERT_EQ(close(c2), 0);
+ ASSERT_EQ(close(c1), 0);
+ ASSERT_EQ(close(ln), 0);
+}
+
+/* Verify that a socket auto-bound during connect() blocks port reuse after
+ * disconnect (connect(AF_UNSPEC)) followed by an explicit port bind().
+ */
+TEST_F(tcp_port_share, port_block_after_disconnect)
+{
+ const typeof(variant) v = variant;
+ struct sockaddr_inet addr;
+ int c1, c2, ln, pb;
+
+ /* Listen on <dst_ip>:<DST_PORT> */
+ ln = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(bind(ln, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+ ASSERT_EQ(listen(ln, 2), 0);
+
+ /* Connect from <src1_ip>:<SRC_PORT> */
+ c1 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(c1, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src1_ip, 0, &addr);
+ ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(connect(c1, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+ ASSERT_EQ(getsockname_port(c1), SRC_PORT);
+
+ /* Disconnect the socket and bind it to <bind_ip>:<SRC_PORT> to block the port */
+ ASSERT_EQ(disconnect(c1), 0) TH_LOG("disconnect: %m");
+ ASSERT_EQ(setsockopt(c1, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+ ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ /* Trigger port-addr bucket state update with another bind() and close() */
+ pb = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(pb, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(pb, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+ ASSERT_EQ(bind(pb, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ ASSERT_EQ(close(pb), 0);
+
+ /* Connect from <src2_ip>:<SRC_PORT>. Expect failure. */
+ c2 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket: %m");
+ ASSERT_EQ(setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src2_ip, 0, &addr);
+ ASSERT_EQ(bind(c2, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ EXPECT_EQ(connect(c2, &addr.sa, addr.len), -1) TH_LOG("connect(%s)", addr.str);
+ EXPECT_EQ(errno, EADDRNOTAVAIL) TH_LOG("%m");
+
+ ASSERT_EQ(close(c2), 0);
+ ASSERT_EQ(close(c1), 0);
+ ASSERT_EQ(close(ln), 0);
+}
+
+TEST_HARNESS_MAIN
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH net-next v5 1/2] tcp: Update bind bucket state on port release
2025-09-17 13:22 ` [PATCH net-next v5 1/2] " Jakub Sitnicki
@ 2025-09-19 21:15 ` Kuniyuki Iwashima
0 siblings, 0 replies; 5+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-19 21:15 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
On Wed, Sep 17, 2025 at 6:22 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
> in that state until all sockets are removed and the bucket is destroyed.
>
> In this state, the bucket is skipped during ephemeral port selection in
> connect(). For applications using a reduced ephemeral port
> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
> exhaustion since blocked buckets are excluded from reuse.
>
> The reason the bucket state isn't updated on port release is unclear.
> Possibly a performance trade-off to avoid scanning bucket owners, or just
> an oversight.
>
> Fix it by recalculating the bucket state when a socket releases a port. To
> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
> fastreuseport) state. On port release, only the relevant port-addr bucket
> is scanned, and the overall state is derived from these.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
The update in __inet_bhash2_update_saddr() looks good.
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Thanks!
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release
2025-09-17 13:22 [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
2025-09-17 13:22 ` [PATCH net-next v5 1/2] " Jakub Sitnicki
2025-09-17 13:22 ` [PATCH net-next v5 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
@ 2025-09-23 8:30 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-09-23 8:30 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, davem, edumazet, kuba, kuniyu, ncardwell, pabeni,
kernel-team, lvalentine
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Wed, 17 Sep 2025 15:22:03 +0200 you wrote:
> TL;DR
> -----
>
> This is another take on addressing the issue we already raised earlier [1].
>
> This time around, instead of trying to relax the bind-conflict checks in
> connect(), we make an attempt to fix the tcp bind bucket state accounting.
>
> [...]
Here is the summary with links:
- [net-next,v5,1/2] tcp: Update bind bucket state on port release
https://git.kernel.org/netdev/net-next/c/d57f4b874946
- [net-next,v5,2/2] selftests/net: Test tcp port reuse after unbinding a socket
https://git.kernel.org/netdev/net-next/c/8a8241cdaa34
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-09-23 8:30 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-17 13:22 [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
2025-09-17 13:22 ` [PATCH net-next v5 1/2] " Jakub Sitnicki
2025-09-19 21:15 ` Kuniyuki Iwashima
2025-09-17 13:22 ` [PATCH net-next v5 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
2025-09-23 8:30 ` [PATCH net-next v5 0/2] tcp: Update bind bucket state on port release patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).