* [PATCH RFC net-next 0/2] tcp: Update bind bucket state on port release
@ 2025-08-08 9:10 Jakub Sitnicki
2025-08-08 9:10 ` [PATCH RFC net-next 1/2] " Jakub Sitnicki
2025-08-08 9:10 ` [PATCH RFC net-next 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
0 siblings, 2 replies; 5+ messages in thread
From: Jakub Sitnicki @ 2025-08-08 9:10 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
TL;DR
-----
This is another take on the issue we raised earlier [1].
This time around, instead of trying to relax the bind-conflict checks in
connect(), we make an attempt to fix the tcp bind bucket state accounting.
The goal of this patch set is to make the bind buckets return to "port
reusable by ephemeral connections" state when all sockets blocking the port
from reuse get unhashed.
Situation
---------
We observe the following scenario in production:

                                                  inet_bind_bucket
                                                state for port 54321
                                                --------------------

                                                (bucket doesn't exist)

// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
                                                tb->fastreuse = -1
                                                tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()

                                                // bucket remains in this
                                                // state even though port
                                                // was released by s2
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)

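
For reference, the per-socket range that Process A sets in the scenario
above is passed to the kernel as a single 32-bit value. The sketch below
assumes the encoding described in ip(7) for IP_LOCAL_PORT_RANGE (Linux 6.3
and later): lower bound in the low 16 bits, upper bound in the high 16 bits.
The helper names pack_port_range() and set_local_port_range() are made up
for illustration and are not part of this series.

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_LOCAL_PORT_RANGE
#define IP_LOCAL_PORT_RANGE 51	/* value from linux/in.h, for older libc headers */
#endif

/* Hypothetical helper: encode [lo, hi] the way IP_LOCAL_PORT_RANGE expects. */
static inline uint32_t pack_port_range(uint16_t lo, uint16_t hi)
{
	assert(lo <= hi);
	return ((uint32_t)hi << 16) | lo;
}

/* Hypothetical helper: returns 0 on success, -1 with errno set on failure
 * (for example ENOPROTOOPT on kernels that lack the option).
 */
static inline int set_local_port_range(int fd, uint16_t lo, uint16_t hi)
{
	uint32_t range = pack_port_range(lo, hi);

	return setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
			  &range, sizeof(range));
}
```

With these helpers, the s1 and s3 setup would call
set_local_port_range(fd, 54000, 54500) before bind().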
Problem
-------
We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.
The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.
Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, in practice such implementations are common, and
trying to fix each and every such program is like playing whack-a-mole.
For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:
dialer := &net.Dialer{
LocalAddr: &net.TCPAddr{IP: srcIP},
}
conn, err := dialer.Dial("tcp4", dialTarget)
Or even a ubiquitous tool like dig when using a specific local address:
$ dig -b 127.1.1.1 +tcp +short example.com
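
In C, the difference between the two bind styles comes down to one
setsockopt() call. The following is a minimal sketch, assuming IPv4 and a
Linux kernel that supports IP_BIND_ADDRESS_NO_PORT; bind_src_ip() and
local_port() are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24	/* value from linux/in.h, for older libc headers */
#endif

/* Hypothetical helper: bind fd to ip:0. With no_port != 0 the kernel defers
 * port selection to connect() (Process A style); with no_port == 0 bind()
 * reserves a port right away (Process B style).
 * Returns 0 on success, -1 with errno set on failure.
 */
static inline int bind_src_ip(int fd, const char *ip, int no_port)
{
	struct sockaddr_in addr;

	assert(ip != NULL);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(0);	/* let the kernel choose the port */
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
		errno = EINVAL;
		return -1;
	}
	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &no_port, sizeof(no_port)))
		return -1;
	return bind(fd, (struct sockaddr *)&addr, sizeof(addr));
}

/* Local port as reported by getsockname(), or -1 on error. */
static inline int local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	if (getsockname(fd, (struct sockaddr *)&addr, &len))
		return -1;
	return ntohs(addr.sin_port);
}
```

After bind_src_ip(fd, ip, 1) the socket should still report local port 0,
whereas after bind_src_ip(fd, ip, 0) it already owns a port. The latter is
what moves the bind bucket out of the (-1, -1) state.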
Hence, we are proposing a systematic fix in the network stack itself.
Solution
--------
Please see the description in patch 1.
[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com
Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Jakub Sitnicki (2):
tcp: Update bind bucket state on port release
selftests/net: Test tcp port reuse after unbinding a socket
include/net/inet_connection_sock.h | 5 +-
include/net/inet_hashtables.h | 2 +
include/net/inet_sock.h | 2 +
include/net/inet_timewait_sock.h | 3 +-
include/net/tcp.h | 12 ++
net/ipv4/inet_connection_sock.c | 12 +-
net/ipv4/inet_hashtables.c | 31 ++++-
net/ipv4/inet_timewait_sock.c | 1 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/tcp_port_share.c | 182 +++++++++++++++++++++++++++
10 files changed, 243 insertions(+), 8 deletions(-)
* [PATCH RFC net-next 1/2] tcp: Update bind bucket state on port release
From: Jakub Sitnicki @ 2025-08-08 9:10 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
Currently, once a socket explicitly binds to a port and the inet_bind_bucket
enters a state where fastreuse >= 0 or fastreuseport >= 0, the bucket stays
in that state until all associated sockets are removed and the bucket is
destroyed.
In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a small ephemeral port range (via
IP_LOCAL_PORT_RANGE option), this can lead to quicker port exhaustion
because "blocked" buckets remain excluded from reuse.
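
The skip rule can be illustrated with a small userspace model. This is only
a sketch: struct bucket_model, port_is_candidate() and pick_port() are
hypothetical names, and the model leaves out the 4-tuple uniqueness check
that connect() performs on candidate ports.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the two flags kept in struct inet_bind_bucket. */
struct bucket_model {
	int exists;		/* 0: no bucket for this port yet */
	signed char fastreuse;
	signed char fastreuseport;
};

/* A port is a candidate for connect() only if it has no bucket, or its
 * bucket is still in the (-1, -1) state.
 */
static inline int port_is_candidate(const struct bucket_model *tb)
{
	if (!tb->exists)
		return 1;
	return tb->fastreuse < 0 && tb->fastreuseport < 0;
}

/* Return the index of the first candidate port in [0, n), or -1 if none. */
static inline int pick_port(const struct bucket_model *tbs, size_t n)
{
	assert(tbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (port_is_candidate(&tbs[i]))
			return (int)i;
	return -1;
}
```

In this model, a single explicit bind() that flips a bucket to (0, 0) removes
the port from the candidate set for as long as the bucket lives, which is the
behavior this patch changes.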
The reason for not updating the bucket state on port release is unclear. It
may have been a performance trade-off to avoid scanning bucket owners, or
simply an oversight.
Address it by recalculating the bind bucket state when a socket releases a
port. To minimize overhead, use a divide-and-conquer strategy: duplicate
the (fastreuse, fastreuseport) state in each inet_bind2_bucket. On port
release, we only need to scan the owners of the relevant port-addr bucket,
and the overall port bucket state can be derived from the states of its
port-addr buckets.
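
The two-level recalculation can be sketched as a userspace model. The names
struct tb2_model, tb2_recalc() and tb_can_reset() are hypothetical; the model
assumes, as in the diff below, that a port-addr bucket returns to (-1, -1)
only when every remaining owner was bound at connect() time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct inet_bind2_bucket (one per port + address). */
struct tb2_model {
	signed char fastreuse;
	signed char fastreuseport;
	const bool *owner_autobound;	/* per owner: bound at connect() time? */
	size_t n_owners;
};

/* Step 1: after a socket leaves tb2, rescan only tb2's own owners. */
static inline void tb2_recalc(struct tb2_model *tb2)
{
	/* An empty bucket would have been freed instead of recalculated. */
	assert(tb2->n_owners > 0);

	if (tb2->fastreuse == -1 && tb2->fastreuseport == -1)
		return;
	for (size_t i = 0; i < tb2->n_owners; i++)
		if (!tb2->owner_autobound[i])
			return;	/* an explicit bind() still holds the port */
	tb2->fastreuse = -1;
	tb2->fastreuseport = -1;
}

/* Step 2: the per-port bucket may return to (-1, -1) only if every
 * port-addr bucket under it is already in that state.
 */
static inline bool tb_can_reset(const struct tb2_model *tb2s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tb2s[i].fastreuse != -1 || tb2s[i].fastreuseport != -1)
			return false;
	return true;
}
```

The point of the split is cost: step 1 touches only the sockets on one
address, and step 2 walks a list of per-address buckets instead of every
socket bound to the port.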
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/net/inet_connection_sock.h | 5 +++--
include/net/inet_hashtables.h | 2 ++
include/net/inet_sock.h | 2 ++
include/net/inet_timewait_sock.h | 3 ++-
include/net/tcp.h | 12 ++++++++++++
net/ipv4/inet_connection_sock.c | 12 ++++++++----
net/ipv4/inet_hashtables.c | 31 ++++++++++++++++++++++++++++++-
net/ipv4/inet_timewait_sock.c | 1 +
8 files changed, 60 insertions(+), 8 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1735db332aab..072347f16483 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -322,8 +322,9 @@ int inet_csk_listen_start(struct sock *sk);
void inet_csk_listen_stop(struct sock *sk);
/* update the fast reuse flag when adding a socket */
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
- struct sock *sk);
+void inet_csk_update_fastreuse(const struct sock *sk,
+ struct inet_bind_bucket *tb,
+ struct inet_bind2_bucket *tb2);
struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 19dbd9081d5a..d6676746dabf 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -108,6 +108,8 @@ struct inet_bind2_bucket {
struct hlist_node bhash_node;
/* List of sockets hashed to this bucket */
struct hlist_head owners;
+ signed char fastreuse;
+ signed char fastreuseport;
};
static inline struct net *ib_net(const struct inet_bind_bucket *ib)
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1086256549fa..73f1dbc1a04b 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -279,6 +279,8 @@ enum {
INET_FLAGS_RTALERT_ISOLATE = 28,
INET_FLAGS_SNDFLOW = 29,
INET_FLAGS_RTALERT = 30,
+ /* socket bound to a port at connect() time */
+ INET_FLAGS_LAZY_BIND = 31,
};
/* cmsg flags for inet */
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 67a313575780..9e5f1d08cc12 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,7 +70,8 @@ struct inet_timewait_sock {
unsigned int tw_transparent : 1,
tw_flowlabel : 20,
tw_usec_ts : 1,
- tw_pad : 2, /* 2 bits hole */
+ tw_lazy_bind : 1,
+ tw_pad : 1, /* 1 bit hole */
tw_tos : 8;
u32 tw_txhash;
u32 tw_priority;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b3815d104340..a8a7f14769f7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2225,6 +2225,18 @@ static inline bool inet_sk_transparent(const struct sock *sk)
return inet_test_bit(TRANSPARENT, sk);
}
+/* Check if socket was bound to a port at connect() time */
+static inline bool inet_sk_lazy_bind(const struct sock *sk)
+{
+ switch (sk->sk_state) {
+ case TCP_TIME_WAIT:
+ return inet_twsk(sk)->tw_lazy_bind;
+ case TCP_NEW_SYN_RECV:
+ return false; /* n/a to request sock */
+ }
+ return inet_test_bit(LAZY_BIND, sk);
+}
+
/* Determines whether this is a thin stream (which may suffer from
* increased latency). Used to trigger latency-reducing mechanisms.
*/
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1e2df51427fe..0076c67d9bd4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -423,7 +423,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
}
static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
- struct sock *sk)
+ const struct sock *sk)
{
if (tb->fastreuseport <= 0)
return 0;
@@ -453,8 +453,9 @@ static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
ipv6_only_sock(sk), true, false);
}
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
- struct sock *sk)
+void inet_csk_update_fastreuse(const struct sock *sk,
+ struct inet_bind_bucket *tb,
+ struct inet_bind2_bucket *tb2)
{
bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
@@ -501,6 +502,9 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
tb->fastreuseport = 0;
}
}
+
+ tb2->fastreuse = tb->fastreuse;
+ tb2->fastreuseport = tb->fastreuseport;
}
/* Obtain a reference to a local port for the given sock,
@@ -582,7 +586,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
}
success:
- inet_csk_update_fastreuse(tb, sk);
+ inet_csk_update_fastreuse(sk, tb, tb2);
if (!inet_csk(sk)->icsk_bind_hash)
inet_bind_hash(sk, tb, tb2, port);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..5e6eaae38105 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -87,10 +87,22 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
*/
void inet_bind_bucket_destroy(struct inet_bind_bucket *tb)
{
+ const struct inet_bind2_bucket *tb2;
+
if (hlist_empty(&tb->bhash2)) {
hlist_del_rcu(&tb->node);
kfree_rcu(tb, rcu);
+ return;
+ }
+
+ if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+ return;
+ hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
+ if (tb2->fastreuse != -1 || tb2->fastreuseport != -1)
+ return;
}
+ tb->fastreuse = -1;
+ tb->fastreuseport = -1;
}
bool inet_bind_bucket_match(const struct inet_bind_bucket *tb, const struct net *net,
@@ -121,6 +133,8 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb2,
#else
tb2->rcv_saddr = sk->sk_rcv_saddr;
#endif
+ tb2->fastreuse = 0;
+ tb2->fastreuseport = 0;
INIT_HLIST_HEAD(&tb2->owners);
hlist_add_head(&tb2->node, &head->chain);
hlist_add_head(&tb2->bhash_node, &tb->bhash2);
@@ -143,11 +157,23 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
/* Caller must hold hashbucket lock for this tb with local BH disabled */
void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
{
+ const struct sock *sk;
+
if (hlist_empty(&tb->owners)) {
__hlist_del(&tb->node);
__hlist_del(&tb->bhash_node);
kmem_cache_free(cachep, tb);
+ return;
+ }
+
+ if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+ return;
+ sk_for_each_bound(sk, &tb->owners) {
+ if (!inet_sk_lazy_bind(sk))
+ return;
}
+ tb->fastreuse = -1;
+ tb->fastreuseport = -1;
}
static bool inet_bind2_bucket_addr_match(const struct inet_bind2_bucket *tb2,
@@ -277,7 +303,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
}
}
if (update_fastreuse)
- inet_csk_update_fastreuse(tb, child);
+ inet_csk_update_fastreuse(child, tb, tb2);
inet_bind_hash(child, tb, tb2, port);
spin_unlock(&head2->lock);
spin_unlock(&head->lock);
@@ -1136,6 +1162,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
head2, tb, sk);
if (!tb2)
goto error;
+ tb2->fastreuse = -1;
+ tb2->fastreuseport = -1;
}
/* Here we want to add a little bit of randomness to the next source
@@ -1148,6 +1176,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
/* Head lock still held and bh's disabled */
inet_bind_hash(sk, tb, tb2, port);
+ inet_set_bit(LAZY_BIND, sk);
if (sk_unhashed(sk)) {
inet_sk(sk)->inet_sport = htons(port);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 875ff923a8ed..ee668e5c0938 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -206,6 +206,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
tw->tw_hash = sk->sk_hash;
tw->tw_ipv6only = 0;
tw->tw_transparent = inet_test_bit(TRANSPARENT, sk);
+ tw->tw_lazy_bind = inet_test_bit(LAZY_BIND, sk);
tw->tw_prot = sk->sk_prot_creator;
atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
twsk_net_set(tw, sock_net(sk));
--
2.43.0
* [PATCH RFC net-next 2/2] selftests/net: Test tcp port reuse after unbinding a socket
From: Jakub Sitnicki @ 2025-08-08 9:10 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
Exercise the scenario described in detail in the cover letter:
1) socket A: connect() from ephemeral port X
2) socket B: explicitly bind() to port X
3) check that port X is now excluded from ephemeral ports
4) close socket B to release the port bind
5) socket C: connect() from ephemeral port X
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/tcp_port_share.c | 182 +++++++++++++++++++++++++++
2 files changed, 183 insertions(+)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index b31a71f2b372..b317ec5e6aec 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -117,6 +117,7 @@ TEST_GEN_FILES += tfo
TEST_PROGS += tfo_passive.sh
TEST_PROGS += broadcast_pmtu.sh
TEST_PROGS += ipv6_force_forwarding.sh
+TEST_GEN_PROGS += tcp_port_share
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/tcp_port_share.c b/tools/testing/selftests/net/tcp_port_share.c
new file mode 100644
index 000000000000..d6db89affbc9
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_port_share.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2025 Cloudflare, Inc.
+
+/* Tests for TCP port sharing (bind bucket reuse). */
+
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdlib.h>
+
+#include "../kselftest_harness.h"
+
+struct sockaddr_inet {
+ union {
+ struct sockaddr_storage ss;
+ struct sockaddr_in6 v6;
+ struct sockaddr_in v4;
+ struct sockaddr sa;
+ };
+ socklen_t len;
+ char str[INET6_ADDRSTRLEN + __builtin_strlen("[]:65535") + 1];
+};
+
+static void make_inet_addr(int af, const char *ip, __u16 port,
+ struct sockaddr_inet *addr)
+{
+ const char *fmt = "";
+
+ memset(addr, 0, sizeof(*addr));
+
+ switch (af) {
+ case AF_INET:
+ addr->len = sizeof(addr->v4);
+ addr->v4.sin_family = af;
+ addr->v4.sin_port = htons(port);
+ inet_pton(af, ip, &addr->v4.sin_addr);
+ fmt = "%s:%hu";
+ break;
+ case AF_INET6:
+ addr->len = sizeof(addr->v6);
+ addr->v6.sin6_family = af;
+ addr->v6.sin6_port = htons(port);
+ inet_pton(af, ip, &addr->v6.sin6_addr);
+ fmt = "[%s]:%hu";
+ break;
+ }
+
+ snprintf(addr->str, sizeof(addr->str), fmt, ip, port);
+}
+
+static int getsockname_port(int fd)
+{
+ struct sockaddr_inet addr = {};
+ int err;
+
+ addr.len = sizeof(addr);
+ err = getsockname(fd, &addr.sa, &addr.len);
+ if (err)
+ return -1;
+
+ switch (addr.sa.sa_family) {
+ case AF_INET:
+ return ntohs(addr.v4.sin_port);
+ case AF_INET6:
+ return ntohs(addr.v6.sin6_port);
+ default:
+ errno = EAFNOSUPPORT;
+ return -1;
+ }
+}
+
+FIXTURE(tcp_port_share) {};
+
+FIXTURE_VARIANT(tcp_port_share) {
+ int domain;
+ /* IP to listen on and connect to */
+ const char *dst_ip;
+ /* Primary IP to connect from */
+ const char *src1_ip;
+ /* Secondary IP to connect from */
+ const char *src2_ip;
+ /* IP to bind to to block the source port */
+ const char *bind_ip;
+};
+
+#define DST_PORT 30000
+#define SRC_PORT 40000
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv4) {
+ .domain = AF_INET,
+ .dst_ip = "127.0.0.1",
+ .src1_ip = "127.1.1.1",
+ .src2_ip = "127.2.2.2",
+ .bind_ip = "127.3.3.3",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_share, ipv6) {
+ .domain = AF_INET6,
+ .dst_ip = "::1",
+ .src1_ip = "2001:db8::1",
+ .src2_ip = "2001:db8::2",
+ .bind_ip = "2001:db8::3",
+};
+
+FIXTURE_SETUP(tcp_port_share)
+{
+ ASSERT_EQ(unshare(CLONE_NEWNET), 0);
+ ASSERT_EQ(system("ip link set dev lo up"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::1/32 nodad"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::2/32 nodad"), 0);
+ ASSERT_EQ(system("ip addr add dev lo 2001:db8::3/32 nodad"), 0);
+ ASSERT_EQ(system("sysctl -wq net.ipv4.ip_local_port_range='40000 40000'"), 0);
+}
+
+FIXTURE_TEARDOWN(tcp_port_share) {}
+
+/* Check that an ephemeral port can be used again as soon as the socket bound to
+ * the port, blocking it from reuse, releases it.
+ */
+TEST_F(tcp_port_share, can_reuse_port_after_unbind)
+{
+ const typeof(variant) v = variant;
+ int c1, c2, ln, port_block;
+ struct sockaddr_inet addr;
+ const int one = 1;
+
+ /* Listen on <dst_ip>:<DST_PORT> */
+ ln = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(bind(ln, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+ ASSERT_EQ(listen(ln, 2), 0);
+
+ /* Connect from <src1_ip>:<SRC_PORT> */
+ c1 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(c1, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src1_ip, 0, &addr);
+ ASSERT_EQ(bind(c1, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(connect(c1, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+ ASSERT_EQ(getsockname_port(c1), SRC_PORT);
+
+ /* Bind to <bind_ip>:<SRC_PORT>. Block the port from reuse. */
+ port_block = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(port_block, 0) TH_LOG("socket(): %m");
+ ASSERT_EQ(setsockopt(port_block, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->bind_ip, SRC_PORT, &addr);
+ ASSERT_EQ(bind(port_block, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ /* Try to connect from <src2_ip>:<SRC_PORT>. Expect failure. */
+ c2 = socket(v->domain, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket");
+ ASSERT_EQ(setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one)), 0);
+
+ make_inet_addr(v->domain, v->src2_ip, 0, &addr);
+ ASSERT_EQ(bind(c2, &addr.sa, addr.len), 0) TH_LOG("bind(%s): %m", addr.str);
+
+ make_inet_addr(v->domain, v->dst_ip, DST_PORT, &addr);
+ ASSERT_EQ(connect(c2, &addr.sa, addr.len), -1) TH_LOG("connect(%s)", addr.str);
+ ASSERT_EQ(errno, EADDRNOTAVAIL) TH_LOG("%m");
+
+ /* Unbind from <bind_ip>:<SRC_PORT>. Unblock the port for reuse. */
+ ASSERT_EQ(close(port_block), 0);
+
+ /* Connect again from <src2_ip>:<SRC_PORT> */
+ EXPECT_EQ(connect(c2, &addr.sa, addr.len), 0) TH_LOG("connect(%s): %m", addr.str);
+ EXPECT_EQ(getsockname_port(c2), SRC_PORT);
+
+ ASSERT_EQ(close(c2), 0);
+ ASSERT_EQ(close(c1), 0);
+ ASSERT_EQ(close(ln), 0);
+}
+
+TEST_HARNESS_MAIN
--
2.43.0
* Re: [PATCH RFC net-next 1/2] tcp: Update bind bucket state on port release
From: Eric Dumazet @ 2025-08-08 11:43 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
On Fri, Aug 8, 2025 at 2:10 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index 1086256549fa..73f1dbc1a04b 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -279,6 +279,8 @@ enum {
> INET_FLAGS_RTALERT_ISOLATE = 28,
> INET_FLAGS_SNDFLOW = 29,
> INET_FLAGS_RTALERT = 30,
> + /* socket bound to a port at connect() time */
> + INET_FLAGS_LAZY_BIND = 31,
I am not a huge fan of this name. I think we already use something
like autobind.
I have not seen where you clear this bit. Once it has been set, does it
stick forever?
Perhaps add in the selftest something to call tcp_disconnect() :)
fd = socket()
connect(fd ...) // this sets the 'autobind' bit
connect(fd ... AF_UNSPEC ..) // disconnects
// reuse fd
bind(fd, .... port=X)
connect(fd ...) // after this point 'autobind' should not be set.
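
The disconnect step in the sequence above relies on connect() with an
AF_UNSPEC address, which connect(2) documents as the way to dissolve a
socket's association. A minimal C sketch of that step follows;
tcp_disconnect_fd() is a hypothetical helper name, not an existing API.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: dissolve the association of a TCP socket so that the
 * same fd can be bound and connected again.
 * Returns 0 on success, -1 with errno set on failure.
 */
static inline int tcp_disconnect_fd(int fd)
{
	struct sockaddr sa;

	assert(fd >= 0);
	memset(&sa, 0, sizeof(sa));
	sa.sa_family = AF_UNSPEC;	/* only the family field is meaningful */
	return connect(fd, &sa, sizeof(sa));
}
```

A selftest for the scenario above could call this helper between the first
connect() and the explicit bind(), then check that the port behaves as
explicitly bound rather than autobound.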
* Re: [PATCH RFC net-next 1/2] tcp: Update bind bucket state on port release
From: Jakub Sitnicki @ 2025-08-08 12:06 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, David S. Miller, Jakub Kicinski, Kuniyuki Iwashima,
Neal Cardwell, Paolo Abeni, kernel-team, Lee Valentine
On Fri, Aug 08, 2025 at 04:43 AM -07, Eric Dumazet wrote:
> On Fri, Aug 8, 2025 at 2:10 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
>> index 1086256549fa..73f1dbc1a04b 100644
>> --- a/include/net/inet_sock.h
>> +++ b/include/net/inet_sock.h
>> @@ -279,6 +279,8 @@ enum {
>> INET_FLAGS_RTALERT_ISOLATE = 28,
>> INET_FLAGS_SNDFLOW = 29,
>> INET_FLAGS_RTALERT = 30,
>> + /* socket bound to a port at connect() time */
>> + INET_FLAGS_LAZY_BIND = 31,
>
> I am not a huge fan of this name. I think we already use something
> like autobind.
Now that I think of it - it is just another autobind path. Will change.
> I have not seen where you clear this bit, once it has been set, it
> sticks forever ?
>
> Perhaps add in the selftest something to call tcp_disconnect() :)
>
> fd = socket()
> connect(fd ...) // this sets the 'autobind' bit
> connect(fd ... AF_UNSPEC ..) // disconnects
> // reuse fd
> bind(fd, .... port=X)
> connect(fd ...) // after this point 'autobind' should not be set.
You're right. That's not handled correctly at all. The bit should be
cleared on disconnect. Completely missed that scenario.
Thanks for reviewing!