From: Jakub Sitnicki <jakub@cloudflare.com>
To: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>,
    Eric Dumazet <edumazet@google.com>,
    Jakub Kicinski <kuba@kernel.org>,
    Kuniyuki Iwashima <kuniyu@google.com>,
    Neal Cardwell <ncardwell@google.com>,
    Paolo Abeni <pabeni@redhat.com>,
    kernel-team@cloudflare.com,
    Lee Valentine <lvalentine@cloudflare.com>
Subject: [PATCH net-next v2 0/2] tcp: Update bind bucket state on port release
Date: Thu, 21 Aug 2025 13:09:13 +0200
Message-ID: <20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com>
TL;DR
-----
This is another take on the issue we raised earlier [1]. This time around,
instead of trying to relax the bind-conflict checks in connect(), we attempt
to fix the TCP bind bucket state accounting.

The goal of this patch set is to make a bind bucket return to the "port
reusable by ephemeral connections" state once all sockets blocking the port
from reuse have been unhashed.
Changelog
---------
Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com
Situation
---------
We observe the following scenario in production:
                                                  inet_bind_bucket
                                                state for port 54321
                                                --------------------

                                                (bucket doesn't exist)

// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
                                                tb->fastreuse = -1
                                                tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()

                                                // bucket remains in this
                                                // state even though port
                                                // was released by s2
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)
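The scenario above can be approximated on a single host. What follows is a
minimal Python sketch, not the selftest from patch 2. It makes several
assumptions: loopback addresses 127.0.0.10/20/30 stand in for the 192.0.2.x
sources, a local listener stands in for the remote peer, the port range is
narrowed to the single port 54321 to emulate "only port 54321 is still
available", and s2 is closed right after bind() so that no TIME_WAIT socket
is involved. The helper names (port_range_value, late_bind_connect,
reproduce) are made up for illustration; the two option values are taken
from <linux/in.h>.

```python
import errno
import socket
import struct

# Linux UAPI values from <linux/in.h>, spelled out here because not every
# Python version exports them from the socket module.
IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51


def port_range_value(lo, hi):
    """Encode a range for IP_LOCAL_PORT_RANGE: upper bound in the high
    16 bits, lower bound in the low 16 bits."""
    return (hi << 16) | lo


def late_bind_connect(src_ip, port, dst):
    """Like s1/s3: bind to an address only, let connect() pick the port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        # Pass the value as 4 raw bytes; it does not fit a signed int.
        s.setsockopt(socket.IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                     struct.pack("I", port_range_value(port, port)))
        s.bind((src_ip, 0))
        s.connect(dst)
    except OSError:
        s.close()
        raise
    return s


def reproduce(port=54321):
    """Replay the scenario; return "connected" or the errno name for s3."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))
        srv.listen(8)
        dst = srv.getsockname()

        s1 = late_bind_connect("127.0.0.10", port, dst)  # stays open

        # Like s2: bind() takes the port right away, then the socket is gone.
        s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s2.setsockopt(socket.IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                      struct.pack("I", port_range_value(port, port)))
        s2.bind(("127.0.0.20", 0))
        s2.close()

        try:
            s3 = late_bind_connect("127.0.0.30", port, dst)
        except OSError as e:
            return errno.errorcode.get(e.errno, str(e.errno))
        finally:
            s1.close()
        s3.close()
        return "connected"
```

Calling reproduce() requires Linux with IP_LOCAL_PORT_RANGE support and
port 54321 not otherwise in use. On a kernel without this series the
expectation is that it returns "EADDRNOTAVAIL"; with the series applied the
intended result is "connected". Neither outcome is guaranteed by this
sketch.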
Problem
-------
We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to the "reusable" state even when all sockets that
blocked it from reuse, like s2, are gone.

The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.
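To make the state transitions concrete, here is a toy Python model of the
bucket, replaying s1 and s2 from the scenario. It is an illustration only
and not the code from patch 1. The classes Sock and BindBucket, the
autobind attribute, and the update_state switch are hypothetical. The model
assumes that connect() may pick a port only while both hints are negative,
and it treats "all remaining owners got their port from connect()" as the
condition for returning to the reusable state.

```python
from dataclasses import dataclass, field


@dataclass
class Sock:
    name: str
    autobind: bool  # True: port chosen by connect(); False: chosen by bind()


@dataclass
class BindBucket:
    """Toy stand-in for inet_bind_bucket; tracks only the two reuse hints."""
    port: int
    owners: list = field(default_factory=list)
    fastreuse: int = -1
    fastreuseport: int = -1

    def add(self, sk):
        if not sk.autobind:
            # As in the scenario: an explicit bind() leaves the bucket in a
            # state that connect()-time port selection will not touch.
            self.fastreuse = 0
            self.fastreuseport = 0
        self.owners.append(sk)

    def remove(self, sk, update_state):
        self.owners.remove(sk)
        # Assumed recalculation (the goal of the series, not its code): if
        # every remaining owner got its port from connect(), reset the hints.
        if update_state and self.owners and all(o.autobind for o in self.owners):
            self.fastreuse = -1
            self.fastreuseport = -1

    def usable_by_connect(self):
        # Assumption: connect() skips the bucket unless both hints are < 0.
        return self.fastreuse < 0 and self.fastreuseport < 0


def replay(update_state):
    """s1 connects, s2 binds and goes away; can s3 still use the port?"""
    tb = BindBucket(port=54321)
    s1 = Sock("s1", autobind=True)
    s2 = Sock("s2", autobind=False)
    tb.add(s1)
    tb.add(s2)
    tb.remove(s2, update_state)
    return tb.usable_by_connect()
```

In this model replay(False) corresponds to the behavior described above,
where the bucket stays blocked after s2 is gone, and replay(True)
corresponds to the state the series aims for.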

Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, in practice such implementations are not uncommon.
Trying to fix each and every such program is like playing whack-a-mole.
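The difference between the two client styles comes down to one socket
option set before bind(). A minimal Linux-only sketch in Python follows;
bind_source is a hypothetical helper and the option value is taken from
<linux/in.h>:

```python
import socket

IP_BIND_ADDRESS_NO_PORT = 24  # from <linux/in.h>


def bind_source(src_ip, defer_port):
    """Bind a TCP socket to a source address with port 0.

    With defer_port=False the kernel allocates a local port at bind() time,
    as Process B does. With defer_port=True port selection is left to
    connect(), as Process A does.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if defer_port:
        s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    s.bind((src_ip, 0))
    return s
```

After bind_source("127.0.0.1", False), getsockname() already reports a
non-zero port. After bind_source("127.0.0.1", True), it reports port 0
until connect() is called.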

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

    dialer := &net.Dialer{
        LocalAddr: &net.TCPAddr{IP: srcIP},
    }
    conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

    $ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.
Solution
--------
Please see the description in patch 1.
[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com
Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Jakub Sitnicki (2):
      tcp: Update bind bucket state on port release
      selftests/net: Test tcp port reuse after unbinding a socket

 include/net/inet_connection_sock.h           |   5 +-
 include/net/inet_hashtables.h                |   2 +
 include/net/inet_sock.h                      |   2 +
 include/net/inet_timewait_sock.h             |   3 +-
 include/net/tcp.h                            |  15 ++
 net/ipv4/inet_connection_sock.c              |  12 +-
 net/ipv4/inet_hashtables.c                   |  32 +++-
 net/ipv4/inet_timewait_sock.c                |   1 +
 tools/testing/selftests/net/Makefile         |   1 +
 tools/testing/selftests/net/tcp_port_share.c | 258 +++++++++++++++++++++++++++
 10 files changed, 323 insertions(+), 8 deletions(-)