All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org,  "David S. Miller" <davem@davemloft.net>,
	 Eric Dumazet <edumazet@google.com>,
	 Jakub Kicinski <kuba@kernel.org>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	 Neal Cardwell <ncardwell@google.com>,
	 kernel-team@cloudflare.com,
	 Lee Valentine <lvalentine@cloudflare.com>
Subject: Re: [PATCH net-next v4 1/2] tcp: Update bind bucket state on port release
Date: Tue, 16 Sep 2025 15:14:57 +0200	[thread overview]
Message-ID: <875xdi5yjy.fsf@cloudflare.com> (raw)
In-Reply-To: <b22af0eb-e50b-4d5c-a5bc-eb475388da10@redhat.com> (Paolo Abeni's message of "Tue, 16 Sep 2025 12:14:01 +0200")

On Tue, Sep 16, 2025 at 12:14 PM +02, Paolo Abeni wrote:
> On 9/13/25 12:09 PM, Jakub Sitnicki wrote:
>> Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
>> fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
>> in that state until all sockets are removed and the bucket is destroyed.
>> 
>> In this state, the bucket is skipped during ephemeral port selection in
>> connect(). For applications using a reduced ephemeral port
>> range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
>> exhaustion since blocked buckets are excluded from reuse.
>> 
>> The reason the bucket state isn't updated on port release is unclear.
>> Possibly a performance trade-off to avoid scanning bucket owners, or just
>> an oversight.
>> 
>> Fix it by recalculating the bucket state when a socket releases a port. To
>> limit overhead, each inet_bind2_bucket stores its own (fastreuse,
>> fastreuseport) state. On port release, only the relevant port-addr bucket
>> is scanned, and the overall state is derived from these.
>
> I'm possibly likely lost, but I think that the bucket state could change
> even after inet_bhash2_update_saddr(), but AFAICS it's not updated there.

Let me double check if I understand what you have in mind because now I
also feel a bit lost :-)

We already update the bucket state in inet_bhash2_update_saddr(). I
assume we are talking about the main body, not the early bailout path
when the socket is not bound yet [1].

This code gets called only in the obscure (?) case when ip_dynaddr [2]
sysctl is set, and we have a routing failure during connection setup
phase (SYN-SENT).

In such case, on source address update, call to
inet_bind2_bucket_destroy() will recalculate port-addr bucket state,
potentially "downgrading" it to (fastreuse=-1, fastreuseport=-1).

But if the "downgrade" happens, it changes nothing for the port bucket
state, as we are about to re-add the socket into another port-addr
bucket.

Now, adding a CONNECT_BIND socket to an existing port-addr bucket, that
also has no side effects. We can't "upgrade" the bucket to the shareable
state (fastreuse=-1, fastreuseport=-1).

That said, I do see an unaddressed corner case now that I audit this
code again. If we end up _creating_ a new inet_bind2_bucket
(__inet_bhash2_update_saddr->inet_bind2_bucket_init), then the bucket
state should be initialized to (fastreuse=-1, fastreuseport=-1) when the
socket has the CONNECT_BIND flag set.

The call chain I'm referring to:

tcp_connect / __tcp_retransmit_skb
  ->rebuild_header
    inet_sk_rebuild_header
      inet_sk_reselect_saddr IFF sysctl_ip_dynaddr != 0
        inet_bhash2_update_saddr
          __inet_bhash2_update_saddr
            inet_bind2_bucket_init

I propose to handle that by checking if the socket has CONNECT_BIND flag
set and overwriting the port-addr bucket state similiar to like I did in
__inet_hash_connect:

	tb2 = inet_bind2_bucket_find(head2, net, port, l3mdev, sk);
	if (!tb2) {
		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net,
					       head2, tb, sk);
		if (!tb2)
			goto error;
		tb2->fastreuse = -1;
		tb2->fastreuseport = -1;
	}

So the obscure ip_dynadr path does need a fixup. Other than that I'm not
able to poke any other holes in how we manage the bucket state.

Was that your concern or you had something else in mind?

Thanks,
-jkbs

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/inet_hashtables.c?h=v6.17-rc6#n914
[2] https://docs.kernel.org/networking/ip_dynaddr.html

  reply	other threads:[~2025-09-16 13:15 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-13 10:09 [PATCH net-next v4 0/2] tcp: Update bind bucket state on port release Jakub Sitnicki
2025-09-13 10:09 ` [PATCH net-next v4 1/2] " Jakub Sitnicki
2025-09-16  5:45   ` Kuniyuki Iwashima
2025-09-16 10:14   ` Paolo Abeni
2025-09-16 13:14     ` Jakub Sitnicki [this message]
2025-09-23  7:45       ` Paolo Abeni
2025-09-13 10:09 ` [PATCH net-next v4 2/2] selftests/net: Test tcp port reuse after unbinding a socket Jakub Sitnicki
2025-09-16  5:50   ` Kuniyuki Iwashima

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875xdi5yjy.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=lvalentine@cloudflare.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.