From: Jakub Sitnicki <jakub@cloudflare.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Neal Cardwell <ncardwell@google.com>,
Kuniyuki Iwashima <kuniyu@google.com>,
netdev@vger.kernel.org, kernel-team@cloudflare.com,
Lee Valentine <lvalentine@cloudflare.com>
Subject: Re: [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
Date: Thu, 17 Jul 2025 13:21:24 +0200 [thread overview]
Message-ID: <87ecufdpvv.fsf@cloudflare.com> (raw)
In-Reply-To: <87qzyfdue6.fsf@cloudflare.com> (Jakub Sitnicki's message of "Thu, 17 Jul 2025 11:44:01 +0200")
On Thu, Jul 17, 2025 at 11:44 AM +02, Jakub Sitnicki wrote:
> On Thu, Jul 17, 2025 at 11:23 AM +02, Paolo Abeni wrote:
>> On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
>>> Solution
>>> --------
>>>
>>> If there is no IP address conflict with any socket bound to a given local
>>> port, then from the protocol's perspective, the port can be safely shared.
>>>
>>> With that in mind, modify the port search during connect(), that is
>>> __inet_hash_connect, to consider all bind buckets (ports) when looking for
>>> a local port for egress.
>>>
>>> To achieve this, add an extra walk over bhash2 buckets for the port to
>>> check for IP conflicts. The additional walk is not free, so perform it only
>>> once per port - during the second phase of conflict checking, when the
>>> bhash bucket is locked.
>>>
>>> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
>>> option is set. The rationale is that users are likely to care about using
>>> every possible local port only when they have deliberately constrained the
>>> ephemeral port range.
>>
>> I'm not a big fan of piggybacking additional semantic on existing
>> socketopt, have you considered a new one?
>
> That's a fair point. Though a dedicated sysctl seems more appropriate in
> this case. Akin to how we have ip_autobind_reuse to enable amore
> aggresive port sharing strategy on bind() side. How does that sound?
Thinking about this some more - if we're considering a dedicated sysctl
guard for this, perhaps this merits giving a shot to the more
comprehensive fix first.
That is to update the inet_bind_bucket state (fastreuse, fastreuseport)
on socket unbind to reflect the change in bucket owners. IOW, pivot to
one of the alternatives that I've highlighted:
| Alternatives
| ------------
|
| * Update bind bucket state on port release
|
| A valid solution to the described problem would also be to walk the bind
| bucket owners when releasing the port and recalculate the
| tb->{reuse,reuseport} state.
|
| However, in comparison to the proposed solution, this alone would not allow
| sharing the local port with other sockets bound to non-conflicting IPs for
| as long as they exist.
|
| Another downside is that we'd pay the extra cost on each unbind (more
| frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
| set (less frequent). Due to that we would also likely need to guard it
| behind a sysctl (see below).
Right now the inet_bind_bucket fastreuse{,port} state is being
mismanaged, IMO. This would be the fix for the actual root cause here.
next prev parent reply other threads:[~2025-07-17 11:21 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-14 16:03 [PATCH net-next v3 0/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 1/3] tcp: Add RCU management to inet_bind2_bucket Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
2025-07-17 9:23 ` Paolo Abeni
2025-07-17 9:44 ` Jakub Sitnicki
2025-07-17 11:21 ` Jakub Sitnicki [this message]
2025-07-14 16:03 ` [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios " Jakub Sitnicki
2025-07-17 9:27 ` Paolo Abeni
2025-07-17 9:44 ` Jakub Sitnicki
2025-07-17 9:34 ` Paolo Abeni
2025-07-17 9:50 ` Jakub Sitnicki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87ecufdpvv.fsf@cloudflare.com \
--to=jakub@cloudflare.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kernel-team@cloudflare.com \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=lvalentine@cloudflare.com \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.