public inbox for netdev@vger.kernel.org
From: Fernando Fernandez Mancera <fmancera@suse.de>
To: Kuniyuki Iwashima <kuniyu@google.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	ij@kernel.org, chia-yu.chang@nokia-bell-labs.com,
	idosch@nvidia.com, willemb@google.com, dsahern@kernel.org,
	ncardwell@google.com, corbet@lwn.net, horms@kernel.org,
	pabeni@redhat.com, kuba@kernel.org, edumazet@google.com,
	davem@davemloft.net
Subject: Re: [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution
Date: Thu, 26 Feb 2026 11:39:11 +0100
Message-ID: <06ae5fc9-fb72-44cc-bb63-941f52a2b70d@suse.de>
In-Reply-To: <CAAVpQUBMPoUj57LGEyr9m4E54CTLeabSH2aZca+EcYzYRNBfXA@mail.gmail.com>

On 2/25/26 6:33 PM, Kuniyuki Iwashima wrote:
> On Wed, Feb 25, 2026 at 2:03 AM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
>>
>> On 2/25/26 7:28 AM, Kuniyuki Iwashima wrote:
>>> On Tue, Feb 24, 2026 at 7:05 AM Fernando Fernandez Mancera
>>> <fmancera@suse.de> wrote:
>>>>
>>>> With the current port selection algorithm, ports after a reserved port
>>>> range or a long-lived port are used more often than others [1]. This
>>>> causes an uneven port usage distribution. This combines with cloud
>>>> environments blocking connections between the application server and the
>>>> database server if there was a previous connection with the same source
>>>> port, leading to connectivity problems between applications on cloud
>>>> environments.
>>>>
>>>> The real issue here is that these firewalls cannot cope with
>>>> standards-compliant port reuse. This is a workaround for such situations
>>>> and an improvement on the distribution of ports selected.
>>>>
>>>> The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
>>>> The step size is selected randomly on every connect() call, ensuring it
>>>> is coprime with the size of the range of ports we want to
>>>> scan. This way, we can ensure that all ports within the range are
>>>> scanned before returning an error. To enable this algorithm, the user
>>>> must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
>>>>
>>>> In addition, on graphs generated we can observe that the distribution of
>>>> source ports is more even with the proposed approach. [2]
>>>>
>>>> [1] https://0xffsoftware.com/port_graph_current_alg.html
>>>>
>>>> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>>>>
>>>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>>>> ---
>>>>    Documentation/networking/ip-sysctl.rst        |  9 ++++++++
>>>>    .../net_cachelines/netns_ipv4_sysctl.rst      |  1 +
>>>>    include/net/netns/ipv4.h                      |  1 +
>>>>    net/ipv4/inet_hashtables.c                    | 22 ++++++++++++++++---
>>>>    net/ipv4/sysctl_net_ipv4.c                    |  7 ++++++
>>>>    5 files changed, 37 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>>>> index 6921d8594b84..9e2625ee778c 100644
>>>> --- a/Documentation/networking/ip-sysctl.rst
>>>> +++ b/Documentation/networking/ip-sysctl.rst
>>>> @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges
>>>>
>>>>           Default: Empty
>>>>
>>>> +ip_local_port_step_width - INTEGER
>>>> +        Defines the numerical maximum increment between successive port
>>>> +        allocations within the ephemeral port range when an unavailable port is
>>>> +        reached. This can be used to mitigate accumulated nodes in port
>>>> +        distribution when reserved ports have been configured. Please note that
>>>> +        port collisions may be more frequent in a system with a very high load.
>>>> +
>>>> +        Default: 0 (disabled)
>>>> +
>>>>    ip_unprivileged_port_start - INTEGER
>>>>           This is a per-namespace sysctl.  It defines the first
>>>>           unprivileged port in the network namespace.  Privileged ports
>>>> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> index beaf1880a19b..c0e194a6e4ee 100644
>>>> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> @@ -47,6 +47,7 @@ u8                              sysctl_tcp_ecn
>>>>    u8                              sysctl_tcp_ecn_fallback
>>>>    u8                              sysctl_ip_default_ttl                                                                ip4_dst_hoplimit/ip_select_ttl
>>>>    u8                              sysctl_ip_no_pmtu_disc
>>>> +u32                             sysctl_ip_local_port_step_width
>>>>    u8                              sysctl_ip_fwd_use_pmtu                       read_mostly                             ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
>>>>    u8                              sysctl_ip_fwd_update_priority                                                        ip_forward
>>>>    u8                              sysctl_ip_nonlocal_bind
>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>> index 8e971c7bf164..fb7c2235af21 100644
>>>> --- a/include/net/netns/ipv4.h
>>>> +++ b/include/net/netns/ipv4.h
>>>> @@ -166,6 +166,7 @@ struct netns_ipv4 {
>>>>           u8 sysctl_ip_autobind_reuse;
>>>>           /* Shall we try to damage output packets if routing dev changes? */
>>>>           u8 sysctl_ip_dynaddr;
>>>> +       u32 sysctl_ip_local_port_step_width;
>>>>    #ifdef CONFIG_NET_L3_MASTER_DEV
>>>>           u8 sysctl_raw_l3mdev_accept;
>>>>    #endif
>>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>>> index f5826ec4bcaa..1992dc21818f 100644
>>>> --- a/net/ipv4/inet_hashtables.c
>>>> +++ b/net/ipv4/inet_hashtables.c
>>>> @@ -16,6 +16,7 @@
>>>>    #include <linux/wait.h>
>>>>    #include <linux/vmalloc.h>
>>>>    #include <linux/memblock.h>
>>>> +#include <linux/gcd.h>
>>>>
>>>>    #include <net/addrconf.h>
>>>>    #include <net/inet_connection_sock.h>
>>>> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>           struct net *net = sock_net(sk);
>>>>           struct inet_bind2_bucket *tb2;
>>>>           struct inet_bind_bucket *tb;
>>>> +       int step, scan_step, l3mdev;
>>>> +       u32 index, max_rand_step;
>>>>           bool tb_created = false;
>>>>           u32 remaining, offset;
>>>>           int ret, i, low, high;
>>>>           bool local_ports;
>>>> -       int step, l3mdev;
>>>> -       u32 index;
>>>>
>>>>           if (port) {
>>>>                   local_bh_disable();
>>>> @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>
>>>>           local_ports = inet_sk_get_local_port_range(sk, &low, &high);
>>>>           step = local_ports ? 1 : 2;
>>>> +       scan_step = step;
>>>> +       max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width);
>>>>
>>>>           high++; /* [32768, 60999] -> [32768, 61000[ */
>>>>           remaining = high - low;
>>>> @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>            */
>>>>           if (!local_ports)
>>>>                   offset &= ~1U;
>>>> +
>>>> +       if (max_rand_step && remaining > 1) {
>>>> +               u32 range = (step == 1) ? remaining : (remaining / 2);
>>>> +               u32 upper_bound = min(range, max_rand_step);
>>>> +
>>>> +               scan_step = get_random_u32_inclusive(1, upper_bound);
>>>> +               while (gcd(scan_step, range) != 1) {
>>>> +                       scan_step++;
>>>
>>> If both scan_step and range are even, an extra
>>> increment here saves half of the gcd() calls.
>>>
>>
>> Ah right, thanks!
>>
>>>
>>>> +                       if (unlikely(scan_step > upper_bound))
>>>> +                               scan_step = 1;
>>>> +               }
>>>> +               scan_step *= step;
>>>> +       }
>>>>    other_parity_scan:
>>>
>>> Won't doing "other_parity_scan" be just redundant
>>> unless scan_step is 2?
>>>
>>
>> I have tried to preserve the parity behavior. Maybe I missed something,
>> let me explain why it isn't redundant in my opinion.
>>
>> In essence, when calculating the range we first look at "step". If step
>> == 1 we use all the remaining ports as range, otherwise we use remaining/2.
>>
>> If step == 1 we do not care about parity so let's look at step == 2.
>>
>> If step == 2, we calculate a scan_step that is coprime with remaining/2.
>> Once we have it, we multiply it by 2 to make sure scan_step is even.
> 
> Ah, I missed scan_step *= step.  Then looks good.
> Maybe we can set range = remaining / step similarly.

Yes, let's do that. Thanks!
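
For reference, here is a rough user-space sketch of how the step selection
might look with both suggestions from this thread folded in (range =
remaining / step, and skipping even candidates when range is even). It is
not the actual next revision of the patch. pick_scan_step(),
visits_every_slot() and gcd_u32() are hypothetical names used only for
illustration, and the "seed" argument stands in for
get_random_u32_inclusive() so that the result is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Euclid's algorithm; user-space stand-in for the kernel's gcd(). */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper, not taken from the patch.
 * remaining:     number of ports in the local port range
 * step:          1 (any port) or 2 (stay on one parity)
 * max_rand_step: plays the role of the sysctl value, 0 means disabled
 * seed:          replaces the kernel random source (assumption)
 */
static uint32_t pick_scan_step(uint32_t remaining, uint32_t step,
			       uint32_t max_rand_step, uint32_t seed)
{
	uint32_t range, upper_bound, scan_step;

	assert(step == 1 || step == 2);
	if (!max_rand_step || remaining <= 1)
		return step;	/* disabled: keep the sequential scan */

	range = remaining / step;
	upper_bound = range < max_rand_step ? range : max_rand_step;
	scan_step = 1 + seed % upper_bound;	/* in [1, upper_bound] */

	while (gcd_u32(scan_step, range) != 1) {
		scan_step++;
		/* An even candidate is never coprime with an even range. */
		if (!(range & 1) && !(scan_step & 1))
			scan_step++;
		if (scan_step > upper_bound)
			scan_step = 1;	/* 1 is coprime with everything */
	}
	return scan_step * step;	/* step == 2 keeps the stride even */
}

/* True if a walk with the given stride (mod range) hits each slot once. */
static bool visits_every_slot(uint32_t range, uint32_t stride, uint32_t start)
{
	bool *seen = calloc(range, sizeof(*seen));
	uint32_t i, pos = start % range;
	bool ok = seen != NULL;

	for (i = 0; ok && i < range; i++) {
		if (seen[pos])
			ok = false;
		seen[pos] = true;
		pos = (pos + stride) % range;
	}
	free(seen);
	return ok;
}
```

With the [32768, 61000[ range from the comment in the diff (remaining =
28232) and step == 2, range is 14116. A seed of 5 gives a first candidate
of 6, which shares a factor of 2 with range, so the loop moves on to 7 and
the function returns 14. visits_every_slot(14116, 7, 0) then confirms that
every same-parity slot is reached exactly once before the scan wraps.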


Thread overview:
2026-02-24 15:05 [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution Fernando Fernandez Mancera
2026-02-25  6:28 ` Kuniyuki Iwashima
2026-02-25 10:02   ` Fernando Fernandez Mancera
2026-02-25 17:33     ` Kuniyuki Iwashima
2026-02-26 10:39       ` Fernando Fernandez Mancera [this message]
