* [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution
@ 2026-03-03 17:29 Fernando Fernandez Mancera
2026-03-03 21:18 ` Kuniyuki Iwashima
2026-03-04 7:05 ` Eric Dumazet
0 siblings, 2 replies; 5+ messages in thread
From: Fernando Fernandez Mancera @ 2026-03-03 17:29 UTC (permalink / raw)
To: netdev
Cc: linux-doc, linux-kernel, chia-yu.chang, idosch, willemb, dsahern,
kuniyu, ncardwell, skhan, corbet, horms, pabeni, kuba, edumazet,
davem, Fernando Fernandez Mancera
With the current port selection algorithm, ports that follow a reserved
port range or a long-lived port are selected more often than others [1].
This causes an uneven port usage distribution. In addition, some cloud
environments block connections between the application server and the
database server if a previous connection used the same source port,
which leads to connectivity problems between applications in those
environments.
The real issue here is that these firewalls cannot cope with
standards-compliant port reuse. This is a workaround for such situations
and an improvement on the distribution of ports selected.
The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
The step size is selected randomly on every connect() call, ensuring it
is coprime with the size of the range of ports we want to scan. This
way, we can ensure that all ports within the range are scanned before
returning an error. To enable this algorithm, the user must configure
the new sysctl option "net.ipv4.ip_local_port_step_width".
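The coverage property can be checked with a small userspace sketch. It
is not kernel code: step_covers_range() is a hypothetical helper, and
it only assumes the scan advances by a fixed step modulo the range size,
as described above. It reports whether every slot is visited exactly
once in `range` iterations.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper, not part of the patch.
 * Walk `range` slots starting at `offset`, advancing by `step` modulo
 * `range`. Return true if no slot is visited twice during `range`
 * iterations, which implies every slot was visited exactly once.
 * Assumes range <= 2^31 so that the position arithmetic cannot overflow.
 */
static bool step_covers_range(uint32_t range, uint32_t offset, uint32_t step)
{
	bool *seen;
	bool ok = true;
	uint32_t pos, i;

	if (range == 0)
		return false;

	seen = calloc(range, sizeof(*seen));
	if (!seen)
		return false;

	pos = offset % range;
	for (i = 0; i < range; i++) {
		if (seen[pos]) {
			/* Revisited a slot before covering the range. */
			ok = false;
			break;
		}
		seen[pos] = true;
		pos = (pos + step % range) % range;
	}

	free(seen);
	return ok;
}
```

With a range of 12, a step of 5 (coprime with 12) covers every slot,
while a step of 6 keeps returning to the same two slots.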
In addition, the generated graphs show that the distribution of source
ports is more even with the proposed approach [2].
[1] https://0xffsoftware.com/port_graph_current_alg.html
[2] https://0xffsoftware.com/port_graph_random_step_alg.html
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
v2: used step to calculate remaining as (remaining / step) and avoid
calculating gcd when scan_step and range are both even
v3: xmas tree formatting and break the gcd() loop once scan_step is 1
---
Documentation/networking/ip-sysctl.rst | 9 ++++++
.../net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/netns/ipv4.h | 1 +
net/ipv4/inet_hashtables.c | 28 +++++++++++++++++--
net/ipv4/sysctl_net_ipv4.c | 7 +++++
5 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 265158534cda..da29806700e9 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1630,6 +1630,15 @@ ip_local_reserved_ports - list of comma separated ranges
Default: Empty
+ip_local_port_step_width - INTEGER
+ Defines the numerical maximum increment between successive port
+ allocations within the ephemeral port range when an unavailable port is
+ reached. This can be used to mitigate accumulated nodes in port
+ distribution when reserved ports have been configured. Please note that
+ port collisions may be more frequent in a system with a very high load.
+
+ Default: 0 (disabled)
+
ip_unprivileged_port_start - INTEGER
This is a per-namespace sysctl. It defines the first
unprivileged port in the network namespace. Privileged ports
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index beaf1880a19b..cf284263e69b 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -52,6 +52,7 @@ u8 sysctl_ip_fwd_update_priority
u8 sysctl_ip_nonlocal_bind
u8 sysctl_ip_autobind_reuse
u8 sysctl_ip_dynaddr
+u32 sysctl_ip_local_port_step_width
u8 sysctl_ip_early_demux read_mostly ip(6)_rcv_finish_core
u8 sysctl_raw_l3mdev_accept
u8 sysctl_tcp_early_demux read_mostly ip(6)_rcv_finish_core
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e971c7bf164..fb7c2235af21 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -166,6 +166,7 @@ struct netns_ipv4 {
u8 sysctl_ip_autobind_reuse;
/* Shall we try to damage output packets if routing dev changes? */
u8 sysctl_ip_dynaddr;
+ u32 sysctl_ip_local_port_step_width;
#ifdef CONFIG_NET_L3_MASTER_DEV
u8 sysctl_raw_l3mdev_accept;
#endif
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index fca980772c81..86b0c6d2c25d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -16,6 +16,7 @@
#include <linux/wait.h>
#include <linux/vmalloc.h>
#include <linux/memblock.h>
+#include <linux/gcd.h>
#include <net/addrconf.h>
#include <net/inet_connection_sock.h>
@@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct net *net = sock_net(sk);
struct inet_bind2_bucket *tb2;
struct inet_bind_bucket *tb;
+ int step, scan_step, l3mdev;
+ u32 index, max_rand_step;
bool tb_created = false;
u32 remaining, offset;
int ret, i, low, high;
bool local_ports;
- int step, l3mdev;
- u32 index;
if (port) {
local_bh_disable();
@@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
local_ports = inet_sk_get_local_port_range(sk, &low, &high);
step = local_ports ? 1 : 2;
+ scan_step = step;
+ max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width);
high++; /* [32768, 60999] -> [32768, 61000[ */
remaining = high - low;
@@ -1083,9 +1086,28 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
*/
if (!local_ports)
offset &= ~1U;
+
+ if (max_rand_step && remaining > 1) {
+ u32 range = remaining / step;
+ u32 upper_bound;
+
+ upper_bound = min(range, max_rand_step);
+ scan_step = get_random_u32_inclusive(1, upper_bound);
+ while (gcd(scan_step, range) != 1) {
+ scan_step++;
+ /* if both scan_step and range are even gcd won't be 1 */
+ if (!(scan_step & 1) && !(range & 1))
+ scan_step++;
+ if (unlikely(scan_step > upper_bound)) {
+ scan_step = 1;
+ break;
+ }
+ }
+ scan_step *= step;
+ }
other_parity_scan:
port = low + offset;
- for (i = 0; i < remaining; i += step, port += step) {
+ for (i = 0; i < remaining; i += step, port += scan_step) {
if (unlikely(port >= high))
port -= remaining;
if (inet_is_local_reserved_port(net, port))
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 643763bc2142..c533374f656c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = ipv4_local_port_range,
},
+ {
+ .procname = "ip_local_port_step_width",
+ .maxlen = sizeof(u32),
+ .data = &init_net.ipv4.sysctl_ip_local_port_step_width,
+ .mode = 0644,
+ .proc_handler = proc_douintvec,
+ },
{
.procname = "ip_local_reserved_ports",
.data = &init_net.ipv4.sysctl_local_reserved_ports,
--
2.53.0
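
For readers who want to experiment outside the kernel, the following is
a userspace sketch of the step-selection loop from the
net/ipv4/inet_hashtables.c hunk above. pick_scan_step() and gcd_u32()
are hypothetical names, and the initial candidate is passed in
explicitly instead of being drawn with get_random_u32_inclusive(), so
that the result is deterministic. It is an illustration of the loop
under those assumptions, not the kernel implementation.

```c
#include <stdint.h>

/* Plain Euclidean gcd, standing in for the kernel's gcd() helper. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical userspace mirror of the loop in the patch.
 * Starting from `candidate`, search upwards for a step that is coprime
 * with `range`. Fall back to 1 once the search passes `upper_bound`.
 */
static uint32_t pick_scan_step(uint32_t candidate, uint32_t range,
			       uint32_t upper_bound)
{
	uint32_t scan_step = candidate;

	while (gcd_u32(scan_step, range) != 1) {
		scan_step++;
		/* If both values are even, the gcd cannot be 1: skip ahead. */
		if (!(scan_step & 1) && !(range & 1))
			scan_step++;
		if (scan_step > upper_bound)
			return 1;
	}
	return scan_step;
}
```

For example, with a range of 12 and an upper bound of 12, a candidate
of 4 is advanced to 5, and a candidate of 3 is advanced past 4 to 5 as
well. With an upper bound of 6, a candidate of 6 falls back to 1.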
* Re: [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution
2026-03-03 17:29 [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution Fernando Fernandez Mancera
@ 2026-03-03 21:18 ` Kuniyuki Iwashima
2026-03-04 7:05 ` Eric Dumazet
1 sibling, 0 replies; 5+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-03 21:18 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, linux-doc, linux-kernel, chia-yu.chang, idosch, willemb,
dsahern, ncardwell, skhan, corbet, horms, pabeni, kuba, edumazet,
davem
On Tue, Mar 3, 2026 at 9:30 AM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> With the current port selection algorithm, ports after a reserved port
> range or long time used port are used more often than others [1]. This
> causes an uneven port usage distribution. This combines with cloud
> environments blocking connections between the application server and the
> database server if there was a previous connection with the same source
> port, leading to connectivity problems between applications on cloud
> environments.
>
> The real issue here is that these firewalls cannot cope with
> standards-compliant port reuse. This is a workaround for such situations
> and an improvement on the distribution of ports selected.
>
> The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
> The step size is selected randomly on every connect() call ensuring it
> is a coprime with respect to the size of the range of ports we want to
> scan. This way, we can ensure that all ports within the range are
> scanned before returning an error. To enable this algorithm, the user
> must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
>
> In addition, on graphs generated we can observe that the distribution of
> source ports is more even with the proposed approach. [2]
>
> [1] https://0xffsoftware.com/port_graph_current_alg.html
>
> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
* Re: [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution
2026-03-03 17:29 [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution Fernando Fernandez Mancera
2026-03-03 21:18 ` Kuniyuki Iwashima
@ 2026-03-04 7:05 ` Eric Dumazet
2026-03-04 9:54 ` Fernando Fernandez Mancera
1 sibling, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2026-03-04 7:05 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, linux-doc, linux-kernel, chia-yu.chang, idosch, willemb,
dsahern, kuniyu, ncardwell, skhan, corbet, horms, pabeni, kuba,
davem
On Tue, Mar 3, 2026 at 6:30 PM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> With the current port selection algorithm, ports after a reserved port
> range or long time used port are used more often than others [1]. This
> causes an uneven port usage distribution. This combines with cloud
> environments blocking connections between the application server and the
> database server if there was a previous connection with the same source
> port, leading to connectivity problems between applications on cloud
> environments.
>
> The real issue here is that these firewalls cannot cope with
> standards-compliant port reuse. This is a workaround for such situations
> and an improvement on the distribution of ports selected.
>
> The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
> The step size is selected randomly on every connect() call ensuring it
> is a coprime with respect to the size of the range of ports we want to
> scan. This way, we can ensure that all ports within the range are
> scanned before returning an error. To enable this algorithm, the user
> must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
>
> In addition, on graphs generated we can observe that the distribution of
> source ports is more even with the proposed approach. [2]
>
> [1] https://0xffsoftware.com/port_graph_current_alg.html
>
> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> ---
> v2: used step to calculate remaining as (remaining / step) and avoid
> calculating gcd when scan_step and range are both even
> v3: xmas tree formatting and break the gcd() loop once scan_step is 1
> ---
> Documentation/networking/ip-sysctl.rst | 9 ++++++
> .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/inet_hashtables.c | 28 +++++++++++++++++--
> net/ipv4/sysctl_net_ipv4.c | 7 +++++
> 5 files changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 265158534cda..da29806700e9 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1630,6 +1630,15 @@ ip_local_reserved_ports - list of comma separated ranges
>
> Default: Empty
>
> +ip_local_port_step_width - INTEGER
> + Defines the numerical maximum increment between successive port
> + allocations within the ephemeral port range when an unavailable port is
> + reached. This can be used to mitigate accumulated nodes in port
> + distribution when reserved ports have been configured. Please note that
> + port collisions may be more frequent in a system with a very high load.
> +
Patch SGTM, but I find this documentation obscure.
Some guidance would be nice. What values have you tested/tried ?
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution
2026-03-04 7:05 ` Eric Dumazet
@ 2026-03-04 9:54 ` Fernando Fernandez Mancera
2026-03-05 1:46 ` Jakub Kicinski
0 siblings, 1 reply; 5+ messages in thread
From: Fernando Fernandez Mancera @ 2026-03-04 9:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, linux-doc, linux-kernel, chia-yu.chang, idosch, willemb,
dsahern, kuniyu, ncardwell, skhan, corbet, horms, pabeni, kuba,
davem
On 3/4/26 8:05 AM, Eric Dumazet wrote:
> On Tue, Mar 3, 2026 at 6:30 PM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
>>
>> With the current port selection algorithm, ports after a reserved port
>> range or long time used port are used more often than others [1]. This
>> causes an uneven port usage distribution. This combines with cloud
>> environments blocking connections between the application server and the
>> database server if there was a previous connection with the same source
>> port, leading to connectivity problems between applications on cloud
>> environments.
>>
>> The real issue here is that these firewalls cannot cope with
>> standards-compliant port reuse. This is a workaround for such situations
>> and an improvement on the distribution of ports selected.
>>
>> The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
>> The step size is selected randomly on every connect() call ensuring it
>> is a coprime with respect to the size of the range of ports we want to
>> scan. This way, we can ensure that all ports within the range are
>> scanned before returning an error. To enable this algorithm, the user
>> must configure the new sysctl option "net.ipv4.ip_local_port_step_width".
>>
>> In addition, on graphs generated we can observe that the distribution of
>> source ports is more even with the proposed approach. [2]
>>
>> [1] https://0xffsoftware.com/port_graph_current_alg.html
>>
>> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>>
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>> ---
>> v2: used step to calculate remaining as (remaining / step) and avoid
>> calculating gcd when scan_step and range are both even
>> v3: xmas tree formatting and break the gcd() loop once scan_step is 1
>> ---
>> Documentation/networking/ip-sysctl.rst | 9 ++++++
>> .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
>> include/net/netns/ipv4.h | 1 +
>> net/ipv4/inet_hashtables.c | 28 +++++++++++++++++--
>> net/ipv4/sysctl_net_ipv4.c | 7 +++++
>> 5 files changed, 43 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>> index 265158534cda..da29806700e9 100644
>> --- a/Documentation/networking/ip-sysctl.rst
>> +++ b/Documentation/networking/ip-sysctl.rst
>> @@ -1630,6 +1630,15 @@ ip_local_reserved_ports - list of comma separated ranges
>>
>> Default: Empty
>>
>> +ip_local_port_step_width - INTEGER
>> + Defines the numerical maximum increment between successive port
>> + allocations within the ephemeral port range when an unavailable port is
>> + reached. This can be used to mitigate accumulated nodes in port
>> + distribution when reserved ports have been configured. Please note that
>> + port collisions may be more frequent in a system with a very high load.
>> +
>
> Patch SGTM, but I find this documentation obscure.
>
> Some guidance would be nice. What values have you tested/tried ?
>
As I am working on a patch series with improvements to ip-sysctl.rst
documentation I will handle that there.
FTR, I tested multiple scenarios and values. If the value is >= the size
of the whole range, the issue is always mitigated, but this of course
hurts performance under port exhaustion. In my experience, the value
that works best is 2x, 3x or even 4x the size of the largest reserved
block. If only a couple of ports are marked as reserved, 128 is usually
enough to avoid clustering.
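
The rule of thumb above can be written down as a small helper. This is
only a sketch: suggest_step_width() is a hypothetical name, and using
128 as a lower bound for small reserved blocks is an assumption that
merges the two observations above rather than something that was
tested as a single rule.

```c
#include <stdint.h>

/* Assumed floor, taken from the "128 is usually enough" observation. */
#define STEP_WIDTH_FLOOR 128u

/*
 * Hypothetical helper, not part of the patch.
 * Scale the size of the largest reserved block by `multiplier`
 * (2, 3 or 4 in the experiments described above) and never return
 * less than STEP_WIDTH_FLOOR. The result saturates at UINT32_MAX.
 */
static uint32_t suggest_step_width(uint32_t largest_reserved_block,
				   uint32_t multiplier)
{
	uint64_t scaled = (uint64_t)largest_reserved_block * multiplier;

	if (scaled < STEP_WIDTH_FLOOR)
		return STEP_WIDTH_FLOOR;
	if (scaled > UINT32_MAX)
		return UINT32_MAX;
	return (uint32_t)scaled;
}
```

A reserved block of 1000 ports with a multiplier of 3 gives 3000, and a
block of 2 ports with a multiplier of 4 gives the floor of 128. The
resulting number is what would be written to
net.ipv4.ip_local_port_step_width.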
Thank you all for the reviews!
> Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net-next v3] inet: add ip_local_port_step_width sysctl to improve port usage distribution
2026-03-04 9:54 ` Fernando Fernandez Mancera
@ 2026-03-05 1:46 ` Jakub Kicinski
0 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-03-05 1:46 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: Eric Dumazet, netdev, linux-doc, linux-kernel, chia-yu.chang,
idosch, willemb, dsahern, kuniyu, ncardwell, skhan, corbet, horms,
pabeni, davem
On Wed, 4 Mar 2026 10:54:16 +0100 Fernando Fernandez Mancera wrote:
> > Some guidance would be nice. What values have you tested/tried ?
>
> As I am working on a patch series with improvements to ip-sysctl.rst
> documentation I will handle that there.
I'd prefer a respin please.