* [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
@ 2026-02-03 17:54 Fernando Fernandez Mancera
2026-02-03 18:02 ` Fernando Fernandez Mancera
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-03 17:54 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel,
Fernando Fernandez Mancera, Thorsten Toepper
With the current port selection algorithm, ports that follow a reserved
port or a long-lived in-use port are selected more often than others.
This interacts badly with cloud environments that block connections
between the application server and the database server if there was a
previous connection with the same source port, and it leads to
connectivity problems between applications in cloud environments.
A source tuple becomes usable again after being closed for a maximum
segment lifetime of two minutes, while the firewall still tracks it as
existing for 60 minutes or longer. If the port is reused for the same
target tuple before the firewall cleans up, the connection fails due to
firewall interference, which in turn resets the activity timeout in the
firewall's own table. We understand the real issue here is that these
firewalls cannot cope with standards-compliant port reuse, but this is a
workaround for such situations and an improvement to the distribution of
selected ports.
The proposed solution is, instead of incrementing the port number, to
re-select a new random port within the remaining range. This behavior is
controlled by a new sysctl option, "net.ipv4.ip_retry_random_port".
The test run consists of two processes, a client and a server: the
client repeatedly connects to the server, which sends some bytes back.
The results we got are promising:
Executed test: Current algorithm
ephemeral port range: 9000-65499
simulated selections: 10000000
retries during simulation: 14197718
longest retry sequence: 5202
Executed test: Proposed modified algorithm
ephemeral port range: 9000-65499
simulated selections: 10000000
retries during simulation: 3976671
longest retry sequence: 12
In addition, the generated graphs show that the distribution of source
ports is more even with the proposed patch.
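The two retry strategies described above can be compared with a small
userspace sketch. It is not the kernel code: `struct port_space`,
`pick_port()` and the helper names are hypothetical, `rand()` stands in
for get_random_u32_below() (modulo bias is ignored), and the odd/even
parity stepping of the real scan is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of an ephemeral port range with reserved ports. */
struct port_space {
	uint32_t low;          /* first port of the range (inclusive) */
	uint32_t high;         /* one past the last port (exclusive) */
	const bool *reserved;  /* reserved[p - low] is true if port p is reserved */
};

/* Sequential retry: step to the next port, wrapping at the end of the range. */
static uint32_t next_port_sequential(const struct port_space *ps, uint32_t port)
{
	port++;
	if (port >= ps->high)
		port = ps->low;
	return port;
}

/* Random retry: draw a fresh port from the whole range. */
static uint32_t next_port_random(const struct port_space *ps)
{
	uint32_t remaining = ps->high - ps->low;

	return ps->low + (uint32_t)rand() % remaining;
}

/*
 * Return the first non-reserved port found, starting at 'start', or 0 if
 * none was found within 'max_tries' attempts.
 */
static uint32_t pick_port(const struct port_space *ps, uint32_t start,
			  bool random_retry, uint32_t max_tries)
{
	uint32_t port = start;

	for (uint32_t i = 0; i < max_tries; i++) {
		if (!ps->reserved[port - ps->low])
			return port;
		port = random_retry ? next_port_random(ps)
				    : next_port_sequential(ps, port);
	}
	return 0;
}
```

With ports 9000-9004 reserved in a 9000-9009 range, every sequential
scan that starts inside the reserved run ends on 9005, which is the
clustering the commit message describes; the random retry spreads those
starts over 9005-9009 instead.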
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
---
.../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/netns/ipv4.h | 1 +
net/ipv4/inet_hashtables.c | 7 ++++++-
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
4 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index beaf1880a19b..c4041fdca01e 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
u8 sysctl_tcp_ecn_fallback
u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
u8 sysctl_ip_no_pmtu_disc
+u8 sysctl_ip_retry_random_port
u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
u8 sysctl_ip_fwd_update_priority ip_forward
u8 sysctl_ip_nonlocal_bind
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 2dbd46fc4734..d04b07e7c935 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -156,6 +156,7 @@ struct netns_ipv4 {
u8 sysctl_ip_default_ttl;
u8 sysctl_ip_no_pmtu_disc;
+ u8 sysctl_ip_retry_random_port;
u8 sysctl_ip_fwd_update_priority;
u8 sysctl_ip_nonlocal_bind;
u8 sysctl_ip_autobind_reuse;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f5826ec4bcaa..f1c79a7d3fd3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1088,8 +1088,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
for (i = 0; i < remaining; i += step, port += step) {
if (unlikely(port >= high))
port -= remaining;
- if (inet_is_local_reserved_port(net, port))
+ if (inet_is_local_reserved_port(net, port)) {
+ if (net->ipv4.sysctl_ip_retry_random_port) {
+ port = low + get_random_u32_below(remaining);
+ port = ((port & 1) == step) ? port : (port - 1);
+ }
continue;
+ }
head = &hinfo->bhash[inet_bhashfn(net, port,
hinfo->bhash_size)];
rcu_read_lock();
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a1a50a5c80dc..5eade7d9e4a2 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = ipv4_local_port_range,
},
+ {
+ .procname = "ip_retry_random_port",
+ .maxlen = sizeof(u8),
+ .data = &init_net.ipv4.sysctl_ip_retry_random_port,
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ },
{
.procname = "ip_local_reserved_ports",
.data = &init_net.ipv4.sysctl_local_reserved_ports,
--
2.52.0
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-03 17:54 [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries Fernando Fernandez Mancera
@ 2026-02-03 18:02 ` Fernando Fernandez Mancera
2026-02-04 16:25 ` Fernando Fernandez Mancera
2026-02-04 16:49 ` Eric Dumazet
2 siblings, 0 replies; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-03 18:02 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On 2/3/26 6:54 PM, Fernando Fernandez Mancera wrote:
> With the current port selection algorithm, ports after a reserved port
> or long time used port are used more often than others. This combines
> with cloud environments blocking connections between the application
> server and the database server if there was a previous connection with
> the same source port. This leads to connectivity problems between
> applications on cloud environments.
>
> The situation is that a source tuple is usable again after being closed
> for a maximum lifetime segment of two minutes while in the firewall it's
> still noted as existing for 60 minutes or longer. So in case that the
> port is reused for the same target tuple before the firewall cleans up,
> the connection will fail due to firewall interference which itself will
> reset the activity timeout in its own table. We understand the real
> issue here is that these firewalls cannot cope with standards-compliant
> port reuse. But this is a workaround for such situations and an
> improvement on the distribution of ports selected.
>
> The proposed solution is instead of incrementing the port number,
> performing a re-selection of a new random port within the remaining
> range. This solution is configured via sysctl new option
> "net.ipv4.ip_retry_random_port".
>
> The test run consists of two processes, a client and a server, and loops
> connect to the server sending some bytes back. The results we got are
> promising:
>
> Executed test: Current algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 14197718
> longest retry sequence: 5202
>
> Executed test: Proposed modified algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 3976671
> longest retry sequence: 12
>
> In addition, on graphs generated we can observe that the distribution of
> source ports is more even with the proposed patch.
>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
> ---
> .../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/inet_hashtables.c | 7 ++++++-
> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
> 4 files changed, 15 insertions(+), 1 deletion(-)
>
I just noticed I didn't add the following diffs to the patch. Please
keep them in mind, and sorry for the inconvenience.
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index bc9a01606daf..e6ae9400332c 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1610,6 +1610,17 @@ ip_local_reserved_ports - list of comma separated ranges
Default: Empty
+ip_retry_random_port - BOOLEAN
+ Randomize the selection of a new port if a reserved port is hit during
+ automatic port selection instead of incrementing the port number.
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
ip_unprivileged_port_start - INTEGER
This is a per-namespace sysctl. It defines the first
unprivileged port in the network namespace. Privileged ports
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5eade7d9e4a2..32ca260701ba 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -828,6 +828,8 @@ static struct ctl_table ipv4_net_table[] = {
.data = &init_net.ipv4.sysctl_ip_retry_random_port,
.mode = 0644,
.proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
},
{
.procname = "ip_local_reserved_ports",
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-03 17:54 [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries Fernando Fernandez Mancera
2026-02-03 18:02 ` Fernando Fernandez Mancera
@ 2026-02-04 16:25 ` Fernando Fernandez Mancera
2026-02-04 16:49 ` Eric Dumazet
2 siblings, 0 replies; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-04 16:25 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On 2/3/26 6:54 PM, Fernando Fernandez Mancera wrote:
> With the current port selection algorithm, ports after a reserved port
> or long time used port are used more often than others. This combines
> with cloud environments blocking connections between the application
> server and the database server if there was a previous connection with
> the same source port. This leads to connectivity problems between
> applications on cloud environments.
>
> The situation is that a source tuple is usable again after being closed
> for a maximum lifetime segment of two minutes while in the firewall it's
> still noted as existing for 60 minutes or longer. So in case that the
> port is reused for the same target tuple before the firewall cleans up,
> the connection will fail due to firewall interference which itself will
> reset the activity timeout in its own table. We understand the real
> issue here is that these firewalls cannot cope with standards-compliant
> port reuse. But this is a workaround for such situations and an
> improvement on the distribution of ports selected.
>
> The proposed solution is instead of incrementing the port number,
> performing a re-selection of a new random port within the remaining
> range. This solution is configured via sysctl new option
> "net.ipv4.ip_retry_random_port".
>
> The test run consists of two processes, a client and a server, and loops
> connect to the server sending some bytes back. The results we got are
> promising:
>
> Executed test: Current algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 14197718
> longest retry sequence: 5202
>
> Executed test: Proposed modified algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 3976671
> longest retry sequence: 12
>
> In addition, on graphs generated we can observe that the distribution of
> source ports is more even with the proposed patch.
>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
> ---
> .../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/inet_hashtables.c | 7 ++++++-
> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
> 4 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> index beaf1880a19b..c4041fdca01e 100644
> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
> u8 sysctl_tcp_ecn_fallback
> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
> u8 sysctl_ip_no_pmtu_disc
> +u8 sysctl_ip_retry_random_port
> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
> u8 sysctl_ip_fwd_update_priority ip_forward
> u8 sysctl_ip_nonlocal_bind
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 2dbd46fc4734..d04b07e7c935 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -156,6 +156,7 @@ struct netns_ipv4 {
>
> u8 sysctl_ip_default_ttl;
> u8 sysctl_ip_no_pmtu_disc;
> + u8 sysctl_ip_retry_random_port;
> u8 sysctl_ip_fwd_update_priority;
> u8 sysctl_ip_nonlocal_bind;
> u8 sysctl_ip_autobind_reuse;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index f5826ec4bcaa..f1c79a7d3fd3 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1088,8 +1088,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> for (i = 0; i < remaining; i += step, port += step) {
> if (unlikely(port >= high))
> port -= remaining;
> - if (inet_is_local_reserved_port(net, port))
> + if (inet_is_local_reserved_port(net, port)) {
> + if (net->ipv4.sysctl_ip_retry_random_port) {
> + port = low + get_random_u32_below(remaining);
> + port = ((port & 1) == step) ? port : (port - 1);
The AI bot made a good observation
(https://netdev-ai.bots.linux.dev/ai-review.html?id=c1544ebc-4c9d-45c5-bce9-784764102912).
I think the following would be better, as it keeps the random scan
within the same parity when needed.
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f1c79a7d3fd3..c9650079f9e5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1090,8 +1090,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
port -= remaining;
if (inet_is_local_reserved_port(net, port)) {
if (net->ipv4.sysctl_ip_retry_random_port) {
- port = low + get_random_u32_below(remaining);
- port = ((port & 1) == step) ? port : (port - 1);
+ u32 candidate = low + get_random_u32_below(remaining);
+
+ if (step == 2 && (candidate & 1) != (port & 1))
+ candidate++;
+ port = candidate;
}
continue;
}
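The parity adjustment in the diff above can be looked at in isolation
with a userspace sketch. `match_parity()` and `wrap_port()` are
hypothetical helper names: the first copies the condition from the diff,
the second mirrors the `port >= high` wrap at the top of the scan loop.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Give a randomly drawn candidate the same parity as the port currently
 * being scanned. This only matters when the scan visits every second
 * port (step == 2); with step == 1 the candidate is returned unchanged.
 */
static uint32_t match_parity(uint32_t candidate, uint32_t port, uint32_t step)
{
	if (step == 2 && (candidate & 1) != (port & 1))
		candidate++;
	return candidate;
}

/* Wrap a port that ran past the end of [low, high) back into the range. */
static uint32_t wrap_port(uint32_t port, uint32_t low, uint32_t high)
{
	assert(low < high);
	if (port >= high)
		port -= high - low;
	return port;
}
```

Note that the increment can move a candidate drawn at `high - 1` up to
`high`, so in this sketch the result still has to pass through the wrap
before it is used as a port number.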
> + }
> continue;
> + }
> head = &hinfo->bhash[inet_bhashfn(net, port,
> hinfo->bhash_size)];
> rcu_read_lock();
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index a1a50a5c80dc..5eade7d9e4a2 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = {
> .mode = 0644,
> .proc_handler = ipv4_local_port_range,
> },
> + {
> + .procname = "ip_retry_random_port",
> + .maxlen = sizeof(u8),
> + .data = &init_net.ipv4.sysctl_ip_retry_random_port,
> + .mode = 0644,
> + .proc_handler = proc_dou8vec_minmax,
> + },
> {
> .procname = "ip_local_reserved_ports",
> .data = &init_net.ipv4.sysctl_local_reserved_ports,
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-03 17:54 [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries Fernando Fernandez Mancera
2026-02-03 18:02 ` Fernando Fernandez Mancera
2026-02-04 16:25 ` Fernando Fernandez Mancera
@ 2026-02-04 16:49 ` Eric Dumazet
2026-02-04 17:29 ` Fernando Fernandez Mancera
2 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2026-02-04 16:49 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, davem, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On Tue, Feb 3, 2026 at 6:54 PM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
> With the current port selection algorithm, ports after a reserved port
> or long time used port are used more often than others. This combines
> with cloud environments blocking connections between the application
> server and the database server if there was a previous connection with
> the same source port. This leads to connectivity problems between
> applications on cloud environments.
>
> The situation is that a source tuple is usable again after being closed
> for a maximum lifetime segment of two minutes while in the firewall it's
> still noted as existing for 60 minutes or longer. So in case that the
> port is reused for the same target tuple before the firewall cleans up,
> the connection will fail due to firewall interference which itself will
> reset the activity timeout in its own table. We understand the real
> issue here is that these firewalls cannot cope with standards-compliant
> port reuse. But this is a workaround for such situations and an
> improvement on the distribution of ports selected.
>
> The proposed solution is instead of incrementing the port number,
> performing a re-selection of a new random port within the remaining
> range. This solution is configured via sysctl new option
> "net.ipv4.ip_retry_random_port".
>
> The test run consists of two processes, a client and a server, and loops
> connect to the server sending some bytes back. The results we got are
> promising:
>
> Executed test: Current algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 14197718
> longest retry sequence: 5202
>
> Executed test: Proposed modified algorithm
> ephemeral port range: 9000-65499
> simulated selections: 10000000
> retries during simulation: 3976671
> longest retry sequence: 12
>
> In addition, on graphs generated we can observe that the distribution of
> source ports is more even with the proposed patch.
>
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
> ---
> .../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/inet_hashtables.c | 7 ++++++-
> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
> 4 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> index beaf1880a19b..c4041fdca01e 100644
> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
> u8 sysctl_tcp_ecn_fallback
> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
> u8 sysctl_ip_no_pmtu_disc
> +u8 sysctl_ip_retry_random_port
> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
> u8 sysctl_ip_fwd_update_priority ip_forward
> u8 sysctl_ip_nonlocal_bind
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 2dbd46fc4734..d04b07e7c935 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -156,6 +156,7 @@ struct netns_ipv4 {
>
> u8 sysctl_ip_default_ttl;
> u8 sysctl_ip_no_pmtu_disc;
> + u8 sysctl_ip_retry_random_port;
> u8 sysctl_ip_fwd_update_priority;
> u8 sysctl_ip_nonlocal_bind;
> u8 sysctl_ip_autobind_reuse;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index f5826ec4bcaa..f1c79a7d3fd3 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1088,8 +1088,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> for (i = 0; i < remaining; i += step, port += step) {
> if (unlikely(port >= high))
> port -= remaining;
> - if (inet_is_local_reserved_port(net, port))
> + if (inet_is_local_reserved_port(net, port)) {
> + if (net->ipv4.sysctl_ip_retry_random_port) {
> + port = low + get_random_u32_below(remaining);
> + port = ((port & 1) == step) ? port : (port - 1);
> + }
What happens when almost all ephemeral ports are in use, and
hundreds of ports are reserved ?
Choosing a random value each time we meet a reserved port is going to
be quite expensive,
and we might return an error from this function even if there are many
available ports.
Perhaps randomly select @step one time at the beginning of this
function so that @step/2 and @remaining/2
are relatively prime numbers.
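Why a relatively prime step helps can be checked with a small userspace
sketch. It treats the ports of one parity as n slots (n standing for
@remaining/2) and walks them with a fixed stride (standing for @step/2)
for exactly n iterations. The function names are hypothetical and the
code is only an illustration of the arithmetic, not of the kernel loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Greatest common divisor (Euclid's algorithm). */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Start at slot 0 and advance by 'stride' modulo n, n times in total.
 * Returns the number of distinct slots that were visited, or 0 if the
 * bookkeeping array could not be allocated.
 */
static uint32_t distinct_slots_visited(uint32_t n, uint32_t stride)
{
	uint32_t count = 0, slot = 0;
	bool *seen;

	assert(n > 0);
	seen = calloc(n, sizeof(*seen));
	if (!seen)
		return 0;

	for (uint32_t i = 0; i < n; i++) {
		if (!seen[slot]) {
			seen[slot] = true;
			count++;
		}
		slot = (slot + stride) % n;
	}
	free(seen);
	return count;
}
```

The walk visits n / gcd(n, stride) distinct slots: with n = 10, a stride
of 3 reaches all 10 slots, while a stride of 4 keeps cycling through
only 5 of them. A stride that is coprime with n therefore covers every
slot before the loop gives up.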
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-04 16:49 ` Eric Dumazet
@ 2026-02-04 17:29 ` Fernando Fernandez Mancera
2026-02-06 16:27 ` Fernando Fernandez Mancera
0 siblings, 1 reply; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-04 17:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, davem, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On 2/4/26 5:49 PM, Eric Dumazet wrote:
> On Tue, Feb 3, 2026 at 6:54 PM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
>>
>> With the current port selection algorithm, ports after a reserved port
>> or long time used port are used more often than others. This combines
>> with cloud environments blocking connections between the application
>> server and the database server if there was a previous connection with
>> the same source port. This leads to connectivity problems between
>> applications on cloud environments.
>>
>> The situation is that a source tuple is usable again after being closed
>> for a maximum lifetime segment of two minutes while in the firewall it's
>> still noted as existing for 60 minutes or longer. So in case that the
>> port is reused for the same target tuple before the firewall cleans up,
>> the connection will fail due to firewall interference which itself will
>> reset the activity timeout in its own table. We understand the real
>> issue here is that these firewalls cannot cope with standards-compliant
>> port reuse. But this is a workaround for such situations and an
>> improvement on the distribution of ports selected.
>>
>> The proposed solution is instead of incrementing the port number,
>> performing a re-selection of a new random port within the remaining
>> range. This solution is configured via sysctl new option
>> "net.ipv4.ip_retry_random_port".
>>
>> The test run consists of two processes, a client and a server, and loops
>> connect to the server sending some bytes back. The results we got are
>> promising:
>>
>> Executed test: Current algorithm
>> ephemeral port range: 9000-65499
>> simulated selections: 10000000
>> retries during simulation: 14197718
>> longest retry sequence: 5202
>>
>> Executed test: Proposed modified algorithm
>> ephemeral port range: 9000-65499
>> simulated selections: 10000000
>> retries during simulation: 3976671
>> longest retry sequence: 12
>>
>> In addition, on graphs generated we can observe that the distribution of
>> source ports is more even with the proposed patch.
>>
>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>> Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
>> ---
>> .../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
>> include/net/netns/ipv4.h | 1 +
>> net/ipv4/inet_hashtables.c | 7 ++++++-
>> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
>> 4 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>> index beaf1880a19b..c4041fdca01e 100644
>> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
>> u8 sysctl_tcp_ecn_fallback
>> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
>> u8 sysctl_ip_no_pmtu_disc
>> +u8 sysctl_ip_retry_random_port
>> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
>> u8 sysctl_ip_fwd_update_priority ip_forward
>> u8 sysctl_ip_nonlocal_bind
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index 2dbd46fc4734..d04b07e7c935 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -156,6 +156,7 @@ struct netns_ipv4 {
>>
>> u8 sysctl_ip_default_ttl;
>> u8 sysctl_ip_no_pmtu_disc;
>> + u8 sysctl_ip_retry_random_port;
>> u8 sysctl_ip_fwd_update_priority;
>> u8 sysctl_ip_nonlocal_bind;
>> u8 sysctl_ip_autobind_reuse;
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index f5826ec4bcaa..f1c79a7d3fd3 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -1088,8 +1088,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>> for (i = 0; i < remaining; i += step, port += step) {
>> if (unlikely(port >= high))
>> port -= remaining;
>> - if (inet_is_local_reserved_port(net, port))
>> + if (inet_is_local_reserved_port(net, port)) {
>> + if (net->ipv4.sysctl_ip_retry_random_port) {
>> + port = low + get_random_u32_below(remaining);
>> + port = ((port & 1) == step) ? port : (port - 1);
>> + }
>
> What happens when almost all ephemeral ports are in use, and
> hundreds of ports are reserved ?
>
> Choosing a random value each time we meet a reserved port is going to
> be quite expensive,
> and we might return an error from this function even if there are many
> available ports.
>
> Perhaps randomly select @step one time at the beginning of this
> function so that @step/2 and @remaining/2
> are relatively prime numbers.
>
That actually makes sense. It would ensure all ports are visited before
returning an error. Let me test this out.
Thank you Eric,
Fernando.
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-04 17:29 ` Fernando Fernandez Mancera
@ 2026-02-06 16:27 ` Fernando Fernandez Mancera
2026-02-06 17:09 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-06 16:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, davem, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On 2/4/26 6:29 PM, Fernando Fernandez Mancera wrote:
> On 2/4/26 5:49 PM, Eric Dumazet wrote:
>> On Tue, Feb 3, 2026 at 6:54 PM Fernando Fernandez Mancera
>> <fmancera@suse.de> wrote:
>>>
>>> With the current port selection algorithm, ports after a reserved port
>>> or long time used port are used more often than others. This combines
>>> with cloud environments blocking connections between the application
>>> server and the database server if there was a previous connection with
>>> the same source port. This leads to connectivity problems between
>>> applications on cloud environments.
>>>
>>> The situation is that a source tuple is usable again after being closed
>>> for a maximum lifetime segment of two minutes while in the firewall it's
>>> still noted as existing for 60 minutes or longer. So in case that the
>>> port is reused for the same target tuple before the firewall cleans up,
>>> the connection will fail due to firewall interference which itself will
>>> reset the activity timeout in its own table. We understand the real
>>> issue here is that these firewalls cannot cope with standards-compliant
>>> port reuse. But this is a workaround for such situations and an
>>> improvement on the distribution of ports selected.
>>>
>>> The proposed solution is instead of incrementing the port number,
>>> performing a re-selection of a new random port within the remaining
>>> range. This solution is configured via sysctl new option
>>> "net.ipv4.ip_retry_random_port".
>>>
>>> The test run consists of two processes, a client and a server, and loops
>>> connect to the server sending some bytes back. The results we got are
>>> promising:
>>>
>>> Executed test: Current algorithm
>>> ephemeral port range: 9000-65499
>>> simulated selections: 10000000
>>> retries during simulation: 14197718
>>> longest retry sequence: 5202
>>>
>>> Executed test: Proposed modified algorithm
>>> ephemeral port range: 9000-65499
>>> simulated selections: 10000000
>>> retries during simulation: 3976671
>>> longest retry sequence: 12
>>>
>>> In addition, on graphs generated we can observe that the distribution of
>>> source ports is more even with the proposed patch.
>>>
>>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>>> Tested-by: Thorsten Toepper <thorsten.toepper@sap.com>
>>> ---
>>> .../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
>>> include/net/netns/ipv4.h | 1 +
>>> net/ipv4/inet_hashtables.c | 7 ++++++-
>>> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
>>> 4 files changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>> index beaf1880a19b..c4041fdca01e 100644
>>> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
>>> u8 sysctl_tcp_ecn_fallback
>>> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
>>> u8 sysctl_ip_no_pmtu_disc
>>> +u8 sysctl_ip_retry_random_port
>>> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
>>> u8 sysctl_ip_fwd_update_priority ip_forward
>>> u8 sysctl_ip_nonlocal_bind
>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>> index 2dbd46fc4734..d04b07e7c935 100644
>>> --- a/include/net/netns/ipv4.h
>>> +++ b/include/net/netns/ipv4.h
>>> @@ -156,6 +156,7 @@ struct netns_ipv4 {
>>>
>>> u8 sysctl_ip_default_ttl;
>>> u8 sysctl_ip_no_pmtu_disc;
>>> + u8 sysctl_ip_retry_random_port;
>>> u8 sysctl_ip_fwd_update_priority;
>>> u8 sysctl_ip_nonlocal_bind;
>>> u8 sysctl_ip_autobind_reuse;
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index f5826ec4bcaa..f1c79a7d3fd3 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -1088,8 +1088,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>> for (i = 0; i < remaining; i += step, port += step) {
>>> if (unlikely(port >= high))
>>> port -= remaining;
>>> - if (inet_is_local_reserved_port(net, port))
>>> + if (inet_is_local_reserved_port(net, port)) {
>>> + if (net->ipv4.sysctl_ip_retry_random_port) {
>>> + port = low + get_random_u32_below(remaining);
>>> + port = ((port & 1) == step) ? port : (port - 1);
>>> + }
>>
>> What happens when almost all ephemeral ports are in use, and
>> hundreds of ports are reserved ?
>>
>> Choosing a random value each time we meet a reserved port is going to
>> be quite expensive,
>> and we might return an error from this function even if there are many
>> available ports.
>>
>> Perhaps randomly select @step one time at the beginning of this
>> function so that @step/2 and @remaining/2
>> are relatively prime numbers.
>>
>
> That actually makes sense. It would ensure all ports are visited before
> returning an error. Let me test this out.
>
It makes sense. I have tested this approach and we got a more even
distribution of source ports when having thousands of reserved ports. No
difference at all when not using reserved ports.
You can find the distribution graphs for the current algorithm [1] and
for the random step algorithm [2] at the links below.
While I understand that this approach introduces a call to
get_random_u32_below() on every connect, I am wondering if it makes
sense to replace the existing algorithm with this variant. What do you
think?
Please see the implementation below. I plan to send an official v1
once net-next is open. In addition, I am rewriting the commit message, as
I find the current one confusing.
[1] https://0xffsoftware.com/port_graph_current_alg.html
[2] https://0xffsoftware.com/port_graph_random_step_alg.html
Thanks,
Fernando.
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f5826ec4bcaa..10ecad190bae 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -16,6 +16,7 @@
#include <linux/wait.h>
#include <linux/vmalloc.h>
#include <linux/memblock.h>
+#include <linux/gcd.h>
#include <net/addrconf.h>
#include <net/inet_connection_sock.h>
@@ -1046,11 +1047,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct net *net = sock_net(sk);
struct inet_bind2_bucket *tb2;
struct inet_bind_bucket *tb;
+ int step, scan_step, l3mdev;
bool tb_created = false;
u32 remaining, offset;
int ret, i, low, high;
bool local_ports;
- int step, l3mdev;
u32 index;
if (port) {
@@ -1065,6 +1066,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
local_ports = inet_sk_get_local_port_range(sk, &low, &high);
step = local_ports ? 1 : 2;
+ scan_step = step;
high++; /* [32768, 60999] -> [32768, 61000[ */
remaining = high - low;
@@ -1083,9 +1085,20 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
*/
if (!local_ports)
offset &= ~1U;
+ if (net->ipv4.sysctl_ip_retry_random_port) {
+ u32 range = (step == 1) ? remaining : (remaining / 2);
+
+ scan_step = 1 + get_random_u32_below(range - 1);
+ while (gcd(scan_step, range) != 1) {
+ scan_step++;
+ if (unlikely(scan_step >= range))
+ scan_step = 1;
+ }
+ scan_step *= step;
+ }
other_parity_scan:
port = low + offset;
- for (i = 0; i < remaining; i += step, port += step) {
+ for (i = 0; i < remaining; i += step, port += scan_step) {
if (unlikely(port >= high))
port -= remaining;
if (inet_is_local_reserved_port(net, port))
> Thank you Eric,
> Fernando.
>
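For readers who want to try the random-step idea outside the kernel, here is a minimal userspace sketch. It is not the kernel code: gcd_u32(), pick_coprime_stride() and stride_visits_all() are hypothetical helper names, and the early return for ranges of 1 or 2 slots is an assumption added for the sketch. It only illustrates the property discussed above: a stride that is coprime with the range visits every slot exactly once before repeating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Euclid's algorithm; the kernel provides its own gcd() via <linux/gcd.h>. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: map an arbitrary candidate to a stride in
 * [1, range - 1], then bump it until it is coprime with range.
 * The loop always ends because a stride of 1 is coprime with any range.
 */
static uint32_t pick_coprime_stride(uint32_t candidate, uint32_t range)
{
	assert(range >= 1);
	if (range <= 2)		/* assumption: tiny ranges just use stride 1 */
		return 1;

	uint32_t s = 1 + candidate % (range - 1);

	while (gcd_u32(s, range) != 1) {
		if (++s >= range)
			s = 1;
	}
	return s;
}

/*
 * Walk `range` slots from `start` using `stride` and report whether
 * every slot was visited exactly once (false on any revisit).
 */
static bool stride_visits_all(uint32_t start, uint32_t stride, uint32_t range)
{
	assert(range >= 1);

	unsigned char *seen = calloc(range, 1);
	uint32_t pos = start % range;
	bool ok = seen != NULL;

	for (uint32_t i = 0; ok && i < range; i++) {
		if (seen[pos])
			ok = false;	/* slot already visited: not a full cycle */
		seen[pos] = 1;
		pos = (pos + stride) % range;
	}
	free(seen);
	return ok;
}
```

With the default [32768, 61000[ range and even-only scanning there are 14116 slots, so for example pick_coprime_stride(random_value, 14116) followed by stride_visits_all(start, stride, 14116) can be used to check a given stride; a stride sharing a factor with the range only covers part of the slots.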
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-06 16:27 ` Fernando Fernandez Mancera
@ 2026-02-06 17:09 ` Eric Dumazet
2026-02-09 11:56 ` Fernando Fernandez Mancera
0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2026-02-06 17:09 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: netdev, davem, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On Fri, Feb 6, 2026 at 5:28 PM Fernando Fernandez Mancera
<fmancera@suse.de> wrote:
>
>
>
> It makes sense. I have tested this approach and we got a more even
> distribution of source ports when having thousands of reserved ports. No
> difference at all when not using reserved ports.
>
> Please, you can find the distribution graph with the current algorithm
> [1] and with the random step algorithm [2].
>
> While I understand that this approach is introducing a call to
> get_random_u32_below() on every connect, I am wondering if it makes
> sense to replace the existing algorithm with this variant. What do you
> think?
I would ask RFC 6056 experts like Fernando Gont what they think.
Note that if we use random at each connect(), we defeat one of the principles
of ephemeral port selection : try very hard to avoid 4-tuple collision.
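
To make the concern concrete, here is a toy userspace model (an assumption-laden sketch, not how the kernel picks ports): it compares an incrementing counter against a fresh random pick per connect and reports the smallest number of connects between two uses of the same port. min_reuse_gap() and the xorshift32() generator are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic PRNG so runs are reproducible; not the kernel's RNG. */
static uint32_t xorshift32(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return *state = x;
}

/*
 * Simulate `connects` port picks out of `range` ports and return the
 * smallest gap (in connects) between two uses of the same port.
 * use_random == 0: ports are handed out by an incrementing counter.
 * use_random != 0: every connect draws a fresh pseudo-random port.
 * Returns UINT32_MAX if no port was reused, 0 on allocation failure.
 */
static uint32_t min_reuse_gap(uint32_t range, uint32_t connects,
			      int use_random, uint32_t seed)
{
	/* last[p] holds 1 + index of the latest connect that used port p. */
	uint32_t *last = calloc(range, sizeof(*last));
	uint32_t best = UINT32_MAX, next = 0, state = seed ? seed : 1;

	if (!last)
		return 0;

	for (uint32_t i = 0; i < connects; i++) {
		uint32_t port = use_random ? xorshift32(&state) % range
					   : next++ % range;

		if (last[port] && i + 1 - last[port] < best)
			best = i + 1 - last[port];
		last[port] = i + 1;
	}
	free(last);
	return best;
}
```

In this model the counter never reuses a port before all `range` ports have been handed out, so its minimum gap equals `range`. The random variant can hit the same port again after only a few connects, which is the kind of early 4-tuple reuse the sequential scheme tries to avoid.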
>
> Please, notice the implementation below. I plan to send an official v1
> once net-next is open. In addition, I am rewriting the commit message as
> I find the current one confusing.
>
> [1] https://0xffsoftware.com/port_graph_current_alg.html
>
> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-06 17:09 ` Eric Dumazet
@ 2026-02-09 11:56 ` Fernando Fernandez Mancera
2026-02-09 13:53 ` longxie86
0 siblings, 1 reply; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-09 11:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, davem, kuba, pabeni, horms, corbet, ncardwell, kuniyu,
dsahern, idosch, linux-doc, linux-kernel, Thorsten Toepper
On 2/6/26 6:09 PM, Eric Dumazet wrote:
> On Fri, Feb 6, 2026 at 5:28 PM Fernando Fernandez Mancera
> <fmancera@suse.de> wrote:
>>
>>
>>
>> It makes sense. I have tested this approach and we got a more even
>> distribution of source ports when having thousands of reserved ports. No
>> difference at all when not using reserved ports.
>>
>> Please, you can find the distribution graph with the current algorithm
>> [1] and with the random step algorithm [2].
>>
>> While I understand that this approach is introducing a call to
>> get_random_u32_below() on every connect, I am wondering if it makes
>> sense to replace the existing algorithm with this variant. What do you
>> think?
>
> I would ask RFC 6056 experts like Fernando Gont what they think.
>
> Note that if we use random at each connect(), we defeat one of the principles
> of ephemeral port selection : try very hard to avoid 4-tuple collision.
>
Right. I will reach out to him and get his opinion. I have plenty of
time before net-next opens again. I am also collecting some metrics
regarding the 4-tuple collision frequency.
>>
>> Please, notice the implementation below. I plan to send an official v1
>> once net-next is open. In addition, I am rewriting the commit message as
>> I find the current one confusing.
>>
>> [1] https://0xffsoftware.com/port_graph_current_alg.html
>>
>> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-09 11:56 ` Fernando Fernandez Mancera
@ 2026-02-09 13:53 ` longxie86
2026-02-09 15:25 ` Fernando Fernandez Mancera
0 siblings, 1 reply; 10+ messages in thread
From: longxie86 @ 2026-02-09 13:53 UTC (permalink / raw)
To: Fernando Fernandez Mancera
Cc: Eric Dumazet, netdev, davem, kuba, pabeni, horms, corbet,
ncardwell, kuniyu, dsahern, idosch, linux-doc, linux-kernel,
Thorsten Toepper
On Monday, February 9th, 2026 at 12:57 PM, Fernando Fernandez Mancera <fmancera@suse.de> wrote:
>
>
> On 2/6/26 6:09 PM, Eric Dumazet wrote:
>
> > On Fri, Feb 6, 2026 at 5:28 PM Fernando Fernandez Mancera
> > fmancera@suse.de wrote:
> >
> > > It makes sense. I have tested this approach and we got a more even
> > > distribution of source ports when having thousands of reserved ports. No
> > > difference at all when not using reserved ports.
> > >
> > > Please, you can find the distribution graph with the current algorithm
> > > [1] and with the random step algorithm [2].
> > >
> > > While I understand that this approach is introducing a call to
> > > get_random_u32_below() on every connect, I am wondering if it makes
> > > sense to replace the existing algorithm with this variant. What do you
> > > think?
> >
> > I would ask RFC 6056 experts like Fernando Gont what they think.
> >
> > Note that if we use random at each connect(), we defeat one of the principles
> > of ephemeral port selection : try very hard to avoid 4-tuple collision.
>
>
> Right. I will reach out to him and get his opinion. I have plenty of
> time before net-next opens again. I am also collecting some metrics
> regarding the 4-tuple collision frequency.
>
We have had this problem in AWS for a long time. The patch works on our system. What is needed for it to be included in the next Linux release?
Please bring this to the stable versions.
> > > Please, notice the implementation below. I plan to send an official v1
> > > once net-next is open. In addition, I am rewriting the commit message as
> > > I find the current one confusing.
> > >
> > > [1] https://0xffsoftware.com/port_graph_current_alg.html
> > >
> > > [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>
>
* Re: [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries
2026-02-09 13:53 ` longxie86
@ 2026-02-09 15:25 ` Fernando Fernandez Mancera
0 siblings, 0 replies; 10+ messages in thread
From: Fernando Fernandez Mancera @ 2026-02-09 15:25 UTC (permalink / raw)
To: longxie86
Cc: Eric Dumazet, netdev, davem, kuba, pabeni, horms, corbet,
ncardwell, kuniyu, dsahern, idosch, linux-doc, linux-kernel,
Thorsten Toepper
On 2/9/26 2:53 PM, longxie86@protonmail.com wrote:
> On Monday, February 9th, 2026 at 12:57 PM, Fernando Fernandez Mancera <fmancera@suse.de> wrote:
>
>>
>>
>> On 2/6/26 6:09 PM, Eric Dumazet wrote:
>>
>>> On Fri, Feb 6, 2026 at 5:28 PM Fernando Fernandez Mancera
>>> fmancera@suse.de wrote:
>>>
>>>> It makes sense. I have tested this approach and we got a more even
>>>> distribution of source ports when having thousands of reserved ports. No
>>>> difference at all when not using reserved ports.
>>>>
>>>> Please, you can find the distribution graph with the current algorithm
>>>> [1] and with the random step algorithm [2].
>>>>
>>>> While I understand that this approach is introducing a call to
>>>> get_random_u32_below() on every connect, I am wondering if it makes
>>>> sense to replace the existing algorithm with this variant. What do you
>>>> think?
>>>
>>> I would ask RFC 6056 experts like Fernando Gont what they think.
>>>
>>> Note that if we use random at each connect(), we defeat one of the principles
>>> of ephemeral port selection : try very hard to avoid 4-tuple collision.
>>
>>
>> Right. I will reach out to him and get his opinion. I have plenty of
>> time before net-next opens again. I am also collecting some metrics
>> regarding the 4-tuple collision frequency.
>>
>
> We have had this problem in AWS for a long time. The patch works on our system. What is needed for it to be included in the next Linux release?
>
This is just an RFC, so I discourage using it in production for now. An
official v1 will be sent once net-next is open, and it will then need to be
reviewed and approved by the maintainers.
> Please bring this to the stable versions.
>
I don't think that will happen. This is an improvement, not a "fix" by
definition. Anyway, you could ask your vendor/distribution for a
backport.
Thanks,
Fernando.
>>>> Please, notice the implementation below. I plan to send an official v1
>>>> once net-next is open. In addition, I am rewriting the commit message as
>>>> I find the current one confusing.
>>>>
>>>> [1] https://0xffsoftware.com/port_graph_current_alg.html
>>>>
>>>> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>>
>>
Thread overview: 10+ messages
2026-02-03 17:54 [PATCH RFC net-next] inet: add ip_retry_random_port sysctl to reduce sequential port retries Fernando Fernandez Mancera
2026-02-03 18:02 ` Fernando Fernandez Mancera
2026-02-04 16:25 ` Fernando Fernandez Mancera
2026-02-04 16:49 ` Eric Dumazet
2026-02-04 17:29 ` Fernando Fernandez Mancera
2026-02-06 16:27 ` Fernando Fernandez Mancera
2026-02-06 17:09 ` Eric Dumazet
2026-02-09 11:56 ` Fernando Fernandez Mancera
2026-02-09 13:53 ` longxie86
2026-02-09 15:25 ` Fernando Fernandez Mancera