* [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution @ 2026-02-24 15:05 Fernando Fernandez Mancera 2026-02-25 6:28 ` Kuniyuki Iwashima 0 siblings, 1 reply; 5+ messages in thread From: Fernando Fernandez Mancera @ 2026-02-24 15:05 UTC (permalink / raw) To: netdev Cc: linux-kernel, ij, chia-yu.chang, idosch, willemb, dsahern, kuniyu, ncardwell, corbet, horms, pabeni, kuba, edumazet, davem, Fernando Fernandez Mancera With the current port selection algorithm, ports after a reserved port range or long time used port are used more often than others [1]. This causes an uneven port usage distribution. This combines with cloud environments blocking connections between the application server and the database server if there was a previous connection with the same source port, leading to connectivity problems between applications on cloud environments. The real issue here is that these firewalls cannot cope with standards-compliant port reuse. This is a workaround for such situations and an improvement on the distribution of ports selected. The proposed solution is to implement a variant of RFC 6056 Algorithm 5. The step size is selected randomly on every connect() call ensuring it is a coprime with respect to the size of the range of ports we want to scan. This way, we can ensure that all ports within the range are scanned before returning an error. To enable this algorithm, the user must configure the new sysctl option "net.ipv4.ip_local_port_step_width". In addition, on graphs generated we can observe that the distribution of source ports is more even with the proposed approach. 
[2] [1] https://0xffsoftware.com/port_graph_current_alg.html [2] https://0xffsoftware.com/port_graph_random_step_alg.html Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> --- Documentation/networking/ip-sysctl.rst | 9 ++++++++ .../net_cachelines/netns_ipv4_sysctl.rst | 1 + include/net/netns/ipv4.h | 1 + net/ipv4/inet_hashtables.c | 22 ++++++++++++++++--- net/ipv4/sysctl_net_ipv4.c | 7 ++++++ 5 files changed, 37 insertions(+), 3 deletions(-) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 6921d8594b84..9e2625ee778c 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges Default: Empty +ip_local_port_step_width - INTEGER + Defines the numerical maximum increment between successive port + allocations within the ephemeral port range when an unavailable port is + reached. This can be used to mitigate accumulated nodes in port + distribution when reserved ports have been configured. Please note that + port collisions may be more frequent in a system with a very high load. + + Default: 0 (disabled) + ip_unprivileged_port_start - INTEGER This is a per-namespace sysctl. It defines the first unprivileged port in the network namespace. 
Privileged ports diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst index beaf1880a19b..c0e194a6e4ee 100644 --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn u8 sysctl_tcp_ecn_fallback u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl u8 sysctl_ip_no_pmtu_disc +u32 sysctl_ip_local_port_step_width u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu u8 sysctl_ip_fwd_update_priority ip_forward u8 sysctl_ip_nonlocal_bind diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 8e971c7bf164..fb7c2235af21 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -166,6 +166,7 @@ struct netns_ipv4 { u8 sysctl_ip_autobind_reuse; /* Shall we try to damage output packets if routing dev changes? */ u8 sysctl_ip_dynaddr; + u32 sysctl_ip_local_port_step_width; #ifdef CONFIG_NET_L3_MASTER_DEV u8 sysctl_raw_l3mdev_accept; #endif diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index f5826ec4bcaa..1992dc21818f 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -16,6 +16,7 @@ #include <linux/wait.h> #include <linux/vmalloc.h> #include <linux/memblock.h> +#include <linux/gcd.h> #include <net/addrconf.h> #include <net/inet_connection_sock.h> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, struct net *net = sock_net(sk); struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; + int step, scan_step, l3mdev; + u32 index, max_rand_step; bool tb_created = false; u32 remaining, offset; int ret, i, low, high; bool local_ports; - int step, l3mdev; - u32 index; if (port) { local_bh_disable(); @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, local_ports = inet_sk_get_local_port_range(sk, &low, 
&high); step = local_ports ? 1 : 2; + scan_step = step; + max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width); high++; /* [32768, 60999] -> [32768, 61000[ */ remaining = high - low; @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, */ if (!local_ports) offset &= ~1U; + + if (max_rand_step && remaining > 1) { + u32 range = (step == 1) ? remaining : (remaining / 2); + u32 upper_bound = min(range, max_rand_step); + + scan_step = get_random_u32_inclusive(1, upper_bound); + while (gcd(scan_step, range) != 1) { + scan_step++; + if (unlikely(scan_step > upper_bound)) + scan_step = 1; + } + scan_step *= step; + } other_parity_scan: port = low + offset; - for (i = 0; i < remaining; i += step, port += step) { + for (i = 0; i < remaining; i += step, port += scan_step) { if (unlikely(port >= high)) port -= remaining; if (inet_is_local_reserved_port(net, port)) diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 643763bc2142..c533374f656c 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = { .mode = 0644, .proc_handler = ipv4_local_port_range, }, + { + .procname = "ip_local_port_step_width", + .maxlen = sizeof(u32), + .data = &init_net.ipv4.sysctl_ip_local_port_step_width, + .mode = 0644, + .proc_handler = proc_douintvec, + }, { .procname = "ip_local_reserved_ports", .data = &init_net.ipv4.sysctl_local_reserved_ports, -- 2.53.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
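The coverage claim in the commit message (a step that is coprime with the size of the range visits every port before the scan gives up) can be checked outside the kernel. What follows is a minimal userspace sketch, not the patch itself: gcd_u32(), pick_coprime_step() and scan_visits_all() are hypothetical helper names, and the random draw that the patch takes from get_random_u32_inclusive() is replaced by a caller-supplied candidate so the result is reproducible.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Euclid's algorithm; userspace stand-in for the kernel's gcd(). */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Starting from a candidate in [1, upper_bound], return the next value
 * that is coprime with range, wrapping to 1 once upper_bound is passed.
 * The loop terminates because gcd(1, range) == 1.
 */
static uint32_t pick_coprime_step(uint32_t candidate, uint32_t range,
				  uint32_t upper_bound)
{
	while (gcd_u32(candidate, range) != 1) {
		candidate++;
		if (candidate > upper_bound)
			candidate = 1;
	}
	return candidate;
}

/*
 * Walk [low, low + range) from low + offset, advancing by scan_step and
 * wrapping by range on overflow (assumes 1 <= scan_step <= range).
 * Returns true if range iterations visit every port exactly once.
 */
static bool scan_visits_all(uint32_t low, uint32_t range, uint32_t offset,
			    uint32_t scan_step)
{
	uint32_t high = low + range;
	uint32_t port = low + offset % range;
	bool *seen = calloc(range, sizeof(*seen));
	bool ok = true;
	uint32_t i;

	if (!seen)
		return false;

	for (i = 0; i < range; i++, port += scan_step) {
		if (port >= high)
			port -= range;
		if (seen[port - low])
			ok = false;	/* a port was visited twice */
		seen[port - low] = true;
	}
	free(seen);
	return ok;
}
```

With the default range [32768, 61000[ (28232 ports), pick_coprime_step(6, 28232, 64) moves on to 7, and scan_visits_all(32768, 28232, 1234, 7) holds. A step of 6 shares the factor 2 with the range, so the same walk starts repeating ports after 14116 iterations and never reaches the rest.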
* Re: [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution 2026-02-24 15:05 [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution Fernando Fernandez Mancera @ 2026-02-25 6:28 ` Kuniyuki Iwashima 2026-02-25 10:02 ` Fernando Fernandez Mancera 0 siblings, 1 reply; 5+ messages in thread From: Kuniyuki Iwashima @ 2026-02-25 6:28 UTC (permalink / raw) To: Fernando Fernandez Mancera Cc: netdev, linux-kernel, ij, chia-yu.chang, idosch, willemb, dsahern, ncardwell, corbet, horms, pabeni, kuba, edumazet, davem On Tue, Feb 24, 2026 at 7:05 AM Fernando Fernandez Mancera <fmancera@suse.de> wrote: > > With the current port selection algorithm, ports after a reserved port > range or long time used port are used more often than others [1]. This > causes an uneven port usage distribution. This combines with cloud > environments blocking connections between the application server and the > database server if there was a previous connection with the same source > port, leading to connectivity problems between applications on cloud > environments. > > The real issue here is that these firewalls cannot cope with > standards-compliant port reuse. This is a workaround for such situations > and an improvement on the distribution of ports selected. > > The proposed solution is to implement a variant of RFC 6056 Algorithm 5. > The step size is selected randomly on every connect() call ensuring it > is a coprime with respect to the size of the range of ports we want to > scan. This way, we can ensure that all ports within the range are > scanned before returning an error. To enable this algorithm, the user > must configure the new sysctl option "net.ipv4.ip_local_port_step_width". > > In addition, on graphs generated we can observe that the distribution of > source ports is more even with the proposed approach. 
[2] > > [1] https://0xffsoftware.com/port_graph_current_alg.html > > [2] https://0xffsoftware.com/port_graph_random_step_alg.html > > Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> > --- > Documentation/networking/ip-sysctl.rst | 9 ++++++++ > .../net_cachelines/netns_ipv4_sysctl.rst | 1 + > include/net/netns/ipv4.h | 1 + > net/ipv4/inet_hashtables.c | 22 ++++++++++++++++--- > net/ipv4/sysctl_net_ipv4.c | 7 ++++++ > 5 files changed, 37 insertions(+), 3 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > index 6921d8594b84..9e2625ee778c 100644 > --- a/Documentation/networking/ip-sysctl.rst > +++ b/Documentation/networking/ip-sysctl.rst > @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges > > Default: Empty > > +ip_local_port_step_width - INTEGER > + Defines the numerical maximum increment between successive port > + allocations within the ephemeral port range when an unavailable port is > + reached. This can be used to mitigate accumulated nodes in port > + distribution when reserved ports have been configured. Please note that > + port collisions may be more frequent in a system with a very high load. > + > + Default: 0 (disabled) > + > ip_unprivileged_port_start - INTEGER > This is a per-namespace sysctl. It defines the first > unprivileged port in the network namespace. 
Privileged ports > diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > index beaf1880a19b..c0e194a6e4ee 100644 > --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn > u8 sysctl_tcp_ecn_fallback > u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl > u8 sysctl_ip_no_pmtu_disc > +u32 sysctl_ip_local_port_step_width > u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu > u8 sysctl_ip_fwd_update_priority ip_forward > u8 sysctl_ip_nonlocal_bind > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index 8e971c7bf164..fb7c2235af21 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -166,6 +166,7 @@ struct netns_ipv4 { > u8 sysctl_ip_autobind_reuse; > /* Shall we try to damage output packets if routing dev changes? 
*/
>         u8 sysctl_ip_dynaddr;
> +       u32 sysctl_ip_local_port_step_width;
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>         u8 sysctl_raw_l3mdev_accept;
>  #endif
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index f5826ec4bcaa..1992dc21818f 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -16,6 +16,7 @@
>  #include <linux/wait.h>
>  #include <linux/vmalloc.h>
>  #include <linux/memblock.h>
> +#include <linux/gcd.h>
>
>  #include <net/addrconf.h>
>  #include <net/inet_connection_sock.h>
> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>         struct net *net = sock_net(sk);
>         struct inet_bind2_bucket *tb2;
>         struct inet_bind_bucket *tb;
> +       int step, scan_step, l3mdev;
> +       u32 index, max_rand_step;
>         bool tb_created = false;
>         u32 remaining, offset;
>         int ret, i, low, high;
>         bool local_ports;
> -       int step, l3mdev;
> -       u32 index;
>
>         if (port) {
>                 local_bh_disable();
> @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>
>         local_ports = inet_sk_get_local_port_range(sk, &low, &high);
>         step = local_ports ? 1 : 2;
> +       scan_step = step;
> +       max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width);
>
>         high++; /* [32768, 60999] -> [32768, 61000[ */
>         remaining = high - low;
> @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>          */
>         if (!local_ports)
>                 offset &= ~1U;
> +
> +       if (max_rand_step && remaining > 1) {
> +               u32 range = (step == 1) ? remaining : (remaining / 2);
> +               u32 upper_bound = min(range, max_rand_step);
> +
> +               scan_step = get_random_u32_inclusive(1, upper_bound);
> +               while (gcd(scan_step, range) != 1) {
> +                       scan_step++;

If both scan_step and range are even, an extra
increment here saves 1/2 calls of gcd().

> +                       if (unlikely(scan_step > upper_bound))
> +                               scan_step = 1;
> +               }
> +               scan_step *= step;
> +       }
>  other_parity_scan:

Doing "other_parity_scan" will be just redundant
unless scan_step is 2 ?

> port = low + offset; > - for (i = 0; i < remaining; i += step, port += step) { > + for (i = 0; i < remaining; i += step, port += scan_step) { > if (unlikely(port >= high)) > port -= remaining; > if (inet_is_local_reserved_port(net, port)) > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 643763bc2142..c533374f656c 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = { > .mode = 0644, > .proc_handler = ipv4_local_port_range, > }, > + { > + .procname = "ip_local_port_step_width", > + .maxlen = sizeof(u32), > + .data = &init_net.ipv4.sysctl_ip_local_port_step_width, > + .mode = 0644, > + .proc_handler = proc_douintvec, > + }, > { > .procname = "ip_local_reserved_ports", > .data = &init_net.ipv4.sysctl_local_reserved_ports, > -- > 2.53.0 > ^ permalink raw reply [flat|nested] 5+ messages in thread
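The review remark about skipping a candidate when both scan_step and range are even can be illustrated with a small userspace sketch. This is only one possible reading of the suggestion, not code from the thread: counted_gcd(), next_coprime_plain() and next_coprime_skip_even() are hypothetical names, and the call counter exists purely to compare the two loops.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t gcd_calls;	/* number of gcd evaluations, for comparison */

/* Euclid's algorithm with a call counter (userspace stand-in for gcd()). */
static uint32_t counted_gcd(uint32_t a, uint32_t b)
{
	gcd_calls++;
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Baseline, shaped like the loop in the patch: test every candidate. */
static uint32_t next_coprime_plain(uint32_t s, uint32_t range, uint32_t upper)
{
	while (counted_gcd(s, range) != 1) {
		s++;
		if (s > upper)
			s = 1;
	}
	return s;
}

/*
 * Variant: when range is even, an even candidate shares the factor 2
 * with it and cannot be coprime, so it is skipped without calling gcd.
 * Assumes 1 <= s <= upper; terminates at the latest when s wraps to 1.
 */
static uint32_t next_coprime_skip_even(uint32_t s, uint32_t range,
				       uint32_t upper)
{
	bool range_even = !(range & 1);

	for (;;) {
		bool both_even = range_even && !(s & 1);

		if (!both_even && counted_gcd(s, range) == 1)
			return s;
		s++;
		if (s > upper)
			s = 1;
	}
}
```

For range 30 and a starting candidate of 2, both functions settle on 7, but the plain loop evaluates gcd six times and the skipping variant three times. Both always return the same step, because the skipped candidates could never have been accepted.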
* Re: [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution 2026-02-25 6:28 ` Kuniyuki Iwashima @ 2026-02-25 10:02 ` Fernando Fernandez Mancera 2026-02-25 17:33 ` Kuniyuki Iwashima 0 siblings, 1 reply; 5+ messages in thread From: Fernando Fernandez Mancera @ 2026-02-25 10:02 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: netdev, linux-kernel, ij, chia-yu.chang, idosch, willemb, dsahern, ncardwell, corbet, horms, pabeni, kuba, edumazet, davem On 2/25/26 7:28 AM, Kuniyuki Iwashima wrote: > On Tue, Feb 24, 2026 at 7:05 AM Fernando Fernandez Mancera > <fmancera@suse.de> wrote: >> >> With the current port selection algorithm, ports after a reserved port >> range or long time used port are used more often than others [1]. This >> causes an uneven port usage distribution. This combines with cloud >> environments blocking connections between the application server and the >> database server if there was a previous connection with the same source >> port, leading to connectivity problems between applications on cloud >> environments. >> >> The real issue here is that these firewalls cannot cope with >> standards-compliant port reuse. This is a workaround for such situations >> and an improvement on the distribution of ports selected. >> >> The proposed solution is to implement a variant of RFC 6056 Algorithm 5. >> The step size is selected randomly on every connect() call ensuring it >> is a coprime with respect to the size of the range of ports we want to >> scan. This way, we can ensure that all ports within the range are >> scanned before returning an error. To enable this algorithm, the user >> must configure the new sysctl option "net.ipv4.ip_local_port_step_width". >> >> In addition, on graphs generated we can observe that the distribution of >> source ports is more even with the proposed approach. 
[2] >> >> [1] https://0xffsoftware.com/port_graph_current_alg.html >> >> [2] https://0xffsoftware.com/port_graph_random_step_alg.html >> >> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> >> --- >> Documentation/networking/ip-sysctl.rst | 9 ++++++++ >> .../net_cachelines/netns_ipv4_sysctl.rst | 1 + >> include/net/netns/ipv4.h | 1 + >> net/ipv4/inet_hashtables.c | 22 ++++++++++++++++--- >> net/ipv4/sysctl_net_ipv4.c | 7 ++++++ >> 5 files changed, 37 insertions(+), 3 deletions(-) >> >> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst >> index 6921d8594b84..9e2625ee778c 100644 >> --- a/Documentation/networking/ip-sysctl.rst >> +++ b/Documentation/networking/ip-sysctl.rst >> @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges >> >> Default: Empty >> >> +ip_local_port_step_width - INTEGER >> + Defines the numerical maximum increment between successive port >> + allocations within the ephemeral port range when an unavailable port is >> + reached. This can be used to mitigate accumulated nodes in port >> + distribution when reserved ports have been configured. Please note that >> + port collisions may be more frequent in a system with a very high load. >> + >> + Default: 0 (disabled) >> + >> ip_unprivileged_port_start - INTEGER >> This is a per-namespace sysctl. It defines the first >> unprivileged port in the network namespace. 
Privileged ports >> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst >> index beaf1880a19b..c0e194a6e4ee 100644 >> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst >> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst >> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn >> u8 sysctl_tcp_ecn_fallback >> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl >> u8 sysctl_ip_no_pmtu_disc >> +u32 sysctl_ip_local_port_step_width >> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu >> u8 sysctl_ip_fwd_update_priority ip_forward >> u8 sysctl_ip_nonlocal_bind >> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h >> index 8e971c7bf164..fb7c2235af21 100644 >> --- a/include/net/netns/ipv4.h >> +++ b/include/net/netns/ipv4.h >> @@ -166,6 +166,7 @@ struct netns_ipv4 { >> u8 sysctl_ip_autobind_reuse; >> /* Shall we try to damage output packets if routing dev changes? 
*/ >> u8 sysctl_ip_dynaddr; >> + u32 sysctl_ip_local_port_step_width; >> #ifdef CONFIG_NET_L3_MASTER_DEV >> u8 sysctl_raw_l3mdev_accept; >> #endif >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c >> index f5826ec4bcaa..1992dc21818f 100644 >> --- a/net/ipv4/inet_hashtables.c >> +++ b/net/ipv4/inet_hashtables.c >> @@ -16,6 +16,7 @@ >> #include <linux/wait.h> >> #include <linux/vmalloc.h> >> #include <linux/memblock.h> >> +#include <linux/gcd.h> >> >> #include <net/addrconf.h> >> #include <net/inet_connection_sock.h> >> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, >> struct net *net = sock_net(sk); >> struct inet_bind2_bucket *tb2; >> struct inet_bind_bucket *tb; >> + int step, scan_step, l3mdev; >> + u32 index, max_rand_step; >> bool tb_created = false; >> u32 remaining, offset; >> int ret, i, low, high; >> bool local_ports; >> - int step, l3mdev; >> - u32 index; >> >> if (port) { >> local_bh_disable(); >> @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, >> >> local_ports = inet_sk_get_local_port_range(sk, &low, &high); >> step = local_ports ? 1 : 2; >> + scan_step = step; >> + max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width); >> >> high++; /* [32768, 60999] -> [32768, 61000[ */ >> remaining = high - low; >> @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, >> */ >> if (!local_ports) >> offset &= ~1U; >> + >> + if (max_rand_step && remaining > 1) { >> + u32 range = (step == 1) ? remaining : (remaining / 2); >> + u32 upper_bound = min(range, max_rand_step); >> + >> + scan_step = get_random_u32_inclusive(1, upper_bound); >> + while (gcd(scan_step, range) != 1) { >> + scan_step++; > > If both scan_step and range are even, an extra > increment here saves 1/2 calls of gcd(). > Ah right, thanks! 
>
>> +                       if (unlikely(scan_step > upper_bound))
>> +                               scan_step = 1;
>> +               }
>> +               scan_step *= step;
>> +       }
>>  other_parity_scan:
>
> Doing "other_parity_scan" will be just redundant
> unless scan_step is 2 ?
>

I have tried to preserve the parity behavior. Maybe I missed something,
let me explain why it isn't redundant in my opinion.

In essence, when calculating the range we first look at "step". If step
== 1 we use all the remaining ports as range, otherwise we use remaining/2.

If step == 1 we do not care about parity so let's look at step == 2.

If step == 2, we calculate a scan_step that is coprime with remaining/2.
Once we have it, we multiply it by 2 so we make sure scan_step is even.

Then it works exactly like with the current algorithm, we look for the
even ports first, every time we reach the high we subtract the size of
the range (remaining is actually a bad name IMHO) and continue. When i
>= remaining (keep in mind that i is increased by step, that is by 2 on
each iteration), we start again for the odd numbers.

The key piece is that scan_step/2 is coprime with remaining/2. As long
as that holds, we should visit first all the even numbers and then the
odd ones.

Thanks,
Fernando.

> >> port = low + offset; >> - for (i = 0; i < remaining; i += step, port += step) { >> + for (i = 0; i < remaining; i += step, port += scan_step) { >> if (unlikely(port >= high)) >> port -= remaining; >> if (inet_is_local_reserved_port(net, port)) >> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c >> index 643763bc2142..c533374f656c 100644 >> --- a/net/ipv4/sysctl_net_ipv4.c >> +++ b/net/ipv4/sysctl_net_ipv4.c >> @@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = { >> .mode = 0644, >> .proc_handler = ipv4_local_port_range, >> }, >> + { >> + .procname = "ip_local_port_step_width", >> + .maxlen = sizeof(u32), >> + .data = &init_net.ipv4.sysctl_ip_local_port_step_width, >> + .mode = 0644, >> + .proc_handler = proc_douintvec, >> + }, >> { >> .procname = "ip_local_reserved_ports", >> .data = &init_net.ipv4.sysctl_local_reserved_ports, >> -- >> 2.53.0 >> ^ permalink raw reply [flat|nested] 5+ messages in thread
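The parity argument in this reply can be replayed in userspace. The sketch below is an illustration under stated assumptions rather than the kernel loop: parity_scan() and evens_then_odds() are hypothetical names, low and remaining are assumed even, offset is assumed smaller than remaining, and the bind-bucket checks of __inet_hash_connect() are left out so that only the visiting order remains.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Record the ports visited by a two-pass scan over [low, low + remaining).
 * The first pass starts at an even offset, the second one at the next
 * (odd) offset. scan_step must be even and <= remaining.
 * Returns the number of ports written to order[].
 */
static uint32_t parity_scan(uint32_t low, uint32_t remaining, uint32_t offset,
			    uint32_t scan_step, uint32_t *order)
{
	const uint32_t step = 2;
	uint32_t high = low + remaining;
	uint32_t n = 0;
	uint32_t port, i;
	int pass;

	offset &= ~1U;	/* first pass uses the parity of low */
	for (pass = 0; pass < 2; pass++, offset++) {
		port = low + offset;
		/* i advances by step, so each pass runs remaining / 2 times */
		for (i = 0; i < remaining; i += step, port += scan_step) {
			if (port >= high)
				port -= remaining;
			order[n++] = port;
		}
	}
	return n;
}

/*
 * True if order[] holds every port of the range exactly once, with all
 * ports of low's parity in the first half and the others in the second.
 */
static bool evens_then_odds(const uint32_t *order, uint32_t n,
			    uint32_t low, uint32_t remaining)
{
	bool *seen = calloc(remaining, sizeof(*seen));
	bool ok = (n == remaining);
	uint32_t i;

	if (!seen)
		return false;
	for (i = 0; ok && i < n; i++) {
		uint32_t idx = order[i] - low;
		bool first_half = i < n / 2;

		if (idx >= remaining || seen[idx] ||
		    ((idx & 1) == 0) != first_half)
			ok = false;
		else
			seen[idx] = true;
	}
	free(seen);
	return ok;
}
```

With low = 32768 and remaining = 12, a scan_step of 10 (5 is coprime with 6) yields the six even ports followed by the six odd ones. A scan_step of 6 (3 shares a factor with 6) revisits port 32772 on the third iteration, so evens_then_odds() reports a failure.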
* Re: [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution 2026-02-25 10:02 ` Fernando Fernandez Mancera @ 2026-02-25 17:33 ` Kuniyuki Iwashima 2026-02-26 10:39 ` Fernando Fernandez Mancera 0 siblings, 1 reply; 5+ messages in thread From: Kuniyuki Iwashima @ 2026-02-25 17:33 UTC (permalink / raw) To: Fernando Fernandez Mancera Cc: netdev, linux-kernel, ij, chia-yu.chang, idosch, willemb, dsahern, ncardwell, corbet, horms, pabeni, kuba, edumazet, davem On Wed, Feb 25, 2026 at 2:03 AM Fernando Fernandez Mancera <fmancera@suse.de> wrote: > > On 2/25/26 7:28 AM, Kuniyuki Iwashima wrote: > > On Tue, Feb 24, 2026 at 7:05 AM Fernando Fernandez Mancera > > <fmancera@suse.de> wrote: > >> > >> With the current port selection algorithm, ports after a reserved port > >> range or long time used port are used more often than others [1]. This > >> causes an uneven port usage distribution. This combines with cloud > >> environments blocking connections between the application server and the > >> database server if there was a previous connection with the same source > >> port, leading to connectivity problems between applications on cloud > >> environments. > >> > >> The real issue here is that these firewalls cannot cope with > >> standards-compliant port reuse. This is a workaround for such situations > >> and an improvement on the distribution of ports selected. > >> > >> The proposed solution is to implement a variant of RFC 6056 Algorithm 5. > >> The step size is selected randomly on every connect() call ensuring it > >> is a coprime with respect to the size of the range of ports we want to > >> scan. This way, we can ensure that all ports within the range are > >> scanned before returning an error. To enable this algorithm, the user > >> must configure the new sysctl option "net.ipv4.ip_local_port_step_width". 
> >> > >> In addition, on graphs generated we can observe that the distribution of > >> source ports is more even with the proposed approach. [2] > >> > >> [1] https://0xffsoftware.com/port_graph_current_alg.html > >> > >> [2] https://0xffsoftware.com/port_graph_random_step_alg.html > >> > >> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> > >> --- > >> Documentation/networking/ip-sysctl.rst | 9 ++++++++ > >> .../net_cachelines/netns_ipv4_sysctl.rst | 1 + > >> include/net/netns/ipv4.h | 1 + > >> net/ipv4/inet_hashtables.c | 22 ++++++++++++++++--- > >> net/ipv4/sysctl_net_ipv4.c | 7 ++++++ > >> 5 files changed, 37 insertions(+), 3 deletions(-) > >> > >> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > >> index 6921d8594b84..9e2625ee778c 100644 > >> --- a/Documentation/networking/ip-sysctl.rst > >> +++ b/Documentation/networking/ip-sysctl.rst > >> @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges > >> > >> Default: Empty > >> > >> +ip_local_port_step_width - INTEGER > >> + Defines the numerical maximum increment between successive port > >> + allocations within the ephemeral port range when an unavailable port is > >> + reached. This can be used to mitigate accumulated nodes in port > >> + distribution when reserved ports have been configured. Please note that > >> + port collisions may be more frequent in a system with a very high load. > >> + > >> + Default: 0 (disabled) > >> + > >> ip_unprivileged_port_start - INTEGER > >> This is a per-namespace sysctl. It defines the first > >> unprivileged port in the network namespace. 
Privileged ports > >> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > >> index beaf1880a19b..c0e194a6e4ee 100644 > >> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > >> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst > >> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn > >> u8 sysctl_tcp_ecn_fallback > >> u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl > >> u8 sysctl_ip_no_pmtu_disc > >> +u32 sysctl_ip_local_port_step_width > >> u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu > >> u8 sysctl_ip_fwd_update_priority ip_forward > >> u8 sysctl_ip_nonlocal_bind > >> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > >> index 8e971c7bf164..fb7c2235af21 100644 > >> --- a/include/net/netns/ipv4.h > >> +++ b/include/net/netns/ipv4.h > >> @@ -166,6 +166,7 @@ struct netns_ipv4 { > >> u8 sysctl_ip_autobind_reuse; > >> /* Shall we try to damage output packets if routing dev changes? 
*/ > >> u8 sysctl_ip_dynaddr; > >> + u32 sysctl_ip_local_port_step_width; > >> #ifdef CONFIG_NET_L3_MASTER_DEV > >> u8 sysctl_raw_l3mdev_accept; > >> #endif > >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c > >> index f5826ec4bcaa..1992dc21818f 100644 > >> --- a/net/ipv4/inet_hashtables.c > >> +++ b/net/ipv4/inet_hashtables.c > >> @@ -16,6 +16,7 @@ > >> #include <linux/wait.h> > >> #include <linux/vmalloc.h> > >> #include <linux/memblock.h> > >> +#include <linux/gcd.h> > >> > >> #include <net/addrconf.h> > >> #include <net/inet_connection_sock.h> > >> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, > >> struct net *net = sock_net(sk); > >> struct inet_bind2_bucket *tb2; > >> struct inet_bind_bucket *tb; > >> + int step, scan_step, l3mdev; > >> + u32 index, max_rand_step; > >> bool tb_created = false; > >> u32 remaining, offset; > >> int ret, i, low, high; > >> bool local_ports; > >> - int step, l3mdev; > >> - u32 index; > >> > >> if (port) { > >> local_bh_disable(); > >> @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, > >> > >> local_ports = inet_sk_get_local_port_range(sk, &low, &high); > >> step = local_ports ? 1 : 2; > >> + scan_step = step; > >> + max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width); > >> > >> high++; /* [32768, 60999] -> [32768, 61000[ */ > >> remaining = high - low; > >> @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, > >> */ > >> if (!local_ports) > >> offset &= ~1U; > >> + > >> + if (max_rand_step && remaining > 1) { > >> + u32 range = (step == 1) ? remaining : (remaining / 2); > >> + u32 upper_bound = min(range, max_rand_step); > >> + > >> + scan_step = get_random_u32_inclusive(1, upper_bound); > >> + while (gcd(scan_step, range) != 1) { > >> + scan_step++; > > > > If both scan_step and range are even, an extra > > increment here saves 1/2 calls of gcd(). 
> >
>
> Ah right, thanks!
>
> >
> >> +                       if (unlikely(scan_step > upper_bound))
> >> +                               scan_step = 1;
> >> +               }
> >> +               scan_step *= step;
> >> +       }
> >>  other_parity_scan:
> >
> > Doing "other_parity_scan" will be just redundant
> > unless scan_step is 2 ?
> >
>
> I have tried to preserve the parity behavior. Maybe I missed something,
> let me explain why it isn't redundant in my opinion.
>
> In essence, when calculating the range we first look at "step". If step
> == 1 we use all the remaining ports as range, otherwise we use remaining/2.
>
> If step == 1 we do not care about parity so let's look at step == 2.
>
> If step == 2, we calculate a scan_step that is coprime with remaining/2.
> Once we have it, we multiply it by 2 so we make sure scan_step is even.

Ah, I missed scan_step *= step. Then looks good.

Maybe we can set range = remaining / step similarly.

Thanks!

>
> Then it works exactly like with the current algorithm, we look for the
> even ports first, every time we reach the high we subtract the size of
> the range (remaining is actually a bad name IMHO) and continue. When i
> >= remaining (keep in mind that i is increased by step, that is by 2 on
> each iteration), we start again for the odd numbers.
>
> The key piece is that scan_step/2 is coprime with remaining/2. As long
> as that holds, we should visit first all the even numbers and then the
> odd ones.
>
> Thanks,
> Fernando.
> > > > >> port = low + offset; > >> - for (i = 0; i < remaining; i += step, port += step) { > >> + for (i = 0; i < remaining; i += step, port += scan_step) { > >> if (unlikely(port >= high)) > >> port -= remaining; > >> if (inet_is_local_reserved_port(net, port)) > >> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > >> index 643763bc2142..c533374f656c 100644 > >> --- a/net/ipv4/sysctl_net_ipv4.c > >> +++ b/net/ipv4/sysctl_net_ipv4.c > >> @@ -822,6 +822,13 @@ static struct ctl_table ipv4_net_table[] = { > >> .mode = 0644, > >> .proc_handler = ipv4_local_port_range, > >> }, > >> + { > >> + .procname = "ip_local_port_step_width", > >> + .maxlen = sizeof(u32), > >> + .data = &init_net.ipv4.sysctl_ip_local_port_step_width, > >> + .mode = 0644, > >> + .proc_handler = proc_douintvec, > >> + }, > >> { > >> .procname = "ip_local_reserved_ports", > >> .data = &init_net.ipv4.sysctl_local_reserved_ports, > >> -- > >> 2.53.0 > >> > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution 2026-02-25 17:33 ` Kuniyuki Iwashima @ 2026-02-26 10:39 ` Fernando Fernandez Mancera 0 siblings, 0 replies; 5+ messages in thread From: Fernando Fernandez Mancera @ 2026-02-26 10:39 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: netdev, linux-kernel, ij, chia-yu.chang, idosch, willemb, dsahern, ncardwell, corbet, horms, pabeni, kuba, edumazet, davem On 2/25/26 6:33 PM, Kuniyuki Iwashima wrote: > On Wed, Feb 25, 2026 at 2:03 AM Fernando Fernandez Mancera > <fmancera@suse.de> wrote: >> >> On 2/25/26 7:28 AM, Kuniyuki Iwashima wrote: >>> On Tue, Feb 24, 2026 at 7:05 AM Fernando Fernandez Mancera >>> <fmancera@suse.de> wrote: >>>> >>>> With the current port selection algorithm, ports after a reserved port >>>> range or long time used port are used more often than others [1]. This >>>> causes an uneven port usage distribution. This combines with cloud >>>> environments blocking connections between the application server and the >>>> database server if there was a previous connection with the same source >>>> port, leading to connectivity problems between applications on cloud >>>> environments. >>>> >>>> The real issue here is that these firewalls cannot cope with >>>> standards-compliant port reuse. This is a workaround for such situations >>>> and an improvement on the distribution of ports selected. >>>> >>>> The proposed solution is to implement a variant of RFC 6056 Algorithm 5. >>>> The step size is selected randomly on every connect() call ensuring it >>>> is a coprime with respect to the size of the range of ports we want to >>>> scan. This way, we can ensure that all ports within the range are >>>> scanned before returning an error. To enable this algorithm, the user >>>> must configure the new sysctl option "net.ipv4.ip_local_port_step_width". 
>>>>
>>>> In addition, on graphs generated we can observe that the distribution of
>>>> source ports is more even with the proposed approach. [2]
>>>>
>>>> [1] https://0xffsoftware.com/port_graph_current_alg.html
>>>>
>>>> [2] https://0xffsoftware.com/port_graph_random_step_alg.html
>>>>
>>>> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
>>>> ---
>>>>  Documentation/networking/ip-sysctl.rst        |  9 ++++++++
>>>>  .../net_cachelines/netns_ipv4_sysctl.rst      |  1 +
>>>>  include/net/netns/ipv4.h                      |  1 +
>>>>  net/ipv4/inet_hashtables.c                    | 22 ++++++++++++++++---
>>>>  net/ipv4/sysctl_net_ipv4.c                    |  7 ++++++
>>>>  5 files changed, 37 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
>>>> index 6921d8594b84..9e2625ee778c 100644
>>>> --- a/Documentation/networking/ip-sysctl.rst
>>>> +++ b/Documentation/networking/ip-sysctl.rst
>>>> @@ -1612,6 +1612,15 @@ ip_local_reserved_ports - list of comma separated ranges
>>>>
>>>>  	Default: Empty
>>>>
>>>> +ip_local_port_step_width - INTEGER
>>>> +	Defines the numerical maximum increment between successive port
>>>> +	allocations within the ephemeral port range when an unavailable port is
>>>> +	reached. This can be used to mitigate accumulated nodes in port
>>>> +	distribution when reserved ports have been configured. Please note that
>>>> +	port collisions may be more frequent in a system with a very high load.
>>>> +
>>>> +	Default: 0 (disabled)
>>>> +
>>>>  ip_unprivileged_port_start - INTEGER
>>>>  	This is a per-namespace sysctl.  It defines the first
>>>>  	unprivileged port in the network namespace.  Privileged ports
>>>> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> index beaf1880a19b..c0e194a6e4ee 100644
>>>> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
>>>> @@ -47,6 +47,7 @@ u8 sysctl_tcp_ecn
>>>>  u8   sysctl_tcp_ecn_fallback
>>>>  u8   sysctl_ip_default_ttl                          ip4_dst_hoplimit/ip_select_ttl
>>>>  u8   sysctl_ip_no_pmtu_disc
>>>> +u32  sysctl_ip_local_port_step_width
>>>>  u8   sysctl_ip_fwd_use_pmtu         read_mostly     ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
>>>>  u8   sysctl_ip_fwd_update_priority                  ip_forward
>>>>  u8   sysctl_ip_nonlocal_bind
>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>> index 8e971c7bf164..fb7c2235af21 100644
>>>> --- a/include/net/netns/ipv4.h
>>>> +++ b/include/net/netns/ipv4.h
>>>> @@ -166,6 +166,7 @@ struct netns_ipv4 {
>>>>  	u8 sysctl_ip_autobind_reuse;
>>>>  	/* Shall we try to damage output packets if routing dev changes? */
>>>>  	u8 sysctl_ip_dynaddr;
>>>> +	u32 sysctl_ip_local_port_step_width;
>>>>  #ifdef CONFIG_NET_L3_MASTER_DEV
>>>>  	u8 sysctl_raw_l3mdev_accept;
>>>>  #endif
>>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>>> index f5826ec4bcaa..1992dc21818f 100644
>>>> --- a/net/ipv4/inet_hashtables.c
>>>> +++ b/net/ipv4/inet_hashtables.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/wait.h>
>>>>  #include <linux/vmalloc.h>
>>>>  #include <linux/memblock.h>
>>>> +#include <linux/gcd.h>
>>>>
>>>>  #include <net/addrconf.h>
>>>>  #include <net/inet_connection_sock.h>
>>>> @@ -1046,12 +1047,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>  	struct net *net = sock_net(sk);
>>>>  	struct inet_bind2_bucket *tb2;
>>>>  	struct inet_bind_bucket *tb;
>>>> +	int step, scan_step, l3mdev;
>>>> +	u32 index, max_rand_step;
>>>>  	bool tb_created = false;
>>>>  	u32 remaining, offset;
>>>>  	int ret, i, low, high;
>>>>  	bool local_ports;
>>>> -	int step, l3mdev;
>>>> -	u32 index;
>>>>
>>>>  	if (port) {
>>>>  		local_bh_disable();
>>>> @@ -1065,6 +1066,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>
>>>>  	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
>>>>  	step = local_ports ? 1 : 2;
>>>> +	scan_step = step;
>>>> +	max_rand_step = READ_ONCE(net->ipv4.sysctl_ip_local_port_step_width);
>>>>
>>>>  	high++; /* [32768, 60999] -> [32768, 61000[ */
>>>>  	remaining = high - low;
>>>> @@ -1083,9 +1086,22 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>>>  	 */
>>>>  	if (!local_ports)
>>>>  		offset &= ~1U;
>>>> +
>>>> +	if (max_rand_step && remaining > 1) {
>>>> +		u32 range = (step == 1) ? remaining : (remaining / 2);
>>>> +		u32 upper_bound = min(range, max_rand_step);
>>>> +
>>>> +		scan_step = get_random_u32_inclusive(1, upper_bound);
>>>> +		while (gcd(scan_step, range) != 1) {
>>>> +			scan_step++;
>>>
>>> If both scan_step and range are even, an extra
>>> increment here saves 1/2 calls of gcd().
>>>
>>
>> Ah right, thanks!
>>
>>>
>>>> +			if (unlikely(scan_step > upper_bound))
>>>> +				scan_step = 1;
>>>> +		}
>>>> +		scan_step *= step;
>>>> +	}
>>>>  other_parity_scan:
>>>
>>> Doing "other_parity_scan" will be just redundant
>>> unless scan_step is 2 ?
>>>
>>
>> I have tried to preserve the parity behavior. Maybe I missed something,
>> let me explain why it isn't redundant in my opinion.
>>
>> In essence, when calculating the range we first look at "step". If step
>> == 1 we use all the remaining ports as range, otherwise we use remaining/2.
>>
>> If step == 1 we do not care about parity so let's look at step == 2.
>>
>> If step == 2, we calculate a step_scan that is coprime with remaining/2.
>> Once we have it, we multiply it by 2 so we make sure scan_step is even.
>
> Ah, I missed scan_step *= step.  Then looks good.
> Maybe we can set range = remaining / step similarly.

Yes, let's do that.

Thanks!
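The parity argument in this exchange can be checked outside the kernel. The following is a minimal userspace sketch, not the patch itself: gcd_u32(), pick_scan_step() and scan_covers_all() are hypothetical helper names, a caller-supplied seed stands in for get_random_u32_inclusive(), and range is computed as remaining / step as agreed above. It assumes remaining is even when step == 2 and that max_rand_step is non-zero, which mirrors the "max_rand_step && remaining > 1" guard in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Euclid's algorithm; stands in for the kernel's gcd() from <linux/gcd.h>. */
uint32_t gcd_u32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: choose a step in [1, upper_bound] that is coprime
 * with range, wrapping to 1 past upper_bound, then scale it by the parity
 * step. "seed" replaces the random draw so the result is reproducible.
 * Caller must pass max_rand_step >= 1 and remaining >= step.
 */
uint32_t pick_scan_step(uint32_t remaining, uint32_t step,
			uint32_t max_rand_step, uint32_t seed)
{
	uint32_t range = remaining / step;
	uint32_t upper_bound = range < max_rand_step ? range : max_rand_step;
	uint32_t scan_step = 1 + seed % upper_bound;

	/* Terminates: gcd(1, range) == 1, and we wrap back to 1. */
	while (gcd_u32(scan_step, range) != 1) {
		scan_step++;
		if (scan_step > upper_bound)
			scan_step = 1;
	}
	return scan_step * step;
}

/*
 * Hypothetical helper: walk [low, high[ the way the patched loop does and
 * report whether each visited port is new and keeps the parity of offset.
 * Requires offset < high - low and scan_step <= high - low.
 */
bool scan_covers_all(uint32_t low, uint32_t high, uint32_t offset,
		     uint32_t step, uint32_t scan_step)
{
	uint32_t remaining = high - low;
	uint32_t port = low + offset;
	bool *seen = calloc(remaining, sizeof(*seen));
	bool ok = seen != NULL;
	uint32_t i;

	for (i = 0; ok && i < remaining; i += step, port += scan_step) {
		if (port >= high)
			port -= remaining;
		if (seen[port - low] || (port - low) % step != offset % step)
			ok = false;	/* revisited a port or changed parity */
		seen[port - low] = true;
	}
	free(seen);
	return ok;
}
```

Because the loop runs remaining / step times, "no port visited twice" implies every port of the starting parity is visited once. A step that is not coprime with the range fails the check: scan_covers_all(32768, 32778, 0, 1, 4) returns false, since with 10 ports and a step of 4 the walk is back at the first port after five steps.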
end of thread, other threads:[~2026-02-26 10:39 UTC | newest]

Thread overview: 5+ messages
2026-02-24 15:05 [PATCH net-next] inet: add ip_local_port_step_width sysctl to improve port usage distribution Fernando Fernandez Mancera
2026-02-25  6:28 ` Kuniyuki Iwashima
2026-02-25 10:02   ` Fernando Fernandez Mancera
2026-02-25 17:33     ` Kuniyuki Iwashima
2026-02-26 10:39       ` Fernando Fernandez Mancera