* [PATCH net-next v3 0/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
@ 2025-07-14 16:03 Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 1/3] tcp: Add RCU management to inet_bind2_bucket Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-14 16:03 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team, Lee Valentine
Please see patch 2 for details.
I stress tested it following the recipe from commit 86c2bc293b81 ("tcp: use
RCU lookup in __inet_hash_connect()"). Didn't notice any regression in
latency_mean or throughput.
Test command:
$ vng -r ~/src/linux 'ulimit -n 40000; \
./tcp_crr --nolog -6 -T 80 -F 12000 >/dev/null & \
./tcp_crr --nolog -6 -T 80 -F 12000 -c -H ::1 -l 120 --ip-local-port-range'
neper was patched to setsockopt(IP_LOCAL_PORT_RANGE, 1 | 65535 << 16) when
the --ip-local-port-range flag is set.
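For reference, the gist of that neper tweak looks roughly like this (a sketch,
not the actual neper diff; IP_LOCAL_PORT_RANGE takes a u32 with the low port
in the lower 16 bits and the high port in the upper 16 bits):

  #include <stdint.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  #ifndef IP_LOCAL_PORT_RANGE
  #define IP_LOCAL_PORT_RANGE 51
  #endif

  /* Constrain the ephemeral port search for this socket to [lo, hi]. */
  static int set_local_port_range(int fd, uint16_t lo, uint16_t hi)
  {
          uint32_t range = lo | ((uint32_t)hi << 16);

          return setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE,
                            &range, sizeof(range));
  }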
---
Changes in v3:
- Make (struct inet_bind_bucket *)->bhash2 RCU safe (patch 1)
- Always skip inet_bind2_buckets with v6 wildcard sockets
- Link to v2: https://lore.kernel.org/r/20250703-connect-port-search-harder-v2-1-d51bce6bd0a6@cloudflare.com
Changes in v2:
- Fix unused var warning when CONFIG_IPV6=n
- Convert INADDR_ANY to network byte order before comparison
- Link to v1: https://lore.kernel.org/r/20250626120247.1255223-1-jakub@cloudflare.com
---
To: Eric Dumazet <edumazet@google.com>
To: Paolo Abeni <pabeni@redhat.com>
To: David S. Miller <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
To: Neal Cardwell <ncardwell@google.com>
To: Kuniyuki Iwashima <kuniyu@google.com>
Cc: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com
---
Jakub Sitnicki (3):
tcp: Add RCU management to inet_bind2_bucket
tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
include/net/inet_hashtables.h | 4 +-
include/net/inet_timewait_sock.h | 3 +-
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/inet_hashtables.c | 72 ++-
net/ipv4/inet_timewait_sock.c | 8 +-
tools/testing/selftests/net/ip_local_port_range.c | 524 +++++++++++++++++++++
tools/testing/selftests/net/ip_local_port_range.sh | 14 +-
7 files changed, 602 insertions(+), 25 deletions(-)
* [PATCH net-next v3 1/3] tcp: Add RCU management to inet_bind2_bucket
2025-07-14 16:03 [PATCH net-next v3 0/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
@ 2025-07-14 16:03 ` Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios " Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-14 16:03 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team
Add RCU protection to inet_bind2_bucket structure akin to commit
d186f405fdf4 ("tcp: add RCU management to inet_bind_bucket").
This prepares us for walking (struct inet_bind_bucket *)->bhash2 list
without holding inet_bind_hashbucket spinlock.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/net/inet_hashtables.h | 4 ++--
include/net/inet_timewait_sock.h | 3 +--
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/inet_hashtables.c | 16 ++++++++--------
net/ipv4/inet_timewait_sock.c | 8 +++-----
5 files changed, 15 insertions(+), 18 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 19dbd9081d5a..a2ff18eea990 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -108,6 +108,7 @@ struct inet_bind2_bucket {
struct hlist_node bhash_node;
/* List of sockets hashed to this bucket */
struct hlist_head owners;
+ struct rcu_head rcu;
};
static inline struct net *ib_net(const struct inet_bind_bucket *ib)
@@ -228,8 +229,7 @@ inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net,
struct inet_bind_bucket *tb,
const struct sock *sk);
-void inet_bind2_bucket_destroy(struct kmem_cache *cachep,
- struct inet_bind2_bucket *tb);
+void inet_bind2_bucket_destroy(struct inet_bind2_bucket *tb);
struct inet_bind2_bucket *
inet_bind2_bucket_find(const struct inet_bind_hashbucket *head,
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 67a313575780..4f4e96b10cf3 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -92,8 +92,7 @@ static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
void inet_twsk_free(struct inet_timewait_sock *tw);
void inet_twsk_put(struct inet_timewait_sock *tw);
-void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
- struct inet_hashinfo *hashinfo);
+void inet_twsk_bind_unhash(struct inet_timewait_sock *tw);
struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
struct inet_timewait_death_row *dr,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1e2df51427fe..f58b93d3fa0e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -593,7 +593,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
fail_unlock:
if (ret) {
if (bhash2_created)
- inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, tb2);
+ inet_bind2_bucket_destroy(tb2);
if (bhash_created)
inet_bind_bucket_destroy(tb);
}
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..d3ce6d0a514e 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -123,7 +123,7 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb2,
#endif
INIT_HLIST_HEAD(&tb2->owners);
hlist_add_head(&tb2->node, &head->chain);
- hlist_add_head(&tb2->bhash_node, &tb->bhash2);
+ hlist_add_head_rcu(&tb2->bhash_node, &tb->bhash2);
}
struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
@@ -141,12 +141,12 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
}
/* Caller must hold hashbucket lock for this tb with local BH disabled */
-void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
+void inet_bind2_bucket_destroy(struct inet_bind2_bucket *tb)
{
if (hlist_empty(&tb->owners)) {
__hlist_del(&tb->node);
- __hlist_del(&tb->bhash_node);
- kmem_cache_free(cachep, tb);
+ hlist_del_rcu(&tb->bhash_node);
+ kfree_rcu(tb, rcu);
}
}
@@ -198,7 +198,7 @@ static void __inet_put_port(struct sock *sk)
__sk_del_bind_node(sk);
inet_csk(sk)->icsk_bind2_hash = NULL;
- inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
+ inet_bind2_bucket_destroy(tb2);
}
spin_unlock(&head2->lock);
@@ -951,7 +951,7 @@ static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family,
spin_lock(&head2->lock);
__sk_del_bind_node(sk);
- inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, inet_csk(sk)->icsk_bind2_hash);
+ inet_bind2_bucket_destroy(inet_csk(sk)->icsk_bind2_hash);
spin_unlock(&head2->lock);
if (reset)
@@ -1154,7 +1154,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
}
if (tw)
- inet_twsk_bind_unhash(tw, hinfo);
+ inet_twsk_bind_unhash(tw);
spin_unlock(&head2->lock);
spin_unlock(&head->lock);
@@ -1179,7 +1179,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
inet_sk(sk)->inet_num = 0;
if (tw)
- inet_twsk_bind_unhash(tw, hinfo);
+ inet_twsk_bind_unhash(tw);
}
spin_unlock(&head2->lock);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 875ff923a8ed..ff286b179f4a 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -20,14 +20,12 @@
/**
* inet_twsk_bind_unhash - unhash a timewait socket from bind hash
* @tw: timewait socket
- * @hashinfo: hashinfo pointer
*
* unhash a timewait socket from bind hash, if hashed.
* bind hash lock must be held by caller.
* Returns 1 if caller should call inet_twsk_put() after lock release.
*/
-void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
- struct inet_hashinfo *hashinfo)
+void inet_twsk_bind_unhash(struct inet_timewait_sock *tw)
{
struct inet_bind2_bucket *tb2 = tw->tw_tb2;
struct inet_bind_bucket *tb = tw->tw_tb;
@@ -38,7 +36,7 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
__sk_del_bind_node((struct sock *)tw);
tw->tw_tb = NULL;
tw->tw_tb2 = NULL;
- inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
+ inet_bind2_bucket_destroy(tb2);
inet_bind_bucket_destroy(tb);
__sock_put((struct sock *)tw);
@@ -63,7 +61,7 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
spin_lock(&bhead->lock);
spin_lock(&bhead2->lock);
- inet_twsk_bind_unhash(tw, hashinfo);
+ inet_twsk_bind_unhash(tw);
spin_unlock(&bhead2->lock);
spin_unlock(&bhead->lock);
--
2.43.0
* [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
2025-07-14 16:03 [PATCH net-next v3 0/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 1/3] tcp: Add RCU management to inet_bind2_bucket Jakub Sitnicki
@ 2025-07-14 16:03 ` Jakub Sitnicki
2025-07-17 9:23 ` Paolo Abeni
2025-07-14 16:03 ` [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios " Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-14 16:03 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team, Lee Valentine
Situation
---------
We observe the following scenario in production:
inet_bind_bucket
state for port 54321
--------------------
(bucket doesn't exist)
// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
tb->reuse = -1
tb->reuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.
// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
tb->reuse = 0
tb->reuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()
// bucket remains in this
// state even though port
// was released by s2
tb->reuse = 0
tb->reuseport = 0
// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.
s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)
Problem
-------
We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.
The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.
Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, the reality is that such implementations are not
actually uncommon. Trying to fix each and every such program is like
playing whack-a-mole.
For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:
dialer := &net.Dialer{
LocalAddr: &net.TCPAddr{IP: srcIP},
}
conn, err := dialer.Dial("tcp4", dialTarget)
Or even a ubiquitous tool like dig when using a specific local address:
$ dig -b 127.1.1.1 +tcp +short example.com
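At the socket API level, the pattern boils down to roughly the following
(illustrative sketch with a hypothetical helper; error handling omitted):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdint.h>
  #include <sys/socket.h>

  /* Bind to a source IP with port 0 but *without* IP_BIND_ADDRESS_NO_PORT:
   * the ephemeral port is chosen already at bind() time, and the bind
   * bucket for that port ends up in a non-reusable state that outlives
   * this socket (see the scenario above).
   */
  static int dial_from(const char *src_ip, const char *dst_ip, uint16_t dst_port)
  {
          struct sockaddr_in src = { .sin_family = AF_INET };
          struct sockaddr_in dst = { .sin_family = AF_INET,
                                     .sin_port = htons(dst_port) };
          int one = 1;
          int fd;

          inet_pton(AF_INET, src_ip, &src.sin_addr);
          inet_pton(AF_INET, dst_ip, &dst.sin_addr);

          fd = socket(AF_INET, SOCK_STREAM, 0);
          setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
          bind(fd, (struct sockaddr *)&src, sizeof(src));
          connect(fd, (struct sockaddr *)&dst, sizeof(dst));
          return fd;
  }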
Hence, we are proposing a systematic fix in the network stack itself.
Solution
--------
If there is no IP address conflict with any socket bound to a given local
port, then from the protocol's perspective, the port can be safely shared.
With that in mind, modify the port search during connect(), that is
__inet_hash_connect, to consider all bind buckets (ports) when looking for
a local port for egress.
To achieve this, add an extra walk over bhash2 buckets for the port to
check for IP conflicts. The additional walk is not free, so perform it only
once per port - during the second phase of conflict checking, when the
bhash bucket is locked.
We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
option is set. The rationale is that users are likely to care about using
every possible local port only when they have deliberately constrained the
ephemeral port range.
Limitations
-----------
1) Connected v4 sockets could share the local port with v6
non-dualstack (V6ONLY) sockets, but that would require traversing
(struct inet_bind2_bucket *)->owners, which is not RCU safe.
2) Sharing both the local IP and port with other established sockets, even
when the remote address is unique, is still not possible when the bucket is in
a non-reusable state (tb->{fastreuse,fastreuseport} >= 0) due to a socket
explicitly bound to that port.
Alternatives
------------
* Update bind bucket state on port release
A valid solution to the described problem would also be to walk the bind
bucket owners when releasing the port and recalculate the
tb->{reuse,reuseport} state.
However, in comparison to the proposed solution, this alone would not allow
sharing the local port with other sockets bound to non-conflicting IPs for
as long as they exist.
Another downside is that we'd pay the extra cost on each unbind (more
frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
set (less frequent). Due to that we would also likely need to guard it
behind a sysctl (see below).
* Run your service in a dedicated netns
This would also solve the problem, and we don't rule out transitioning to
this setup in the future, at the cost of shifting the complexity elsewhere.
Isolating your service in a netns requires assigning it dedicated IPs for
egress. If the egress IPs must be shared with other processes, as in our
case, then SNAT and connection tracking on egress are required - adding
complexity.
* Guard it behind a sysctl setting instead of a socket option
Users are unlikely to encounter this problem unless their workload connects
from a severely narrowed-down ephemeral port range. Hence, paying the bind
bucket walk cost for each connect() call doesn't seem sensible. With a socket
option, by contrast, only a limited set of connections incurs the
performance overhead.
Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
net/ipv4/inet_hashtables.c | 56 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d3ce6d0a514e..9d8a9c7c8274 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1005,6 +1005,52 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
#define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
static u32 *table_perturb;
+/* True on source address conflict with another socket. False otherwise. */
+static inline bool check_bind2_bucket(const struct sock *sk,
+ const struct inet_bind2_bucket *tb2)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+ if (sk->sk_family == AF_INET6) {
+ return tb2->addr_type == IPV6_ADDR_ANY ||
+ ipv6_addr_equal(&tb2->v6_rcv_saddr,
+ &sk->sk_v6_rcv_saddr);
+ }
+
+ /* Assume there might be a non-V6ONLY wildcard socket,
+ * since walking tb2->owners is not RCU safe.
+ */
+ if (tb2->addr_type == IPV6_ADDR_ANY)
+ return true;
+
+ if (tb2->addr_type != IPV6_ADDR_MAPPED)
+ return false;
+#endif
+ return tb2->rcv_saddr == htonl(INADDR_ANY) ||
+ tb2->rcv_saddr == sk->sk_rcv_saddr;
+}
+
+static inline bool check_bind_bucket_rcu(const struct sock *sk,
+ const struct inet_bind_bucket *tb)
+{
+ const struct inet_bind2_bucket *tb2;
+
+ hlist_for_each_entry_rcu(tb2, &tb->bhash2, bhash_node)
+ if (check_bind2_bucket(sk, tb2))
+ return true;
+ return false;
+}
+
+static inline bool check_bind_bucket(const struct sock *sk,
+ const struct inet_bind_bucket *tb)
+{
+ const struct inet_bind2_bucket *tb2;
+
+ hlist_for_each_entry(tb2, &tb->bhash2, bhash_node)
+ if (check_bind2_bucket(sk, tb2))
+ return true;
+ return false;
+}
+
int __inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk, u64 port_offset,
u32 hash_port0,
@@ -1070,6 +1116,9 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
if (!inet_bind_bucket_match(tb, net, port, l3mdev))
continue;
if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
+ if (unlikely(local_ports &&
+ !check_bind_bucket_rcu(sk, tb)))
+ break;
rcu_read_unlock();
goto next_port;
}
@@ -1088,9 +1137,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
*/
inet_bind_bucket_for_each(tb, &head->chain) {
if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
- if (tb->fastreuse >= 0 ||
- tb->fastreuseport >= 0)
+ if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
+ if (unlikely(local_ports &&
+ !check_bind_bucket(sk, tb)))
+ goto ok;
goto next_port_unlock;
+ }
WARN_ON(hlist_empty(&tb->bhash2));
if (!check_established(death_row, sk,
port, &tw, false,
--
2.43.0
* [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
2025-07-14 16:03 [PATCH net-next v3 0/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 1/3] tcp: Add RCU management to inet_bind2_bucket Jakub Sitnicki
2025-07-14 16:03 ` [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
@ 2025-07-14 16:03 ` Jakub Sitnicki
2025-07-17 9:27 ` Paolo Abeni
2025-07-17 9:34 ` Paolo Abeni
From: Jakub Sitnicki @ 2025-07-14 16:03 UTC (permalink / raw)
To: Eric Dumazet, Paolo Abeni, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team
Expand the ip_local_port_range tests to check that when using
IP_LOCAL_PORT_RANGE socket option:
1) We can share the local port as long as there is no IP address conflict
with any other socket. Covered by tcp_port_reuse__no_ip_conflict* tests.
2) We cannot share the local port with wildcard sockets or when there is a
local IP conflict. Covered by tcp_port_reuse__ip_conflict* tests.
3) We cannot share the local IP and port to connect to different remote IPs
if the port bucket is in a non-reusable state. Corner case covered by the
tcp_port_reuse__ip_port_conflict_with_unique_dst_after_bind test.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
tools/testing/selftests/net/ip_local_port_range.c | 524 +++++++++++++++++++++
tools/testing/selftests/net/ip_local_port_range.sh | 14 +-
2 files changed, 533 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/ip_local_port_range.c b/tools/testing/selftests/net/ip_local_port_range.c
index 29451d2244b7..d5ff64c14132 100644
--- a/tools/testing/selftests/net/ip_local_port_range.c
+++ b/tools/testing/selftests/net/ip_local_port_range.c
@@ -9,6 +9,7 @@
#include <fcntl.h>
#include <netinet/ip.h>
+#include <arpa/inet.h>
#include "../kselftest_harness.h"
@@ -20,6 +21,15 @@
#define IPPROTO_MPTCP 262
#endif
+static const int ONE = 1;
+
+__attribute__((nonnull)) static inline void close_fd(int *fd)
+{
+ close(*fd);
+}
+
+#define __close_fd __attribute__((cleanup(close_fd)))
+
static __u32 pack_port_range(__u16 lo, __u16 hi)
{
return (hi << 16) | (lo << 0);
@@ -116,6 +126,81 @@ static int get_ip_local_port_range(int fd, __u32 *range)
return 0;
}
+struct sockaddr_inet {
+ union {
+ struct sockaddr_storage ss;
+ struct sockaddr_in6 v6;
+ struct sockaddr_in v4;
+ struct sockaddr sa;
+ };
+ socklen_t len;
+};
+
+static void make_inet_addr(int af, const char *ip, __u16 port,
+ struct sockaddr_inet *addr)
+{
+ memset(addr, 0, sizeof(*addr));
+
+ switch (af) {
+ case AF_INET:
+ addr->len = sizeof(addr->v4);
+ addr->v4.sin_family = af;
+ addr->v4.sin_port = htons(port);
+ inet_pton(af, ip, &addr->v4.sin_addr);
+ break;
+ case AF_INET6:
+ addr->len = sizeof(addr->v6);
+ addr->v6.sin6_family = af;
+ addr->v6.sin6_port = htons(port);
+ inet_pton(af, ip, &addr->v6.sin6_addr);
+ break;
+ }
+}
+
+static bool is_v4mapped(const struct sockaddr_inet *a)
+{
+ return (a->sa.sa_family == AF_INET6 &&
+ IN6_IS_ADDR_V4MAPPED(&a->v6.sin6_addr));
+}
+
+static void v4mapped_to_ipv4(struct sockaddr_inet *a)
+{
+ in_port_t port = a->v6.sin6_port;
+ in_addr_t ip4 = *(in_addr_t *)&a->v6.sin6_addr.s6_addr[12];
+
+ memset(a, 0, sizeof(*a));
+ a->len = sizeof(a->v4);
+ a->v4.sin_family = AF_INET;
+ a->v4.sin_port = port;
+ a->v4.sin_addr.s_addr = ip4;
+}
+
+static void ipv4_to_v4mapped(struct sockaddr_inet *a)
+{
+ in_port_t port = a->v4.sin_port;
+ in_addr_t ip4 = a->v4.sin_addr.s_addr;
+
+ memset(a, 0, sizeof(*a));
+ a->len = sizeof(a->v6);
+ a->v6.sin6_family = AF_INET6;
+ a->v6.sin6_port = port;
+ a->v6.sin6_addr.s6_addr[10] = 0xff;
+ a->v6.sin6_addr.s6_addr[11] = 0xff;
+ memcpy(&a->v6.sin6_addr.s6_addr[12], &ip4, sizeof(ip4));
+}
+
+static __u16 inet_port(const struct sockaddr_inet *a)
+{
+ switch (a->sa.sa_family) {
+ case AF_INET:
+ return ntohs(a->v4.sin_port);
+ case AF_INET6:
+ return ntohs(a->v6.sin6_port);
+ default:
+ return 0;
+ }
+}
+
FIXTURE(ip_local_port_range) {};
FIXTURE_SETUP(ip_local_port_range)
@@ -460,4 +545,443 @@ TEST_F(ip_local_port_range, get_port_range)
ASSERT_TRUE(!err) TH_LOG("close failed");
}
+FIXTURE(tcp_port_reuse__no_ip_conflict) {};
+FIXTURE_SETUP(tcp_port_reuse__no_ip_conflict) {}
+FIXTURE_TEARDOWN(tcp_port_reuse__no_ip_conflict) {}
+
+FIXTURE_VARIANT(tcp_port_reuse__no_ip_conflict) {
+ int af_one;
+ const char *ip_one;
+ int af_two;
+ const char *ip_two;
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__no_ip_conflict, ipv4) {
+ .af_one = AF_INET,
+ .ip_one = "127.0.0.1",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.2",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__no_ip_conflict, ipv6_v4mapped) {
+ .af_one = AF_INET6,
+ .ip_one = "::ffff:127.0.0.1",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.2",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__no_ip_conflict, ipv6) {
+ .af_one = AF_INET6,
+ .ip_one = "2001:db8::1",
+ .af_two = AF_INET6,
+ .ip_two = "2001:db8::2",
+};
+
+/* Check that a connected socket, which is using IP_LOCAL_PORT_RANGE to relax
+ * port search restrictions at connect() time, can share a local port with a
+ * listening socket bound to a different IP.
+ */
+TEST_F(tcp_port_reuse__no_ip_conflict, share_port_with_listening_socket)
+{
+ const typeof(variant) v = variant;
+ struct sockaddr_inet addr;
+ __close_fd int ln = -1;
+ __close_fd int c = -1;
+ __close_fd int p = -1;
+ __u32 range;
+ int r;
+
+ /* Listen on <ip one>:40000 */
+ ln = socket(v->af_one, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket");
+
+ r = setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(v->af_one, v->ip_one, 40000, &addr);
+ r = bind(ln, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(<ip_one>:40000)");
+
+ r = listen(ln, 1);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Connect from <ip two>:40000 to <ip one>:40000 */
+ c = socket(v->af_two, SOCK_STREAM, 0);
+ ASSERT_GE(c, 0) TH_LOG("socket");
+
+ r = setsockopt(c, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_BIND_ADDRESS_NO_PORT)");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(v->af_two, v->ip_two, 0, &addr);
+ r = bind(c, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(<ip_two>:0)");
+
+ make_inet_addr(v->af_one, v->ip_one, 40000, &addr);
+ if (is_v4mapped(&addr))
+ v4mapped_to_ipv4(&addr);
+ r = connect(c, &addr.sa, addr.len);
+ EXPECT_EQ(r, 0) TH_LOG("connect(<ip_one>:40000)");
+ EXPECT_EQ(get_sock_port(c), 40000);
+}
+
+/* Check that a connected socket, which is using IP_LOCAL_PORT_RANGE to relax
+ * port search restrictions at connect() time, can share a local port with
+ * another connected socket bound to a different IP without
+ * IP_BIND_ADDRESS_NO_PORT enabled.
+ */
+TEST_F(tcp_port_reuse__no_ip_conflict, share_port_with_connected_socket)
+{
+ const typeof(variant) v = variant;
+ struct sockaddr_inet dst = {};
+ struct sockaddr_inet src = {};
+ __close_fd int ln = -1;
+ __close_fd int c1 = -1;
+ __close_fd int c2 = -1;
+ __u32 range;
+ __u16 port;
+ int r;
+
+ /* Listen on wildcard. Same family as <ip_two>. */
+ ln = socket(v->af_two, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket");
+
+ r = setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR");
+
+ r = listen(ln, 2);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ dst.len = sizeof(dst.ss);
+ r = getsockname(ln, &dst.sa, &dst.len);
+ ASSERT_EQ(r, 0) TH_LOG("getsockname");
+
+ /* Connect from <ip one> but without IP_BIND_ADDRESS_NO_PORT */
+ c1 = socket(v->af_one, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket");
+
+ make_inet_addr(v->af_one, v->ip_one, 0, &src);
+ r = bind(c1, &src.sa, src.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind");
+
+ if (src.sa.sa_family == AF_INET6 && dst.sa.sa_family == AF_INET)
+ ipv4_to_v4mapped(&dst);
+ r = connect(c1, &dst.sa, dst.len);
+ ASSERT_EQ(r, 0) TH_LOG("connect");
+
+ src.len = sizeof(src.ss);
+ r = getsockname(c1, &src.sa, &src.len);
+ ASSERT_EQ(r, 0) TH_LOG("getsockname");
+
+ /* Connect from <ip two>:<c1 port> with IP_BIND_ADDRESS_NO_PORT */
+ c2 = socket(v->af_two, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket");
+
+ r = setsockopt(c2, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_BIND_ADDRESS_NO_PORT)");
+
+ port = inet_port(&src);
+ range = pack_port_range(port, port);
+ r = setsockopt(c2, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(v->af_two, v->ip_two, 0, &src);
+ r = bind(c2, &src.sa, src.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind");
+
+ if (is_v4mapped(&dst))
+ v4mapped_to_ipv4(&dst);
+ r = connect(c2, &dst.sa, dst.len);
+ EXPECT_EQ(r, 0) TH_LOG("connect");
+ EXPECT_EQ(get_sock_port(c2), port);
+}
+
+/* Check that two sockets can share the local IP and the ephemeral port when the
+ * destination address differs.
+ */
+TEST(tcp_port_reuse__no_ip_conflict_with_unique_dst)
+{
+ struct sockaddr_inet addr;
+ __close_fd int ln = -1;
+ __close_fd int c1 = -1;
+ __close_fd int c2 = -1;
+ __u32 range;
+ int r;
+
+ /* Listen on 0.0.0.0:30000 */
+ ln = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket");
+
+ r = setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(AF_INET, "0.0.0.0", 30000, &addr);
+ r = bind(ln, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind");
+
+ r = listen(ln, 2);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Connect from 127.0.0.1:40000 to 127.1.1.1:30000 */
+ c1 = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c1, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(AF_INET, "127.1.1.1", 30000, &addr);
+ r = connect(c1, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("connect(127.1.1.1:30000)");
+ ASSERT_EQ(get_sock_port(c1), 40000);
+
+ /* Connect from 127.0.0.1:40000 to 127.2.2.2:30000 */
+ c2 = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c2, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(AF_INET, "127.2.2.2", 30000, &addr);
+ r = connect(c2, &addr.sa, addr.len);
+ EXPECT_EQ(r, 0) TH_LOG("connect(127.2.2.2:30000)");
+ EXPECT_EQ(get_sock_port(c2), 40000);
+}
+
+FIXTURE(tcp_port_reuse__ip_conflict) {};
+FIXTURE_SETUP(tcp_port_reuse__ip_conflict) {}
+FIXTURE_TEARDOWN(tcp_port_reuse__ip_conflict) {}
+
+FIXTURE_VARIANT(tcp_port_reuse__ip_conflict) {
+ int af_one;
+ const char *ip_one;
+ int af_two;
+ const char *ip_two;
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv4) {
+ .af_one = AF_INET,
+ .ip_one = "127.0.0.1",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv6_v4mapped) {
+ .af_one = AF_INET6,
+ .ip_one = "::ffff:127.0.0.1",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv6) {
+ .af_one = AF_INET6,
+ .ip_one = "2001:db8::1",
+ .af_two = AF_INET6,
+ .ip_two = "2001:db8::1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv4_wildcard) {
+ .af_one = AF_INET,
+ .ip_one = "0.0.0.0",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv6_v4mapped_wildcard) {
+ .af_one = AF_INET6,
+ .ip_one = "::ffff:0.0.0.0",
+ .af_two = AF_INET,
+ .ip_two = "127.0.0.1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, ipv6_wildcard) {
+ .af_one = AF_INET6,
+ .ip_one = "::",
+ .af_two = AF_INET6,
+ .ip_two = "2001:db8::1",
+};
+
+FIXTURE_VARIANT_ADD(tcp_port_reuse__ip_conflict, dualstack_wildcard) {
+ .af_one = AF_INET6,
+ .ip_one = "::",
+ .af_two = AF_INET6,
+ .ip_two = "127.0.0.1",
+};
+
+/* Check that a socket, which is using IP_LOCAL_PORT_RANGE to relax local port
+ * search restrictions at connect() time, can't share a local port with a
+ * listening socket when there is IP address conflict.
+ */
+TEST_F(tcp_port_reuse__ip_conflict, cannot_share_port)
+{
+ const typeof(variant) v = variant;
+ struct sockaddr_inet dst, src;
+ __close_fd int ln = -1;
+ __close_fd int c = -1;
+ __u32 range;
+ int r;
+
+ /* Listen on <ip_one>:40000 */
+ ln = socket(v->af_one, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket");
+
+ r = setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(v->af_one, v->ip_one, 40000, &dst);
+ r = bind(ln, &dst.sa, dst.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(<ip_one>:40000)");
+
+ r = listen(ln, 1);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Attempt to connect from <ip two>:40000 */
+ c = socket(v->af_two, SOCK_STREAM, 0);
+ ASSERT_GE(c, 0) TH_LOG("socket");
+
+ r = setsockopt(c, SOL_IP, IP_BIND_ADDRESS_NO_PORT, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_BIND_ADDRESS_NO_PORT)");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(v->af_two, v->ip_two, 0, &src);
+ r = bind(c, &src.sa, src.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(<ip_two>:40000)");
+
+ if (is_v4mapped(&dst))
+ v4mapped_to_ipv4(&dst);
+ r = connect(c, &dst.sa, dst.len);
+ EXPECT_EQ(r, -1) TH_LOG("connect(*:40000)");
+ EXPECT_EQ(errno, EADDRNOTAVAIL);
+}
+
+/* Demonstrate a limitation - a local port cannot be shared by IPv4 sockets with
+ * an IPv6 wildcard socket which is v6-only (i.e., not dualstack).
+ */
+TEST(tcp_port_reuse__ip_conflict_wildcard_v6only)
+{
+ struct sockaddr_inet addr;
+ __close_fd int ln4 = -1;
+ __close_fd int ln6 = -1;
+ __close_fd int c = -1;
+ __u32 range;
+ int r;
+
+ /* Listen on [::]:40000 (v6only) */
+ ln6 = socket(AF_INET6, SOCK_STREAM, 0);
+ ASSERT_GE(ln6, 0) TH_LOG("socket");
+
+ r = setsockopt(ln6, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ r = setsockopt(ln6, IPPROTO_IPV6, IPV6_V6ONLY, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IPV6_V6ONLY)");
+
+ make_inet_addr(AF_INET6, "::", 40000, &addr);
+ r = bind(ln6, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind([::]:40000)");
+
+ r = listen(ln6, 1);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Listen on 127.0.0.1:30000 */
+ ln4 = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(ln4, 0) TH_LOG("socket");
+
+ r = setsockopt(ln4, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(AF_INET, "127.0.0.1", 30000, &addr);
+ r = bind(ln4, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(127.0.0.1:30000)");
+
+ r = listen(ln4, 1);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Connect from 127.0.0.1:40000 to 127.0.0.1:30000 */
+ c = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(c, 0) TH_LOG("socket");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ r = connect(c, &addr.sa, addr.len);
+ EXPECT_EQ(r, -1) TH_LOG("connect(127.0.0.1:30000)");
+ EXPECT_EQ(errno, EADDRNOTAVAIL);
+}
+
+/* Demonstrate a limitation - a local IP and port can't be shared any more, even
+ * when the remote address is unique, after explicitly binding to that port.
+ */
+TEST(tcp_port_reuse__ip_port_conflict_with_unique_dst_after_bind)
+{
+ struct sockaddr_inet addr;
+ __close_fd int ln = -1;
+ __close_fd int c1 = -1;
+ __close_fd int c2 = -1;
+ __u32 range;
+ int s, r;
+
+ /* Listen on 0.0.0.0:30000 */
+ ln = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(ln, 0) TH_LOG("socket");
+
+ r = setsockopt(ln, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(AF_INET, "0.0.0.0", 30000, &addr);
+ r = bind(ln, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(0.0.0.0:30000)");
+
+ r = listen(ln, 2);
+ ASSERT_EQ(r, 0) TH_LOG("listen");
+
+ /* Connect from 127.0.0.1:40000 to 127.1.1.1:30000 */
+ c1 = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(c1, 0) TH_LOG("socket");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c1, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(AF_INET, "127.1.1.1", 30000, &addr);
+ r = connect(c1, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("connect(127.1.1.1:30000)");
+ ASSERT_EQ(get_sock_port(c1), 40000);
+
+ /* Block the port. Bind to 127.9.9.9:40000 and unbind immediately */
+ s = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(s, 0) TH_LOG("socket");
+
+ r = setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &ONE, sizeof(ONE));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(SO_REUSEADDR)");
+
+ make_inet_addr(AF_INET, "127.9.9.9", 40000, &addr);
+ r = bind(s, &addr.sa, addr.len);
+ ASSERT_EQ(r, 0) TH_LOG("bind(127.9.9.9:40000)");
+
+ r = close(s);
+ ASSERT_EQ(r, 0) TH_LOG("close");
+
+ /* Connect from 127.0.0.1:40000 to 127.2.2.2:30000 */
+ c2 = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(c2, 0) TH_LOG("socket");
+
+ range = pack_port_range(40000, 40000);
+ r = setsockopt(c2, SOL_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+ ASSERT_EQ(r, 0) TH_LOG("setsockopt(IP_LOCAL_PORT_RANGE)");
+
+ make_inet_addr(AF_INET, "127.2.2.2", 30000, &addr);
+ r = connect(c2, &addr.sa, addr.len);
+ EXPECT_EQ(r, -1) TH_LOG("connect(127.2.2.2:30000)");
+ EXPECT_EQ(errno, EADDRNOTAVAIL);
+}
+
TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh
index 4ff746db1256..3fc151545b2d 100755
--- a/tools/testing/selftests/net/ip_local_port_range.sh
+++ b/tools/testing/selftests/net/ip_local_port_range.sh
@@ -1,7 +1,11 @@
-#!/bin/sh
+#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
-./in_netns.sh \
- sh -c 'sysctl -q -w net.mptcp.enabled=1 && \
- sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && \
- ./ip_local_port_range'
+./in_netns.sh sh <(cat <<-EOF
+ sysctl -q -w net.mptcp.enabled=1
+ sysctl -q -w net.ipv4.ip_local_port_range="40000 49999"
+ ip -6 addr add dev lo 2001:db8::1/32 nodad
+ ip -6 addr add dev lo 2001:db8::2/32 nodad
+ exec ./ip_local_port_range
+EOF
+)
--
2.43.0
* Re: [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
2025-07-14 16:03 ` [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE Jakub Sitnicki
@ 2025-07-17 9:23 ` Paolo Abeni
2025-07-17 9:44 ` Jakub Sitnicki
From: Paolo Abeni @ 2025-07-17 9:23 UTC (permalink / raw)
To: Jakub Sitnicki, Eric Dumazet, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team, Lee Valentine
On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
> Solution
> --------
>
> If there is no IP address conflict with any socket bound to a given local
> port, then from the protocol's perspective, the port can be safely shared.
>
> With that in mind, modify the port search during connect(), that is
> __inet_hash_connect, to consider all bind buckets (ports) when looking for
> a local port for egress.
>
> To achieve this, add an extra walk over bhash2 buckets for the port to
> check for IP conflicts. The additional walk is not free, so perform it only
> once per port - during the second phase of conflict checking, when the
> bhash bucket is locked.
>
> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
> option is set. The rationale is that users are likely to care about using
> every possible local port only when they have deliberately constrained the
> ephemeral port range.
I'm not a big fan of piggybacking additional semantic on existing
socketopt, have you considered a new one?
At very least you will need to update the man page.
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index d3ce6d0a514e..9d8a9c7c8274 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1005,6 +1005,52 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
> #define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
> static u32 *table_perturb;
>
> +/* True on source address conflict with another socket. False otherwise. */
> +static inline bool check_bind2_bucket(const struct sock *sk,
> + const struct inet_bind2_bucket *tb2)
Please no inline function in c files.
Thanks,
Paolo
* Re: [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
2025-07-14 16:03 ` [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios " Jakub Sitnicki
@ 2025-07-17 9:27 ` Paolo Abeni
2025-07-17 9:44 ` Jakub Sitnicki
2025-07-17 9:34 ` Paolo Abeni
From: Paolo Abeni @ 2025-07-17 9:27 UTC (permalink / raw)
To: Jakub Sitnicki, Eric Dumazet, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team
On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
> diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh
> index 4ff746db1256..3fc151545b2d 100755
> --- a/tools/testing/selftests/net/ip_local_port_range.sh
> +++ b/tools/testing/selftests/net/ip_local_port_range.sh
> @@ -1,7 +1,11 @@
> -#!/bin/sh
> +#!/bin/bash
> # SPDX-License-Identifier: GPL-2.0
>
> -./in_netns.sh \
> - sh -c 'sysctl -q -w net.mptcp.enabled=1 && \
> - sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && \
> - ./ip_local_port_range'
> +./in_netns.sh sh <(cat <<-EOF
> + sysctl -q -w net.mptcp.enabled=1
> + sysctl -q -w net.ipv4.ip_local_port_range="40000 49999"
> + ip -6 addr add dev lo 2001:db8::1/32 nodad
> + ip -6 addr add dev lo 2001:db8::2/32 nodad
> + exec ./ip_local_port_range
Minor nit: it looks like you could simply add the additional statements
to the '-c' argument without changing the used shell.
/P
* Re: [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
2025-07-14 16:03 ` [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios " Jakub Sitnicki
2025-07-17 9:27 ` Paolo Abeni
@ 2025-07-17 9:34 ` Paolo Abeni
2025-07-17 9:50 ` Jakub Sitnicki
From: Paolo Abeni @ 2025-07-17 9:34 UTC (permalink / raw)
To: Jakub Sitnicki, Eric Dumazet, David S. Miller, Jakub Kicinski,
Neal Cardwell, Kuniyuki Iwashima
Cc: netdev, kernel-team
On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
> diff --git a/tools/testing/selftests/net/ip_local_port_range.c b/tools/testing/selftests/net/ip_local_port_range.c
> index 29451d2244b7..d5ff64c14132 100644
> --- a/tools/testing/selftests/net/ip_local_port_range.c
> +++ b/tools/testing/selftests/net/ip_local_port_range.c
> @@ -9,6 +9,7 @@
>
> #include <fcntl.h>
> #include <netinet/ip.h>
> +#include <arpa/inet.h>
>
> #include "../kselftest_harness.h"
>
> @@ -20,6 +21,15 @@
> #define IPPROTO_MPTCP 262
> #endif
>
> +static const int ONE = 1;
> +
> +__attribute__((nonnull)) static inline void close_fd(int *fd)
Please no inline functions in c files.
> +{
> + close(*fd);
> +}
> +
> +#define __close_fd __attribute__((cleanup(close_fd)))
I almost missed this. IMHO it's a little overkill and the macro
definition fouls the static checker:
WARNING: Missing a blank line after declarations
#181: FILE: tools/testing/selftests/net/ip_local_port_range.c:588:
+ struct sockaddr_inet addr;
+ __close_fd int ln = -1;
You could either use the fixture teardown, or simply close the fds at
test end, ignoring the error paths (fds will be closed at exit time).
/P
* Re: [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
2025-07-17 9:23 ` Paolo Abeni
@ 2025-07-17 9:44 ` Jakub Sitnicki
2025-07-17 11:21 ` Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-17 9:44 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Kuniyuki Iwashima, netdev, kernel-team, Lee Valentine
On Thu, Jul 17, 2025 at 11:23 AM +02, Paolo Abeni wrote:
> On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
>> Solution
>> --------
>>
>> If there is no IP address conflict with any socket bound to a given local
>> port, then from the protocol's perspective, the port can be safely shared.
>>
>> With that in mind, modify the port search during connect(), that is
>> __inet_hash_connect, to consider all bind buckets (ports) when looking for
>> a local port for egress.
>>
>> To achieve this, add an extra walk over bhash2 buckets for the port to
>> check for IP conflicts. The additional walk is not free, so perform it only
>> once per port - during the second phase of conflict checking, when the
>> bhash bucket is locked.
>>
>> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
>> option is set. The rationale is that users are likely to care about using
>> every possible local port only when they have deliberately constrained the
>> ephemeral port range.
>
> I'm not a big fan of piggybacking additional semantic on existing
> socketopt, have you considered a new one?
That's a fair point. Though a dedicated sysctl seems more appropriate in
this case. Akin to how we have ip_autobind_reuse to enable a more
aggressive port sharing strategy on the bind() side. How does that sound?
>
> At very least you will need to update the man page.
>
>
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index d3ce6d0a514e..9d8a9c7c8274 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -1005,6 +1005,52 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
>> #define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
>> static u32 *table_perturb;
>>
>> +/* True on source address conflict with another socket. False otherwise. */
>> +static inline bool check_bind2_bucket(const struct sock *sk,
>> + const struct inet_bind2_bucket *tb2)
>
> Please no inline function in c files.
Will fix.
Thanks for feedback.
* Re: [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
2025-07-17 9:27 ` Paolo Abeni
@ 2025-07-17 9:44 ` Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-17 9:44 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Kuniyuki Iwashima, netdev, kernel-team
On Thu, Jul 17, 2025 at 11:27 AM +02, Paolo Abeni wrote:
> On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
>> diff --git a/tools/testing/selftests/net/ip_local_port_range.sh b/tools/testing/selftests/net/ip_local_port_range.sh
>> index 4ff746db1256..3fc151545b2d 100755
>> --- a/tools/testing/selftests/net/ip_local_port_range.sh
>> +++ b/tools/testing/selftests/net/ip_local_port_range.sh
>> @@ -1,7 +1,11 @@
>> -#!/bin/sh
>> +#!/bin/bash
>> # SPDX-License-Identifier: GPL-2.0
>>
>> -./in_netns.sh \
>> - sh -c 'sysctl -q -w net.mptcp.enabled=1 && \
>> - sysctl -q -w net.ipv4.ip_local_port_range="40000 49999" && \
>> - ./ip_local_port_range'
>> +./in_netns.sh sh <(cat <<-EOF
>> + sysctl -q -w net.mptcp.enabled=1
>> + sysctl -q -w net.ipv4.ip_local_port_range="40000 49999"
>> + ip -6 addr add dev lo 2001:db8::1/32 nodad
>> + ip -6 addr add dev lo 2001:db8::2/32 nodad
>> + exec ./ip_local_port_range
>
> Minor nit: it looks like you could simply add the additional statements
> to the '-c' argument without changing the used shell.
I might have gone overboard here. Will revert.
* Re: [PATCH net-next v3 3/3] selftests/net: Cover port sharing scenarios with IP_LOCAL_PORT_RANGE
2025-07-17 9:34 ` Paolo Abeni
@ 2025-07-17 9:50 ` Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-17 9:50 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Kuniyuki Iwashima, netdev, kernel-team
On Thu, Jul 17, 2025 at 11:34 AM +02, Paolo Abeni wrote:
> On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
>> diff --git a/tools/testing/selftests/net/ip_local_port_range.c b/tools/testing/selftests/net/ip_local_port_range.c
>> index 29451d2244b7..d5ff64c14132 100644
>> --- a/tools/testing/selftests/net/ip_local_port_range.c
>> +++ b/tools/testing/selftests/net/ip_local_port_range.c
>> @@ -9,6 +9,7 @@
>>
>> #include <fcntl.h>
>> #include <netinet/ip.h>
>> +#include <arpa/inet.h>
>>
>> #include "../kselftest_harness.h"
>>
>> @@ -20,6 +21,15 @@
>> #define IPPROTO_MPTCP 262
>> #endif
>>
>> +static const int ONE = 1;
>> +
>> +__attribute__((nonnull)) static inline void close_fd(int *fd)
>
> Please no inline functions in c files.
>
>> +{
>> + close(*fd);
>> +}
>> +
>> +#define __close_fd __attribute__((cleanup(close_fd)))
>
> I almost missed this. IMHO it's a little overkill and the macro
> definition foul the static checker:
>
> WARNING: Missing a blank line after declarations
> #181: FILE: tools/testing/selftests/net/ip_local_port_range.c:588:
> + struct sockaddr_inet addr;
> + __close_fd int ln = -1;
>
> You could either use the fixture teardown, or simply close the fds at
> test end, ignoring the error paths (fds will be closed at exit time).
Not a problem. Will switch to managing FDs the classic way.
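That is, roughly something like the below (sketch with a hypothetical test
name; fds leaked on an assertion failure simply get closed at process exit):

  TEST(tcp_port_reuse__example)
  {
          int ln, c;

          ln = socket(AF_INET, SOCK_STREAM, 0);
          ASSERT_GE(ln, 0) TH_LOG("socket");

          c = socket(AF_INET, SOCK_STREAM, 0);
          ASSERT_GE(c, 0) TH_LOG("socket");

          /* ... test body ... */

          EXPECT_EQ(close(c), 0) TH_LOG("close");
          EXPECT_EQ(close(ln), 0) TH_LOG("close");
  }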
* Re: [PATCH net-next v3 2/3] tcp: Consider every port when connecting with IP_LOCAL_PORT_RANGE
2025-07-17 9:44 ` Jakub Sitnicki
@ 2025-07-17 11:21 ` Jakub Sitnicki
From: Jakub Sitnicki @ 2025-07-17 11:21 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Neal Cardwell,
Kuniyuki Iwashima, netdev, kernel-team, Lee Valentine
On Thu, Jul 17, 2025 at 11:44 AM +02, Jakub Sitnicki wrote:
> On Thu, Jul 17, 2025 at 11:23 AM +02, Paolo Abeni wrote:
>> On 7/14/25 6:03 PM, Jakub Sitnicki wrote:
>>> Solution
>>> --------
>>>
>>> If there is no IP address conflict with any socket bound to a given local
>>> port, then from the protocol's perspective, the port can be safely shared.
>>>
>>> With that in mind, modify the port search during connect(), that is
>>> __inet_hash_connect, to consider all bind buckets (ports) when looking for
>>> a local port for egress.
>>>
>>> To achieve this, add an extra walk over bhash2 buckets for the port to
>>> check for IP conflicts. The additional walk is not free, so perform it only
>>> once per port - during the second phase of conflict checking, when the
>>> bhash bucket is locked.
>>>
>>> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
>>> option is set. The rationale is that users are likely to care about using
>>> every possible local port only when they have deliberately constrained the
>>> ephemeral port range.
>>
>> I'm not a big fan of piggybacking additional semantic on existing
>> socketopt, have you considered a new one?
>
> That's a fair point. Though a dedicated sysctl seems more appropriate in
> this case. Akin to how we have ip_autobind_reuse to enable a more
> aggressive port sharing strategy on the bind() side. How does that sound?
Thinking about this some more - if we're considering a dedicated sysctl
guard for this, perhaps this merits giving a shot to the more
comprehensive fix first.
That is to update the inet_bind_bucket state (fastreuse, fastreuseport)
on socket unbind to reflect the change in bucket owners. IOW, pivot to
one of the alternatives that I've highlighted:
| Alternatives
| ------------
|
| * Update bind bucket state on port release
|
| A valid solution to the described problem would also be to walk the bind
| bucket owners when releasing the port and recalculate the
| tb->{reuse,reuseport} state.
|
| However, in comparison to the proposed solution, this alone would not allow
| sharing the local port with other sockets bound to non-conflicting IPs for
| as long as they exist.
|
| Another downside is that we'd pay the extra cost on each unbind (more
| frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
| set (less frequent). Due to that we would also likely need to guard it
| behind a sysctl (see below).
Right now the inet_bind_bucket fastreuse{,port} state is being
mismanaged, IMO. This would be the fix for the actual root cause here.
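Conceptually, something along these lines, invoked when a socket releases the
port (untested sketch; the helper name is made up and the in-tree
fastreuse/fastreuseport rules are more involved than a plain walk over the
owners):

  static void inet_bind_bucket_recalc_fastreuse(struct inet_bind_bucket *tb)
  {
          struct inet_bind2_bucket *tb2;
          struct sock *sk;
          signed char reuse = 1, reuseport = 1;

          /* Caller holds the bhash/bhash2 bucket locks. */
          hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
                  sk_for_each_bound(sk, &tb2->owners) {
                          if (!sk->sk_reuse || sk->sk_state == TCP_LISTEN)
                                  reuse = 0;
                          if (!sk->sk_reuseport)
                                  reuseport = 0;
                  }
          }
          tb->fastreuse = reuse;
          tb->fastreuseport = reuseport;
  }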