linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next] net: ip: lookup the best matched listen socket
@ 2025-07-31 12:33 Menglong Dong
  2025-07-31 13:01 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Menglong Dong @ 2025-07-31 12:33 UTC (permalink / raw)
  To: edumazet
  Cc: ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms, netdev,
	linux-kernel

For now, the socket lookup will terminate if the socket is reuse port in
inet_lhash2_lookup(), which makes the socket is not the best match.

For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
but socket1 bind on "eth0". We create socket1 first, and then socket2.
Then, all connections will goto socket2, which is not expected, as socket1
has higher priority.

This can cause unexpected behavior if TCP MD5 keys is used, as described
in Documentation/networking/vrf.rst -> Applications.

Therefor, we lookup the best matched socket first, and then do the reuse
port logic. This can increase some overhead if there are many reuse port
socket :/

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 net/ipv4/inet_hashtables.c  | 13 +++++++------
 net/ipv6/inet6_hashtables.c | 13 +++++++------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..51751337f394 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -389,17 +389,18 @@ static struct sock *inet_lhash2_lookup(const struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = inet_lookup_reuseport(net, sk, skb, doff,
-						       saddr, sport, daddr, hnum, inet_ehashfn);
-			if (result)
-				return result;
-
 			result = sk;
 			hiscore = score;
 		}
 	}
 
-	return result;
+	if (!result)
+		return NULL;
+
+	sk = inet_lookup_reuseport(net, result, skb, doff,
+				   saddr, sport, daddr, hnum, inet_ehashfn);
+
+	return sk ? sk : result;
 }
 
 struct sock *inet_lookup_run_sk_lookup(const struct net *net,
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 76ee521189eb..2554f26d6c20 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -161,17 +161,18 @@ static struct sock *inet6_lhash2_lookup(const struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
 		score = compute_score(sk, net, hnum, daddr, dif, sdif);
 		if (score > hiscore) {
-			result = inet6_lookup_reuseport(net, sk, skb, doff,
-							saddr, sport, daddr, hnum, inet6_ehashfn);
-			if (result)
-				return result;
-
 			result = sk;
 			hiscore = score;
 		}
 	}
 
-	return result;
+	if (!result)
+		return NULL;
+
+	sk = inet6_lookup_reuseport(net, result, skb, doff,
+				    saddr, sport, daddr, hnum, inet6_ehashfn);
+
+	return sk ? sk : result;
 }
 
 struct sock *inet6_lookup_run_sk_lookup(const struct net *net,
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-07-31 12:33 [PATCH net-next] net: ip: lookup the best matched listen socket Menglong Dong
@ 2025-07-31 13:01 ` Eric Dumazet
  2025-07-31 17:52   ` Kuniyuki Iwashima
  2025-08-01  1:34   ` Menglong Dong
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Dumazet @ 2025-07-31 13:01 UTC (permalink / raw)
  To: Menglong Dong
  Cc: ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms, netdev,
	linux-kernel, Martin KaFai Lau

On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> For now, the socket lookup will terminate if the socket is reuse port in
> inet_lhash2_lookup(), which makes the socket is not the best match.
>
> For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> but socket1 bind on "eth0". We create socket1 first, and then socket2.
> Then, all connections will goto socket2, which is not expected, as socket1
> has higher priority.
>
> This can cause unexpected behavior if TCP MD5 keys is used, as described
> in Documentation/networking/vrf.rst -> Applications.
>
> Therefor, we lookup the best matched socket first, and then do the reuse
> port logic. This can increase some overhead if there are many reuse port
> socket :/
>
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>

I do not think net-next is open yet ?

It seems this would be net material.

Any way you could provide a test ?

Please CC Martin KaFai Lau <kafai@fb.com>, as this was added in :

commit 61b7c691c7317529375f90f0a81a331990b1ec1b
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Fri Dec 1 12:52:31 2017 -0800

    inet: Add a 2nd listener hashtable (port+addr)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-07-31 13:01 ` Eric Dumazet
@ 2025-07-31 17:52   ` Kuniyuki Iwashima
  2025-08-01  1:31     ` Menglong Dong
  2025-08-01  1:34   ` Menglong Dong
  1 sibling, 1 reply; 7+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-31 17:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Menglong Dong, ncardwell, davem, dsahern, kuba, pabeni, horms,
	netdev, linux-kernel, Martin KaFai Lau, Craig Gallek

On Thu, Jul 31, 2025 at 6:01 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > For now, the socket lookup will terminate if the socket is reuse port in
> > inet_lhash2_lookup(), which makes the socket is not the best match.
> >
> > For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> > but socket1 bind on "eth0". We create socket1 first, and then socket2.
> > Then, all connections will goto socket2, which is not expected, as socket1
> > has higher priority.
> >
> > This can cause unexpected behavior if TCP MD5 keys is used, as described
> > in Documentation/networking/vrf.rst -> Applications.
> >
> > Therefor, we lookup the best matched socket first, and then do the reuse
> > port logic. This can increase some overhead if there are many reuse port
> > socket :/

This kills O(1) lookup for reuseport...

Another option would be to try hard in __inet_hash() to sort
reuseport groups.


> >
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
>
> I do not think net-next is open yet ?
>
> It seems this would be net material.
>
> Any way you could provide a test ?

Probably it will look like below and make sure we get
the opposite result:

# python3
>>> from socket import *
>>>
>>> s1 = socket()
>>> s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
>>> s1.bind(('localhost', 8000))
>>> s1.setsockopt(SOL_SOCKET, SO_BINDTODEVICE, b'lo')
>>> s1.listen()
>>>
>>> s2 = socket()
>>> s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
>>> s2.bind(('localhost', 8000))
>>> s2.listen()
>>>
>>> cs = []
>>> for i in range(3):
...     c = socket()
...     c.connect(('localhost', 8000))
...     cs.append(c)
...
>>> s1.setblocking(False)
>>> s1.accept()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/socket.py", line 295, in accept
    fd, addr = self._accept()
               ^^^^^^^^^^^^^^
BlockingIOError: [Errno 11] Resource temporarily unavailable
>>>
>>> s2.accept()
(<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1',
8000), raddr=('127.0.0.1', 44580)>, ('127.0.0.1', 44580))
>>> s2.accept()
(<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1',
8000), raddr=('127.0.0.1', 44584)>, ('127.0.0.1', 44584))
>>> s2.accept()
(<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1',
8000), raddr=('127.0.0.1', 44588)>, ('127.0.0.1', 44588))



>
> Please CC Martin KaFai Lau <kafai@fb.com>, as this was added in :
>
> commit 61b7c691c7317529375f90f0a81a331990b1ec1b
> Author: Martin KaFai Lau <kafai@fb.com>
> Date:   Fri Dec 1 12:52:31 2017 -0800
>
>     inet: Add a 2nd listener hashtable (port+addr)

I think this issue exists from day 1 of reuseport support

commit c125e80b88687b25b321795457309eaaee4bf270
Author: Craig Gallek <kraig@google.com>
Date:   Wed Feb 10 16:50:40 2016

    soreuseport: fast reuseport TCP socket selection

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-07-31 17:52   ` Kuniyuki Iwashima
@ 2025-08-01  1:31     ` Menglong Dong
  2025-08-01  4:06       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 7+ messages in thread
From: Menglong Dong @ 2025-08-01  1:31 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Eric Dumazet, ncardwell, davem, dsahern, kuba, pabeni, horms,
	netdev, linux-kernel, Martin KaFai Lau, Craig Gallek

On Fri, Aug 1, 2025 at 1:52 AM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:01 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > For now, the socket lookup will terminate if the socket is reuse port in
> > > inet_lhash2_lookup(), which makes the socket is not the best match.
> > >
> > > For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> > > but socket1 bind on "eth0". We create socket1 first, and then socket2.
> > > Then, all connections will goto socket2, which is not expected, as socket1
> > > has higher priority.
> > >
> > > This can cause unexpected behavior if TCP MD5 keys is used, as described
> > > in Documentation/networking/vrf.rst -> Applications.
> > >
> > > Therefor, we lookup the best matched socket first, and then do the reuse
> > > port logic. This can increase some overhead if there are many reuse port
> > > socket :/
>
> This kills O(1) lookup for reuseport...
>
> Another option would be to try hard in __inet_hash() to sort
> reuseport groups.

Good idea. For the reuse port case, we can compute a score
for the reuseport sockets and insert the high score to front of
the list. I'll have a try this way.

>
>
> > >
> > > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> >
> > I do not think net-next is open yet ?
> >
> > It seems this would be net material.
> >
> > Any way you could provide a test ?
>
> Probably it will look like below and make sure we get
> the opposite result:
>
> # python3
> >>> from socket import *
> >>>
> >>> s1 = socket()
> >>> s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
> >>> s1.bind(('localhost', 8000))
> >>> s1.setsockopt(SOL_SOCKET, SO_BINDTODEVICE, b'lo')
> >>> s1.listen()
> >>>
> >>> s2 = socket()
> >>> s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
> >>> s2.bind(('localhost', 8000))
> >>> s2.listen()
> >>>
> >>> cs = []
> >>> for i in range(3):
> ...     c = socket()
> ...     c.connect(('localhost', 8000))
> ...     cs.append(c)
> ...
> >>> s1.setblocking(False)
> >>> s1.accept()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python3.12/socket.py", line 295, in accept
>     fd, addr = self._accept()
>                ^^^^^^^^^^^^^^
> BlockingIOError: [Errno 11] Resource temporarily unavailable
> >>>
> >>> s2.accept()
> (<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1',
> 8000), raddr=('127.0.0.1', 44580)>, ('127.0.0.1', 44580))
> >>> s2.accept()
> (<socket.socket fd=16, family=2, type=1, proto=0, laddr=('127.0.0.1',
> 8000), raddr=('127.0.0.1', 44584)>, ('127.0.0.1', 44584))
> >>> s2.accept()
> (<socket.socket fd=15, family=2, type=1, proto=0, laddr=('127.0.0.1',
> 8000), raddr=('127.0.0.1', 44588)>, ('127.0.0.1', 44588))

I have a C test case, but this test case is good enough.

>
>
>
> >
> > Please CC Martin KaFai Lau <kafai@fb.com>, as this was added in :
> >
> > commit 61b7c691c7317529375f90f0a81a331990b1ec1b
> > Author: Martin KaFai Lau <kafai@fb.com>
> > Date:   Fri Dec 1 12:52:31 2017 -0800
> >
> >     inet: Add a 2nd listener hashtable (port+addr)
>
> I think this issue exists from day 1 of reuseport support

Yeah, it seems that it exists from the beginning of the reuseport
support.

>
> commit c125e80b88687b25b321795457309eaaee4bf270
> Author: Craig Gallek <kraig@google.com>
> Date:   Wed Feb 10 16:50:40 2016
>
>     soreuseport: fast reuseport TCP socket selection

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-07-31 13:01 ` Eric Dumazet
  2025-07-31 17:52   ` Kuniyuki Iwashima
@ 2025-08-01  1:34   ` Menglong Dong
  1 sibling, 0 replies; 7+ messages in thread
From: Menglong Dong @ 2025-08-01  1:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms, netdev,
	linux-kernel, Martin KaFai Lau

On Thu, Jul 31, 2025 at 9:01 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > For now, the socket lookup will terminate if the socket is reuse port in
> > inet_lhash2_lookup(), which makes the socket is not the best match.
> >
> > For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> > but socket1 bind on "eth0". We create socket1 first, and then socket2.
> > Then, all connections will goto socket2, which is not expected, as socket1
> > has higher priority.
> >
> > This can cause unexpected behavior if TCP MD5 keys is used, as described
> > in Documentation/networking/vrf.rst -> Applications.
> >
> > Therefor, we lookup the best matched socket first, and then do the reuse
> > port logic. This can increase some overhead if there are many reuse port
> > socket :/
> >
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
>
> I do not think net-next is open yet ?

Yeah, net-next is closed, which I just realized :/

>
> It seems this would be net material.

Ok, I'll send the V2 to the net.

Thanks!
Menglong Dong

>
> Any way you could provide a test ?
>
> Please CC Martin KaFai Lau <kafai@fb.com>, as this was added in :
>
> commit 61b7c691c7317529375f90f0a81a331990b1ec1b
> Author: Martin KaFai Lau <kafai@fb.com>
> Date:   Fri Dec 1 12:52:31 2017 -0800
>
>     inet: Add a 2nd listener hashtable (port+addr)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-08-01  1:31     ` Menglong Dong
@ 2025-08-01  4:06       ` Kuniyuki Iwashima
  2025-08-01  7:33         ` Menglong Dong
  0 siblings, 1 reply; 7+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-01  4:06 UTC (permalink / raw)
  To: menglong8.dong
  Cc: davem, dsahern, edumazet, horms, kafai, kraig, kuba, kuniyu,
	linux-kernel, ncardwell, netdev, pabeni

From: Menglong Dong <menglong8.dong@gmail.com>
Date: Fri, 1 Aug 2025 09:31:43 +0800
> On Fri, Aug 1, 2025 at 1:52 AM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > On Thu, Jul 31, 2025 at 6:01 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > For now, the socket lookup will terminate if the socket is reuse port in
> > > > inet_lhash2_lookup(), which makes the socket is not the best match.
> > > >
> > > > For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> > > > but socket1 bind on "eth0". We create socket1 first, and then socket2.
> > > > Then, all connections will goto socket2, which is not expected, as socket1
> > > > has higher priority.
> > > >
> > > > This can cause unexpected behavior if TCP MD5 keys is used, as described
> > > > in Documentation/networking/vrf.rst -> Applications.
> > > >
> > > > Therefor, we lookup the best matched socket first, and then do the reuse
> > > > port logic. This can increase some overhead if there are many reuse port
> > > > socket :/
> >
> > This kills O(1) lookup for reuseport...
> >
> > Another option would be to try hard in __inet_hash() to sort
> > reuseport groups.
> 
> Good idea. For the reuse port case, we can compute a score
> for the reuseport sockets and insert the high score to front of
> the list. I'll have a try this way.

I remember you reported the same issue in Feburary and
I hacked up a patch below based on a draft diff in my stash.

This fixes the issue, but we still have corner cases where
SO_REUSEPORT/SO_BINDTODEVICE are changed after listen(),
which should not be allowed given TCP does not have ->rehash()
and confuses/breaks the reuseport logic/rule.

---8<---
diff --git a/include/net/sock.h b/include/net/sock.h
index c8a4b283df6f..8436e352732f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -885,6 +885,18 @@ static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nu
 	hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
 }
 
+static inline void __sk_nulls_insert_after_node_rcu(struct sock *sk,
+						    struct hlist_nulls_node *prev)
+{
+	struct hlist_nulls_node *n = &sk->sk_nulls_node;
+
+	n->next = prev->next;
+	n->pprev = &prev->next;
+	if (!is_a_nulls(n->next))
+		WRITE_ONCE(n->next->pprev, &n->next);
+	rcu_assign_pointer(hlist_nulls_next_rcu(prev), n);
+}
+
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
 	sock_hold(sk);
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 6e4faf3ee76f..4a3e9d6887a6 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -23,12 +23,14 @@ struct sock_reuseport {
 	unsigned int		synq_overflow_ts;
 	/* ID stays the same even after the size of socks[] grows. */
 	unsigned int		reuseport_id;
-	unsigned int		bind_inany:1;
-	unsigned int		has_conns:1;
+	unsigned short		bind_inany:1;
+	unsigned short		has_conns:1;
+	unsigned short		score;
 	struct bpf_prog __rcu	*prog;		/* optional BPF sock selector */
 	struct sock		*socks[] __counted_by(max_socks);
 };
 
+unsigned short reuseport_compute_score(struct sock *sk, bool bind_inany);
 extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 4211710393a8..df3d1a6f3178 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -6,6 +6,7 @@
  * selecting the socket index from the array of available sockets.
  */
 
+#include <net/addrconf.h>
 #include <net/ip.h>
 #include <net/sock_reuseport.h>
 #include <linux/bpf.h>
@@ -185,6 +186,25 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
 	return reuse;
 }
 
+unsigned short reuseport_compute_score(struct sock *sk, bool bind_inany)
+{
+	unsigned short score = 0;
+
+	if (sk->sk_family == AF_INET)
+		score += 10;
+
+	if (ipv6_only_sock(sk))
+		score++;
+
+	if (!bind_inany)
+		score++;
+
+	if (sk->sk_bound_dev_if)
+		score++;
+
+	return score;
+}
+
 int reuseport_alloc(struct sock *sk, bool bind_inany)
 {
 	struct sock_reuseport *reuse;
@@ -233,6 +253,7 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
 	reuse->bind_inany = bind_inany;
 	reuse->socks[0] = sk;
 	reuse->num_socks = 1;
+	reuse->score = reuseport_compute_score(sk, bind_inany);
 	reuseport_get_incoming_cpu(sk, reuse);
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
 
@@ -278,6 +299,7 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
 	more_reuse->bind_inany = reuse->bind_inany;
 	more_reuse->has_conns = reuse->has_conns;
 	more_reuse->incoming_cpu = reuse->incoming_cpu;
+	more_reuse->score = reuse->score;
 
 	memcpy(more_reuse->socks, reuse->socks,
 	       reuse->num_socks * sizeof(struct sock *));
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..042a65372d00 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -739,6 +739,44 @@ static int inet_reuseport_add_sock(struct sock *sk,
 	return reuseport_alloc(sk, inet_rcv_saddr_any(sk));
 }
 
+static void inet_reuseport_hash_sock(struct sock *sk,
+				     struct inet_listen_hashbucket *ilb2)
+{
+	struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
+	const struct hlist_nulls_node *node;
+	struct sock *sk2, *sk_anchor = NULL;
+	unsigned short score, hiscore;
+	struct sock_reuseport *reuse;
+
+	reuse = rcu_dereference_protected(sk->sk_reuseport_cb, 1);
+	score = reuse->score;
+
+	sk_nulls_for_each_rcu(sk2, node, &ilb2->nulls_head) {
+		if (!sk2->sk_reuseport)
+			continue;
+
+		if (inet_csk(sk2)->icsk_bind_hash != tb)
+			continue;
+
+		reuse = rcu_dereference_protected(sk2->sk_reuseport_cb, 1);
+		if (likely(reuse))
+			hiscore = reuse->score;
+		else
+			hiscore = reuseport_compute_score(sk2,
+							  inet_rcv_saddr_any(sk2));
+
+		if (hiscore <= score)
+			break;
+
+		sk_anchor = sk2;
+	}
+
+	if (sk_anchor)
+		__sk_nulls_insert_after_node_rcu(sk, &sk_anchor->sk_nulls_node);
+	else
+		__sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
+}
+
 int __inet_hash(struct sock *sk, struct sock *osk)
 {
 	struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
@@ -759,13 +797,14 @@ int __inet_hash(struct sock *sk, struct sock *osk)
 		err = inet_reuseport_add_sock(sk, ilb2);
 		if (err)
 			goto unlock;
-	}
-	sock_set_flag(sk, SOCK_RCU_FREE);
-	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
-		sk->sk_family == AF_INET6)
-		__sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
-	else
+
+		sock_set_flag(sk, SOCK_RCU_FREE);
+		inet_reuseport_hash_sock(sk, ilb2);
+	} else {
+		sock_set_flag(sk, SOCK_RCU_FREE);
 		__sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
+	}
+
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
 	spin_unlock(&ilb2->lock);
---8<---


Tested:

---8<---
[root@fedora ~]# cat a.py
from socket import *

s1 = socket()
s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
s1.setsockopt(SOL_SOCKET, SO_BINDTODEVICE, b'lo')
s1.listen()
s1.setblocking(False)

s2 = socket()
s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
s2.bind(s1.getsockname())
s2.listen()
s2.setblocking(False)

for i in range(3):
    c = socket()
    c.connect(s1.getsockname())
    try:
        print("assigned properly:", s1.accept())
    except:
        print("wrong assignment")
[root@fedora ~]# python3 a.py
assigned properly: (<socket.socket fd=6, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39478)>, ('127.0.0.1', 39478))
assigned properly: (<socket.socket fd=5, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39490)>, ('127.0.0.1', 39490))
assigned properly: (<socket.socket fd=6, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39506)>, ('127.0.0.1', 39506))
---8<---

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next] net: ip: lookup the best matched listen socket
  2025-08-01  4:06       ` Kuniyuki Iwashima
@ 2025-08-01  7:33         ` Menglong Dong
  0 siblings, 0 replies; 7+ messages in thread
From: Menglong Dong @ 2025-08-01  7:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, horms, kafai, kraig, kuba, linux-kernel,
	ncardwell, netdev, pabeni

On Fri, Aug 1, 2025 at 12:07 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> From: Menglong Dong <menglong8.dong@gmail.com>
> Date: Fri, 1 Aug 2025 09:31:43 +0800
> > On Fri, Aug 1, 2025 at 1:52 AM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > >
> > > On Thu, Jul 31, 2025 at 6:01 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Jul 31, 2025 at 5:33 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > >
> > > > > For now, the socket lookup will terminate if the socket is reuse port in
> > > > > inet_lhash2_lookup(), which makes the socket is not the best match.
> > > > >
> > > > > For example, we have socket1 and socket2 both listen on "0.0.0.0:1234",
> > > > > but socket1 bind on "eth0". We create socket1 first, and then socket2.
> > > > > Then, all connections will goto socket2, which is not expected, as socket1
> > > > > has higher priority.
> > > > >
> > > > > This can cause unexpected behavior if TCP MD5 keys is used, as described
> > > > > in Documentation/networking/vrf.rst -> Applications.
> > > > >
> > > > > Therefor, we lookup the best matched socket first, and then do the reuse
> > > > > port logic. This can increase some overhead if there are many reuse port
> > > > > socket :/
> > >
> > > This kills O(1) lookup for reuseport...
> > >
> > > Another option would be to try hard in __inet_hash() to sort
> > > reuseport groups.
> >
> > Good idea. For the reuse port case, we can compute a score
> > for the reuseport sockets and insert the high score to front of
> > the list. I'll have a try this way.
>
> I remember you reported the same issue in Feburary and
> I hacked up a patch below based on a draft diff in my stash.

Yeah, I reported before in this link:
https://lore.kernel.org/netdev/20250227123137.138778-1-dongml2@chinatelecom.cn/

And I made a wrong commit log in that patch, which made the
patch you hacked up didn't solve the problem :/

>
> This fixes the issue, but we still have corner cases where
> SO_REUSEPORT/SO_BINDTODEVICE are changed after listen(),
> which should not be allowed given TCP does not have ->rehash()
> and confuses/breaks the reuseport logic/rule.

Do we need to save the score? Which will make the logic
more complex :/

I just wrote a similar patch, which should work too:

---------------------------------------patch
begin---------------------------------
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 89186c499dd4..da500f4ae142 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct
hlist_nulls_node *n)
 #define hlist_nulls_next_rcu(node) \
     (*((struct hlist_nulls_node __rcu __force **)&(node)->next))

+/**
+ * hlist_nulls_pprev_rcu - returns the element of the list after @node.
+ * @node: element of the list.
+ */
+#define hlist_nulls_pprev_rcu(node) \
+    (*((struct hlist_nulls_node __rcu __force **)&(node)->pprev))
+
 /**
  * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
  * @n: the element to delete from the hash list.
@@ -145,6 +152,33 @@ static inline void
hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
     }
 }

+/**
+ * hlist_nulls_add_before_rcu
+ * @n: the new element to add to the hash list.
+ * @next: the existing element to add the new element before.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist
+ * before the specified node while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.
+ */
+static inline void hlist_nulls_add_before_rcu(struct hlist_nulls_node *n,
+                          struct hlist_nulls_node *next)
+{
+    WRITE_ONCE(n->pprev, next->pprev);
+    n->next = next;
+    rcu_assign_pointer(hlist_nulls_pprev_rcu(n), n);
+    WRITE_ONCE(next->pprev, &n->next);
+}
+
 /* after that hlist_nulls_del will work */
 static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index c8a4b283df6f..42aa1919eeee 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -885,6 +885,11 @@ static inline void
__sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nu
     hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
 }

+static inline void __sk_nulls_add_node_before_rcu(struct sock *sk,
struct sock *next)
+{
+    hlist_nulls_add_before_rcu(&sk->sk_nulls_node, &next->sk_nulls_node);
+}
+
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct
hlist_nulls_head *list)
 {
     sock_hold(sk);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..53a72a8b6094 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -334,6 +334,21 @@ static inline int compute_score(struct sock *sk,
const struct net *net,
     return score;
 }

+static inline int compute_reuseport_score(struct sock *sk)
+{
+    int score = 0;
+
+    if (sk->sk_bound_dev_if)
+        score++;
+
+    if (sk->sk_family == PF_INET)
+        score += 5;
+    if (READ_ONCE(sk->sk_incoming_cpu))
+        score++;
+
+    return score;
+}
+
 /**
  * inet_lookup_reuseport() - execute reuseport logic on AF_INET
socket if necessary.
  * @net: network namespace.
@@ -739,6 +754,27 @@ static int inet_reuseport_add_sock(struct sock *sk,
     return reuseport_alloc(sk, inet_rcv_saddr_any(sk));
 }

+static void inet_hash_reuseport(struct sock *sk, struct hlist_nulls_head *head)
+{
+    const struct hlist_nulls_node *node;
+    int score, curscore;
+    struct sock *sk2;
+
+    curscore = compute_reuseport_score(sk);
+    /* lookup the socket to insert before */
+    sk_nulls_for_each_rcu(sk2, node, head) {
+        if (!sk2->sk_reuseport)
+            continue;
+        score = compute_reuseport_score(sk2);
+        if (score <= curscore) {
+            __sk_nulls_add_node_before_rcu(sk, sk2);
+            return;
+        }
+    }
+
+    __sk_nulls_add_node_tail_rcu(sk, head);
+}
+
 int __inet_hash(struct sock *sk, struct sock *osk)
 {
     struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
@@ -761,11 +797,11 @@ int __inet_hash(struct sock *sk, struct sock *osk)
             goto unlock;
     }
     sock_set_flag(sk, SOCK_RCU_FREE);
-    if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
-        sk->sk_family == AF_INET6)
-        __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
-    else
+    if (!sk->sk_reuseport)
         __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
+    else
+        inet_hash_reuseport(sk, &ilb2->nulls_head);
+
     sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
     spin_unlock(&ilb2->lock);
------------------------------patch end--------------------------------------

>
> ---8<---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c8a4b283df6f..8436e352732f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -885,6 +885,18 @@ static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nu
>         hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
>  }
>
> +static inline void __sk_nulls_insert_after_node_rcu(struct sock *sk,
> +                                                   struct hlist_nulls_node *prev)
> +{
> +       struct hlist_nulls_node *n = &sk->sk_nulls_node;
> +
> +       n->next = prev->next;
> +       n->pprev = &prev->next;
> +       if (!is_a_nulls(n->next))
> +               WRITE_ONCE(n->next->pprev, &n->next);
> +       rcu_assign_pointer(hlist_nulls_next_rcu(prev), n);
> +}
> +
>  static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
>  {
>         sock_hold(sk);
> diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
> index 6e4faf3ee76f..4a3e9d6887a6 100644
> --- a/include/net/sock_reuseport.h
> +++ b/include/net/sock_reuseport.h
> @@ -23,12 +23,14 @@ struct sock_reuseport {
>         unsigned int            synq_overflow_ts;
>         /* ID stays the same even after the size of socks[] grows. */
>         unsigned int            reuseport_id;
> -       unsigned int            bind_inany:1;
> -       unsigned int            has_conns:1;
> +       unsigned short          bind_inany:1;
> +       unsigned short          has_conns:1;
> +       unsigned short          score;
>         struct bpf_prog __rcu   *prog;          /* optional BPF sock selector */
>         struct sock             *socks[] __counted_by(max_socks);
>  };
>
> +unsigned short reuseport_compute_score(struct sock *sk, bool bind_inany);
>  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
>  extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
>                               bool bind_inany);
> diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> index 4211710393a8..df3d1a6f3178 100644
> --- a/net/core/sock_reuseport.c
> +++ b/net/core/sock_reuseport.c
> @@ -6,6 +6,7 @@
>   * selecting the socket index from the array of available sockets.
>   */
>
> +#include <net/addrconf.h>
>  #include <net/ip.h>
>  #include <net/sock_reuseport.h>
>  #include <linux/bpf.h>
> @@ -185,6 +186,25 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks)
>         return reuse;
>  }
>
> +unsigned short reuseport_compute_score(struct sock *sk, bool bind_inany)
> +{
> +       unsigned short score = 0;
> +
> +       if (sk->sk_family == AF_INET)
> +               score += 10;
> +
> +       if (ipv6_only_sock(sk))
> +               score++;
> +
> +       if (!bind_inany)
> +               score++;
> +
> +       if (sk->sk_bound_dev_if)
> +               score++;
> +
> +       return score;
> +}
> +
>  int reuseport_alloc(struct sock *sk, bool bind_inany)
>  {
>         struct sock_reuseport *reuse;
> @@ -233,6 +253,7 @@ int reuseport_alloc(struct sock *sk, bool bind_inany)
>         reuse->bind_inany = bind_inany;
>         reuse->socks[0] = sk;
>         reuse->num_socks = 1;
> +       reuse->score = reuseport_compute_score(sk, bind_inany);
>         reuseport_get_incoming_cpu(sk, reuse);
>         rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
>
> @@ -278,6 +299,7 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
>         more_reuse->bind_inany = reuse->bind_inany;
>         more_reuse->has_conns = reuse->has_conns;
>         more_reuse->incoming_cpu = reuse->incoming_cpu;
> +       more_reuse->score = reuse->score;
>
>         memcpy(more_reuse->socks, reuse->socks,
>                reuse->num_socks * sizeof(struct sock *));
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ceeeec9b7290..042a65372d00 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -739,6 +739,44 @@ static int inet_reuseport_add_sock(struct sock *sk,
>         return reuseport_alloc(sk, inet_rcv_saddr_any(sk));
>  }
>
> +static void inet_reuseport_hash_sock(struct sock *sk,
> +                                    struct inet_listen_hashbucket *ilb2)
> +{
> +       struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
> +       const struct hlist_nulls_node *node;
> +       struct sock *sk2, *sk_anchor = NULL;
> +       unsigned short score, hiscore;
> +       struct sock_reuseport *reuse;
> +
> +       reuse = rcu_dereference_protected(sk->sk_reuseport_cb, 1);
> +       score = reuse->score;
> +
> +       sk_nulls_for_each_rcu(sk2, node, &ilb2->nulls_head) {
> +               if (!sk2->sk_reuseport)
> +                       continue;
> +
> +               if (inet_csk(sk2)->icsk_bind_hash != tb)
> +                       continue;
> +
> +               reuse = rcu_dereference_protected(sk2->sk_reuseport_cb, 1);
> +               if (likely(reuse))
> +                       hiscore = reuse->score;
> +               else
> +                       hiscore = reuseport_compute_score(sk2,
> +                                                         inet_rcv_saddr_any(sk2));
> +
> +               if (hiscore <= score)
> +                       break;
> +
> +               sk_anchor = sk2;
> +       }
> +
> +       if (sk_anchor)
> +               __sk_nulls_insert_after_node_rcu(sk, &sk_anchor->sk_nulls_node);
> +       else
> +               __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> +}
> +
>  int __inet_hash(struct sock *sk, struct sock *osk)
>  {
>         struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
> @@ -759,13 +797,14 @@ int __inet_hash(struct sock *sk, struct sock *osk)
>                 err = inet_reuseport_add_sock(sk, ilb2);
>                 if (err)
>                         goto unlock;
> -       }
> -       sock_set_flag(sk, SOCK_RCU_FREE);
> -       if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> -               sk->sk_family == AF_INET6)
> -               __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
> -       else
> +
> +               sock_set_flag(sk, SOCK_RCU_FREE);
> +               inet_reuseport_hash_sock(sk, ilb2);
> +       } else {
> +               sock_set_flag(sk, SOCK_RCU_FREE);
>                 __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> +       }
> +
>         sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>  unlock:
>         spin_unlock(&ilb2->lock);
> ---8<---
>
>
> Tested:
>
> ---8<---
> [root@fedora ~]# cat a.py
> from socket import *
>
> s1 = socket()
> s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
> s1.setsockopt(SOL_SOCKET, SO_BINDTODEVICE, b'lo')
> s1.listen()
> s1.setblocking(False)
>
> s2 = socket()
> s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
> s2.bind(s1.getsockname())
> s2.listen()
> s2.setblocking(False)
>
> for i in range(3):
>     c = socket()
>     c.connect(s1.getsockname())
>     try:
>         print("assigned properly:", s1.accept())
>     except:
>         print("wrong assignment")
> [root@fedora ~]# python3 a.py
> assigned properly: (<socket.socket fd=6, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39478)>, ('127.0.0.1', 39478))
> assigned properly: (<socket.socket fd=5, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39490)>, ('127.0.0.1', 39490))
> assigned properly: (<socket.socket fd=6, family=2, type=1, proto=0, laddr=('127.0.0.1', 36733), raddr=('127.0.0.1', 39506)>, ('127.0.0.1', 39506))
> ---8<---

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-08-01  7:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-31 12:33 [PATCH net-next] net: ip: lookup the best matched listen socket Menglong Dong
2025-07-31 13:01 ` Eric Dumazet
2025-07-31 17:52   ` Kuniyuki Iwashima
2025-08-01  1:31     ` Menglong Dong
2025-08-01  4:06       ` Kuniyuki Iwashima
2025-08-01  7:33         ` Menglong Dong
2025-08-01  1:34   ` Menglong Dong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).