public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/2] Preserve UDP socket addresses on abort
@ 2026-03-03 17:01 Jordan Rife
  2026-03-03 17:01 ` [PATCH net-next v2 1/2] udp: " Jordan Rife
  2026-03-03 17:01 ` [PATCH net-next v2 2/2] selftests/bpf: Ensure dst addr/port are preserved after socket abort Jordan Rife
  0 siblings, 2 replies; 10+ messages in thread
From: Jordan Rife @ 2026-03-03 17:01 UTC (permalink / raw)
  To: netdev
  Cc: Jordan Rife, bpf, Willem de Bruijn, Eric Dumazet, Daniel Borkmann,
	Martin KaFai Lau, Stanislav Fomichev, Andrii Nakryiko,
	Yusuke Suzuki, Jakub Kicinski

BPF cgroup/sock_release hooks can be useful for performing cleanup,
map maintenance, or other bookkeeping when sockets are released. Cilium
uses cgroup/sock_release hooks to do just that, cleaning up maps keyed
by socket destination addresses. This works fine for TCP and connected
UDP sockets when they're closed gracefully, but Yusuke reported that
this fails for connected UDP when sockets are aborted prior to closing
the socket. Cilium in particular actively destroys/aborts connected UDP
sockets, so this scenario can happen easily.

From Yusuke's testing results in [1]:

1. socket connects to "127.0.0.1:8000"
2. socket aborted - diag_destroy (tcp_abort/udp_abort)
3. close(sock_fd)
4. BPF_CGROUP_RUN_PROG_INET_SOCK_RELEASE(sk) runs my_sock_release()

__section("cgroup/sock_release")
int my_sock_release(struct bpf_sock *ctx)
{
    ...
    /* For TCP, dst_ip4 == "127.0.0.1" but for connected UDP
     * dst_ip4 == 0.
     */
    ...
}

Case  Protocol        IP Family  Type	        Result
1     TCP             IPv4       Regular close  Cleaned up
2     TCP	      IPv4       Abort          Cleaned up
3     TCP	      IPv6       Regular close  Cleaned up
4     TCP             IPv6       Abort          Cleaned up
5     UDP (Connected) IPv4       Regular close  Cleaned up
6     UDP (Connected) IPv4       Abort          Not cleaned up
7     UDP (Connected) IPv6       Regular close  Cleaned up
8     UDP (Connected) IPv6       Abort          Not cleaned up

This patch aims to make the behavior consistent between TCP and
connected UDP by preserving the socket destination address and port
through to the sock_release hook.

Regarding my approach:

Currently, udp_abort() calls __udp_disconnect(), but this seems like
overkill in the case of a socket abort. __udp_disconnect() handles two
special cases:

1. When a socket binds to 0.0.0.0, connects, then disconnects using
   AF_UNSPEC it needs to be rehashed as described in commit 303d0403b8c2
   ("udp: rehash on disconnect").
2. Avoids unhashing the socket in cases where it was bound to a specific
   port.

This makes sense in the case of a graceful disconnect (AF_UNSPEC) where
the socket remains intact and may be used in the future, but it seemed
sufficient in the case of a socket abort to simply set sk_err so that
future operations on that socket fail and to unconditionally unhash the
socket if it is currently hashed, thus avoiding any of the field
manipulation that __udp_disconnect() would otherwise do.

[1]: https://github.com/cilium/cilium/issues/42649

CHANGES
=======

v1 -> v2:
* Set connect_fd back to -1 after calling destroy() in the selftest
  (Jakub).

Jordan Rife (2):
  udp: Preserve UDP socket addresses on abort
  selftests/bpf: Ensure dst addr/port are preserved after socket abort

 net/ipv4/udp.c                                |   4 +-
 .../bpf/prog_tests/sock_destroy_release.c     | 136 ++++++++++++++++++
 .../bpf/progs/sock_destroy_release.c          |  59 ++++++++
 3 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy_release.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_release.c

-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-03 17:01 [PATCH net-next v2 0/2] Preserve UDP socket addresses on abort Jordan Rife
@ 2026-03-03 17:01 ` Jordan Rife
  2026-03-04 21:21   ` Kuniyuki Iwashima
  2026-03-03 17:01 ` [PATCH net-next v2 2/2] selftests/bpf: Ensure dst addr/port are preserved after socket abort Jordan Rife
  1 sibling, 1 reply; 10+ messages in thread
From: Jordan Rife @ 2026-03-03 17:01 UTC (permalink / raw)
  To: netdev
  Cc: Jordan Rife, bpf, Willem de Bruijn, Eric Dumazet, Daniel Borkmann,
	Martin KaFai Lau, Stanislav Fomichev, Andrii Nakryiko,
	Yusuke Suzuki, Jakub Kicinski

Currently, udp_abort() invokes __udp_disconnect() which clears out
socket fields like inet_daddr and inet_dport. This makes fields like
dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
hooks following an abort on a connected UDP socket, since they'll always
equal zero. This differs from the behavior for TCP sockets where a
cgroup/sock_release hook will see the address and port that the socket
was connected to at the time it was aborted. This causes issues in
Cilium where a sock_release hook is used to perform map maintenance when
sockets are released, since Cilium actively destroys connected UDP
sockets in some cases.

Make the behavior consistent between TCP and UDP sockets by not clearing
out these fields in udp_abort(). Instead, just unhash the socket, mark
its state as TCP_CLOSE, and preserve the state of the other fields.

Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
Signed-off-by: Jordan Rife <jrife@google.com>
---
 net/ipv4/udp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b96e47f1c8a2..01a0a0fbcd56 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
 
 	sk->sk_err = err;
 	sk_error_report(sk);
-	__udp_disconnect(sk, 0);
+	sk->sk_state = TCP_CLOSE;
+	sk->sk_prot->unhash(sk);
+	sk_dst_reset(sk);
 
 out:
 	if (!has_current_bpf_ctx())
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH net-next v2 2/2] selftests/bpf: Ensure dst addr/port are preserved after socket abort
  2026-03-03 17:01 [PATCH net-next v2 0/2] Preserve UDP socket addresses on abort Jordan Rife
  2026-03-03 17:01 ` [PATCH net-next v2 1/2] udp: " Jordan Rife
@ 2026-03-03 17:01 ` Jordan Rife
  1 sibling, 0 replies; 10+ messages in thread
From: Jordan Rife @ 2026-03-03 17:01 UTC (permalink / raw)
  To: netdev
  Cc: Jordan Rife, bpf, Willem de Bruijn, Eric Dumazet, Daniel Borkmann,
	Martin KaFai Lau, Stanislav Fomichev, Andrii Nakryiko,
	Yusuke Suzuki, Jakub Kicinski

Ensure that sock_release hooks see the original dst_ip4, dst_ip6, and
dst_port values for connected UDP and TCP sockets following a socket
abort.

Signed-off-by: Jordan Rife <jrife@google.com>
---
 .../bpf/prog_tests/sock_destroy_release.c     | 136 ++++++++++++++++++
 .../bpf/progs/sock_destroy_release.c          |  59 ++++++++
 2 files changed, 195 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy_release.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_release.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy_release.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy_release.c
new file mode 100644
index 000000000000..a2db4f31f6fa
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy_release.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include "network_helpers.h"
+#include "sock_destroy_release.skel.h"
+
+#define TEST_NS "sock_destroy_release_netns"
+
+static __u64 socket_cookie(int fd)
+{
+	__u64 cookie;
+	socklen_t cookie_len = sizeof(cookie);
+
+	if (!ASSERT_OK(getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie,
+				  &cookie_len), "getsockopt(SO_COOKIE)"))
+		return 0;
+	return cookie;
+}
+
+static void destroy(struct sock_destroy_release *skel, int fd, int sock_type)
+{
+	__u64 cookie = socket_cookie(fd);
+	struct bpf_link *link = NULL;
+	int iter_fd = -1;
+	int nread;
+	__u64 out;
+
+	skel->bss->abort_cookie = cookie;
+
+	link = bpf_program__attach_iter(sock_type == SOCK_STREAM ?
+					skel->progs.abort_tcp :
+					skel->progs.abort_udp, NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto done;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_OK_FD(iter_fd, "bpf_iter_create"))
+		goto done;
+
+	/* Delete matching socket. */
+	nread = read(iter_fd, &out, sizeof(out));
+	ASSERT_GE(nread, 0, "nread");
+	if (nread)
+		ASSERT_EQ(out, cookie, "cookie matches");
+done:
+	if (iter_fd >= 0)
+		close(iter_fd);
+	bpf_link__destroy(link);
+	close(fd);
+}
+
+static void do_test(struct sock_destroy_release *skel, int sock_type,
+		    int family)
+{
+	const char *addr = family == AF_INET ? "127.0.0.1" : "::1";
+	int listen_fd = -1, connect_fd = -1;
+	static const int port = 10001;
+
+	listen_fd = start_server(family, sock_type, addr, port, 0);
+	if (!ASSERT_OK_FD(listen_fd, "start_server"))
+		goto cleanup;
+
+	connect_fd = connect_to_fd(listen_fd, 0);
+	if (!ASSERT_OK_FD(connect_fd, "connect_to_fd"))
+		goto cleanup;
+
+	memset(&skel->bss->sk, 0, sizeof(skel->bss->sk));
+	destroy(skel, connect_fd, sock_type);
+	connect_fd = -1;
+	if (!ASSERT_EQ(ntohs(skel->bss->sk.dst_port), port, "dst_port"))
+		goto cleanup;
+	if (family == AF_INET) {
+		if (!ASSERT_EQ(ntohl(skel->bss->sk.dst_ip4), 0x7f000001,
+			       "dst_ip4"))
+			goto cleanup;
+	} else {
+		if (!ASSERT_EQ(skel->bss->sk.dst_ip6[0], 0, "dst_ip6[0]"))
+			goto cleanup;
+		if (!ASSERT_EQ(skel->bss->sk.dst_ip6[1], 0, "dst_ip6[1]"))
+			goto cleanup;
+		if (!ASSERT_EQ(skel->bss->sk.dst_ip6[2], 0, "dst_ip6[2]"))
+			goto cleanup;
+		if (!ASSERT_EQ(ntohl(skel->bss->sk.dst_ip6[3]), 0x1,
+			       "dst_ip6[3]"))
+			goto cleanup;
+	}
+cleanup:
+	if (connect_fd >= 0)
+		close(connect_fd);
+	if (listen_fd >= 0)
+		close(listen_fd);
+}
+
+void test_sock_destroy_release(void)
+{
+	struct sock_destroy_release *skel = NULL;
+	struct nstoken *nstoken = NULL;
+	int cgroup_fd = -1;
+
+	skel = sock_destroy_release__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		goto done;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy_release");
+	if (!ASSERT_OK_FD(cgroup_fd, "join_cgroup"))
+		goto done;
+
+	skel->links.sock_release = bpf_program__attach_cgroup(
+		skel->progs.sock_release, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_release, "attach_cgroup"))
+		goto done;
+
+	SYS_NOFAIL("ip netns del " TEST_NS);
+	SYS(done, "ip netns add %s", TEST_NS);
+	SYS(done, "ip -net %s link set dev lo up", TEST_NS);
+
+	nstoken = open_netns(TEST_NS);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto done;
+
+	if (test__start_subtest("tcp")) {
+		do_test(skel, SOCK_STREAM, AF_INET);
+		do_test(skel, SOCK_STREAM, AF_INET6);
+	}
+	if (test__start_subtest("udp")) {
+		do_test(skel, SOCK_DGRAM, AF_INET);
+		do_test(skel, SOCK_DGRAM, AF_INET6);
+	}
+done:
+	if (nstoken)
+		close_netns(nstoken);
+	if (cgroup_fd >= 0)
+		close(cgroup_fd);
+	SYS_NOFAIL("ip netns del " TEST_NS);
+	sock_destroy_release__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_release.c b/tools/testing/selftests/bpf/progs/sock_destroy_release.c
new file mode 100644
index 000000000000..add322ab1303
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_release.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_kfuncs.h"
+
+volatile __u64 abort_cookie;
+
+void maybe_abort(struct sock_common *sk, struct seq_file *seq)
+{
+	__u64 sock_cookie;
+
+	if (!sk)
+		return;
+
+	sock_cookie = bpf_get_socket_cookie(sk);
+	if (sock_cookie != abort_cookie)
+		return;
+
+	bpf_sock_destroy(sk);
+	bpf_seq_write(seq, &sock_cookie, sizeof(sock_cookie));
+}
+
+SEC("iter/udp")
+int abort_udp(struct bpf_iter__udp *ctx)
+{
+	maybe_abort((struct sock_common *)ctx->udp_sk,
+		    ctx->meta->seq);
+
+	return 0;
+}
+
+SEC("iter/tcp")
+int abort_tcp(struct bpf_iter__tcp *ctx)
+{
+	maybe_abort((struct sock_common *)ctx->sk_common,
+		    ctx->meta->seq);
+
+	return 0;
+}
+
+struct bpf_sock sk = {};
+
+SEC("cgroup/sock_release")
+int sock_release(struct bpf_sock *ctx)
+{
+	sk.dst_ip4 = ctx->dst_ip4;
+	sk.dst_ip6[0] = ctx->dst_ip6[0];
+	sk.dst_ip6[1] = ctx->dst_ip6[1];
+	sk.dst_ip6[2] = ctx->dst_ip6[2];
+	sk.dst_ip6[3] = ctx->dst_ip6[3];
+	sk.dst_port = ctx->dst_port;
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-03 17:01 ` [PATCH net-next v2 1/2] udp: " Jordan Rife
@ 2026-03-04 21:21   ` Kuniyuki Iwashima
  2026-03-04 23:38     ` Willem de Bruijn
  2026-03-04 23:52     ` Jordan Rife
  0 siblings, 2 replies; 10+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-04 21:21 UTC (permalink / raw)
  To: jrife
  Cc: andrii, bpf, daniel, edumazet, kuba, martin.lau, netdev, sdf,
	willemdebruijn.kernel, yusuke.suzuki

From: Jordan Rife <jrife@google.com>
Date: Tue,  3 Mar 2026 17:01:02 +0000
> Currently, udp_abort() invokes __udp_disconnect() which clears out
> socket fields like inet_daddr and inet_dport. This makes fields like
> dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> hooks following an abort on a connected UDP socket, since they'll always
> equal zero. This differs from the behavior for TCP sockets where a
> cgroup/sock_release hook will see the address and port that the socket
> was connected to at the time it was aborted. This causes issues in
> Cilium where a sock_release hook is used to perform map maintenance when
> sockets are released, since Cilium actively destroys connected UDP
> sockets in some cases.

Just curious why Cilium does not unregister the socket just
after aborting it ?

At least Cilium should know the key at that moment.


> 
> Make the behavior consistent between TCP and UDP sockets by not clearing

This patch makes another inconsistency with TCP.

As you mentioned in the cover letter, before the patch, even if
aborted, a bound socket still reserves a port, but it does not now.

This might be surprising for non-reuseaddr/port sockets.


> out these fields in udp_abort(). Instead, just unhash the socket, mark
> its state as TCP_CLOSE, and preserve the state of the other fields.
> 
> Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> Signed-off-by: Jordan Rife <jrife@google.com>
> ---
>  net/ipv4/udp.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index b96e47f1c8a2..01a0a0fbcd56 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
>  
>  	sk->sk_err = err;
>  	sk_error_report(sk);
> -	__udp_disconnect(sk, 0);

How about passing another flag and avoid unwanted init only for
bpf iterator ?  This will need another layer of __udp_disconnect()
like ____udp_disconnect() (I'm not good at naming), or must make
sure the flag is never used in uAPI.



> +	sk->sk_state = TCP_CLOSE;
> +	sk->sk_prot->unhash(sk);
> +	sk_dst_reset(sk);
>  
>  out:
>  	if (!has_current_bpf_ctx())
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-04 21:21   ` Kuniyuki Iwashima
@ 2026-03-04 23:38     ` Willem de Bruijn
  2026-03-04 23:52     ` Jordan Rife
  1 sibling, 0 replies; 10+ messages in thread
From: Willem de Bruijn @ 2026-03-04 23:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima, jrife
  Cc: andrii, bpf, daniel, edumazet, kuba, martin.lau, netdev, sdf,
	willemdebruijn.kernel, yusuke.suzuki

Kuniyuki Iwashima wrote:
> From: Jordan Rife <jrife@google.com>
> Date: Tue,  3 Mar 2026 17:01:02 +0000
> > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > socket fields like inet_daddr and inet_dport. This makes fields like
> > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > hooks following an abort on a connected UDP socket, since they'll always
> > equal zero. This differs from the behavior for TCP sockets where a
> > cgroup/sock_release hook will see the address and port that the socket
> > was connected to at the time it was aborted. This causes issues in
> > Cilium where a sock_release hook is used to perform map maintenance when
> > sockets are released, since Cilium actively destroys connected UDP
> > sockets in some cases.
> 
> Just curious why Cilium does not unregister the socket just
> after aborting it ?
> 
> At least Cilium should know the key at that moment.
> 
> 
> > 
> > Make the behavior consistent between TCP and UDP sockets by not clearing
> 
> This patch makes another inconsistency with TCP.
> 
> As you mentioned in the cover letter, before the patch, even if
> aborted, a bound socket still reserves a port, but it does not now.
> 
> This might be surprising for non-reuseaddr/port sockets.
> 
> 
> > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > its state as TCP_CLOSE, and preserve the state of the other fields.
> > 
> > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > Signed-off-by: Jordan Rife <jrife@google.com>
> > ---
> >  net/ipv4/udp.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index b96e47f1c8a2..01a0a0fbcd56 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> >  
> >  	sk->sk_err = err;
> >  	sk_error_report(sk);
> > -	__udp_disconnect(sk, 0);
> 
> How about passing another flag and avoid unwanted init only for
> bpf iterator ?  This will need another layer of __udp_disconnect()
> like ____udp_disconnect() (I'm not good at naming), or must make
> sure the flag is never used in uAPI.

If the only goal is to not clear inet_daddr and inet_dport in
__udp_disconnect, then another variant of that that takes an extra
bool might indeed be the more obviously correct approach.

> 
> > +	sk->sk_state = TCP_CLOSE;
> > +	sk->sk_prot->unhash(sk);
> > +	sk_dst_reset(sk);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-04 21:21   ` Kuniyuki Iwashima
  2026-03-04 23:38     ` Willem de Bruijn
@ 2026-03-04 23:52     ` Jordan Rife
  2026-03-05  3:25       ` Kuniyuki Iwashima
  1 sibling, 1 reply; 10+ messages in thread
From: Jordan Rife @ 2026-03-04 23:52 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, bpf, daniel, edumazet, kuba, martin.lau, netdev, sdf,
	willemdebruijn.kernel, yusuke.suzuki

On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> From: Jordan Rife <jrife@google.com>
> Date: Tue,  3 Mar 2026 17:01:02 +0000
> > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > socket fields like inet_daddr and inet_dport. This makes fields like
> > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > hooks following an abort on a connected UDP socket, since they'll always
> > equal zero. This differs from the behavior for TCP sockets where a
> > cgroup/sock_release hook will see the address and port that the socket
> > was connected to at the time it was aborted. This causes issues in
> > Cilium where a sock_release hook is used to perform map maintenance when
> > sockets are released, since Cilium actively destroys connected UDP
> > sockets in some cases.
> 
> Just curious why Cilium does not unregister the socket just
> after aborting it ?
> 
> At least Cilium should know the key at that moment.

Originally, socket aborts in Cilium were handled with sock_diag (and
still are on older kernels), so you'd need to do some map maintenance in
userspace directly before or after aborting the socket. I suppose it was
just cleaner to put common cleanup code in a sock_release hook.

> > 
> > Make the behavior consistent between TCP and UDP sockets by not clearing
> 
> This patch makes another inconsistency with TCP.
> 
> As you mentioned in the cover letter, before the patch, even if
> aborted, a bound socket still reserves a port, but it does not now.
> 
> This might be surprising for non-reuseaddr/port sockets.

Yeah, true. I'm curious if this behavior is something that is useful at
all or simply an accident of the current implementation? If I'm
aborting/destroying a socket, why should it hold onto bound ports until
the socket is closed? I wasn't quite sure about this.

> > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > its state as TCP_CLOSE, and preserve the state of the other fields.
> > 
> > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > Signed-off-by: Jordan Rife <jrife@google.com>
> > ---
> >  net/ipv4/udp.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index b96e47f1c8a2..01a0a0fbcd56 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> >  
> >  	sk->sk_err = err;
> >  	sk_error_report(sk);
> > -	__udp_disconnect(sk, 0);
> 
> How about passing another flag and avoid unwanted init only for
> bpf iterator ?  This will need another layer of __udp_disconnect()
> like ____udp_disconnect() (I'm not good at naming), or must make
> sure the flag is never used in uAPI.

To make sure we're on the same page about what you're describing, I'm
imagining:

static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
{                                                 
	same as before, but if reinit == false then avoid reinitializing
	fields like daddr/dport while holding onto reserved ports
}

int __udp_disconnect(struct sock *sk, int flags)
{
	___udp_disconnect(sk, flags, true);
}

...

int udp_abort(struct sock *sk, int err)
{
	...
	___udp_disconnect(sk, flags, false); <--- set reinit arg to false
	...
}

This would have the same behavior with regards to reserved ports
following udp_abort() while preserving fields like daddr. I was
considering something like this initially, but it wasn't clear to me why
you'd want to hold onto reserved ports if the socket was destroyed as I
mentioned above. If it's important to maintain that aspect of the
behavior, then I'm OK with this approach.

> 
> 
> 
> > +	sk->sk_state = TCP_CLOSE;
> > +	sk->sk_prot->unhash(sk);
> > +	sk_dst_reset(sk);
> >  
> >  out:
> >  	if (!has_current_bpf_ctx())
> > -- 
> > 2.53.0.473.g4a7958ca14-goog
> > 

Thanks!

Jordan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-04 23:52     ` Jordan Rife
@ 2026-03-05  3:25       ` Kuniyuki Iwashima
  2026-03-05 18:03         ` Jordan Rife
  0 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-05  3:25 UTC (permalink / raw)
  To: jrife
  Cc: andrii, bpf, daniel, edumazet, kuba, kuniyu, martin.lau, netdev,
	sdf, willemdebruijn.kernel, yusuke.suzuki

From: Jordan Rife <jrife@google.com>
Date: Wed, 4 Mar 2026 23:52:22 +0000
> On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> > From: Jordan Rife <jrife@google.com>
> > Date: Tue,  3 Mar 2026 17:01:02 +0000
> > > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > > socket fields like inet_daddr and inet_dport. This makes fields like
> > > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > > hooks following an abort on a connected UDP socket, since they'll always
> > > equal zero. This differs from the behavior for TCP sockets where a
> > > cgroup/sock_release hook will see the address and port that the socket
> > > was connected to at the time it was aborted. This causes issues in
> > > Cilium where a sock_release hook is used to perform map maintenance when
> > > sockets are released, since Cilium actively destroys connected UDP
> > > sockets in some cases.
> > 
> > Just curious why Cilium does not unregister the socket just
> > after aborting it ?
> > 
> > At least Cilium should know the key at that moment.
> 
> Originally, socket aborts in Cilium were handled with sock_diag (and
> still are on older kernels), so you'd need to do some map maintenance in
> userspace directly before or after aborting the socket. I suppose it was
> just cleaner to put common cleanup code in a sock_release hook.
> 
> > > 
> > > Make the behavior consistent between TCP and UDP sockets by not clearing
> > 
> > This patch makes another inconsistency with TCP.
> > 
> > As you mentioned in the cover letter, before the patch, even if
> > aborted, a bound socket still reserves a port, but it does not now.
> > 
> > This might be surprising for non-reuseaddr/port sockets.
> 
> Yeah, true. I'm curious if this behavior is something that is useful at
> all or simply an accident of the current implementation? If I'm
> aborting/destroying a socket, why should it hold onto bound ports until
> the socket is closed? I wasn't quite sure about this.

Digging into the first (long) thread of SOCK_DESTROY (TCP),
the motivation was to kill sockets bound to an IP address
that Android/laptop removes when it switches networks.
https://lore.kernel.org/netdev/1447811024-8553-1-git-send-email-lorenzo@google.com/

In that sense, holding a port seems unnecessary.

But there are a few concerns like

  * someone could use the feature for server and reuse port
    too quickly.
    https://lore.kernel.org/netdev/1447879416.562854.443622857.62708268@webmail.messagingengine.com/

  * it's a bit worrisome if this feature is viewed as a way for
    resouce management.
    https://lore.kernel.org/netdev/CALx6S35Q_pc-vBNpA0hipn5K-vSmPDsV-V=D13sODESNO7Htdw@mail.gmail.com/

Considering these, it seems fair to only notify an error
via epoll or blocked syscall and let the application to
close() sockets.

In a thread, the author mentioned this :

---8<---
Its purpose is to interrupt
blocked userspace socket calls, not to release resources.
---8<---
https://lore.kernel.org/netdev/CAKD1Yr0Gpr6Ex2i1TXEWyETn48P4g5vBvFZ9=_g+Lz_7-PqBDQ@mail.gmail.com/

Also, there was an assumption that a sane application will
close() sockets after receiving errors.
https://lore.kernel.org/netdev/1447972381.22599.278.camel@edumazet-glaptop2.roam.corp.google.com/

And yes, this does not apply to UDP, and sadly UDP patch's
thread has no such context nor discussion.
https://lore.kernel.org/netdev/?q=udp_abort&o=-1

But anyway, now I think we can solve your problem more
easily, see below.


> 
> > > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > > its state as TCP_CLOSE, and preserve the state of the other fields.
> > > 
> > > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > > Signed-off-by: Jordan Rife <jrife@google.com>
> > > ---
> > >  net/ipv4/udp.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index b96e47f1c8a2..01a0a0fbcd56 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> > >  
> > >  	sk->sk_err = err;
> > >  	sk_error_report(sk);
> > > -	__udp_disconnect(sk, 0);
> > 
> > How about passing another flag and avoid unwanted init only for
> > bpf iterator ?  This will need another layer of __udp_disconnect()
> > like ____udp_disconnect() (I'm not good at naming), or must make
> > sure the flag is never used in uAPI.
> 
> To make sure we're on the same page about what you're describing, I'm
> imagining:
> 
> static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
> {                                                 
> 	same as before, but if reinit == false then avoid reinitializing
> 	fields like daddr/dport while holding onto reserved ports
> }
> 
> int __udp_disconnect(struct sock *sk, int flags)
> {
> 	___udp_disconnect(sk, flags, true);
> }
> 
> ...
> 
> int udp_abort(struct sock *sk, int err)
> {
> 	...
> 	___udp_disconnect(sk, flags, false); <--- set reinit arg to false
> 	...
> }
> 
> This would have the same behavior with regards to reserved ports
> following udp_abort() while preserving fields like daddr. I was
> considering something like this initially, but it wasn't clear to me why
> you'd want to hold onto reserved ports if the socket was destroyed as I
> mentioned above. If it's important to maintain that aspect of the
> behavior, then I'm OK with this approach.

This was what's in my mind, but probably we could just avoid
the initialisation:

---8<---
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ce49ed2bd36f..d44be0020446 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2213,21 +2213,21 @@ int __udp_disconnect(struct sock *sk, int flags)
 	 */
 
 	sk->sk_state = TCP_CLOSE;
-	inet->inet_daddr = 0;
-	inet->inet_dport = 0;
 	sock_rps_reset_rxhash(sk);
 	sk->sk_bound_dev_if = 0;
-	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
-		inet_reset_saddr(sk);
-		if (sk->sk_prot->rehash &&
-		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
-			sk->sk_prot->rehash(sk);
-	}
 
 	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
 		sk->sk_prot->unhash(sk);
-		inet->inet_sport = 0;
+	} else if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
+		if (sk->sk_prot->rehash &&
+		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
+			inet_reset_saddr(sk);
+			inet->inet_daddr = 0;
+			inet->inet_dport = 0;
+			sk->sk_prot->rehash(sk);
+		}
 	}
+
 	sk_dst_reset(sk);
 	return 0;
 }
---8<---

When we need to rehash(), of course we have to clear
saddr/daddr/dport, but otherwise, we don't, even sport.

When ->unhash()ed, only thing we need to do is to clear
inet_sk(sk)->inet_num.

Then, even if bind() and connect() are called later, it
works because each syscall does not care about the old
saddr/sport/daddr/dport and just assign a new value if
inet->inet_num == 0.

This seems a bit fragile, but TCP already relies on it and
it's unlikely that we change the behaviour in __inet_bind(),
inet_autobind(), and inet_dgram_connect().

And all the other users of __udp_disconnect() are also
AF_INET or AF_INET6, l2tp, ping, raw.


> 
> > 
> > 
> > 
> > > +	sk->sk_state = TCP_CLOSE;
> > > +	sk->sk_prot->unhash(sk);
> > > +	sk_dst_reset(sk);
> > >  
> > >  out:
> > >  	if (!has_current_bpf_ctx())
> > > -- 
> > > 2.53.0.473.g4a7958ca14-goog
> > > 
> 
> Thanks!
> 
> Jordan

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-05  3:25       ` Kuniyuki Iwashima
@ 2026-03-05 18:03         ` Jordan Rife
  2026-03-05 19:31           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 10+ messages in thread
From: Jordan Rife @ 2026-03-05 18:03 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, bpf, daniel, edumazet, kuba, martin.lau, netdev, sdf,
	willemdebruijn.kernel, yusuke.suzuki

On Thu, Mar 05, 2026 at 03:25:21AM +0000, Kuniyuki Iwashima wrote:
> From: Jordan Rife <jrife@google.com>
> Date: Wed, 4 Mar 2026 23:52:22 +0000
> > On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> > > From: Jordan Rife <jrife@google.com>
> > > Date: Tue,  3 Mar 2026 17:01:02 +0000
> > > > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > > > socket fields like inet_daddr and inet_dport. This makes fields like
> > > > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > > > hooks following an abort on a connected UDP socket, since they'll always
> > > > equal zero. This differs from the behavior for TCP sockets where a
> > > > cgroup/sock_release hook will see the address and port that the socket
> > > > was connected to at the time it was aborted. This causes issues in
> > > > Cilium where a sock_release hook is used to perform map maintenance when
> > > > sockets are released, since Cilium actively destroys connected UDP
> > > > sockets in some cases.
> > > 
> > > Just curious why Cilium does not unregister the socket just
> > > after aborting it ?
> > > 
> > > At least Cilium should know the key at that moment.
> > 
> > Originally, socket aborts in Cilium were handled with sock_diag (and
> > still are on older kernels), so you'd need to do some map maintenance in
> > userspace directly before or after aborting the socket. I suppose it was
> > just cleaner to put common cleanup code in a sock_release hook.
> > 
> > > > 
> > > > Make the behavior consistent between TCP and UDP sockets by not clearing
> > > 
> > > This patch makes another inconsistency with TCP.
> > > 
> > > As you mentioned in the cover letter, before the patch, even if
> > > aborted, a bound socket still reserves a port, but it does not now.
> > > 
> > > This might be surprising for non-reuseaddr/port sockets.
> > 
> > Yeah, true. I'm curious if this behavior is something that is useful at
> > all or simply an accident of the current implementation? If I'm
> > aborting/destroying a socket, why should it hold onto bound ports until
> > the socket is closed? I wasn't quite sure about this.
> 
> Digging into the first (long) thread of SOCK_DESTROY (TCP),
> the motivation was to kill sockets bound to an IP address
> that Android/laptop removes when it switches networks.
> https://lore.kernel.org/netdev/1447811024-8553-1-git-send-email-lorenzo@google.com/
> 
> In that sense, holding a port seems unnecessary.
> 
> But there are a few concerns like
> 
>   * someone could use the feature for server and reuse port
>     too quickly.
>     https://lore.kernel.org/netdev/1447879416.562854.443622857.62708268@webmail.messagingengine.com/
> 
>   * it's a bit worrisome if this feature is viewed as a way for
>     resouce management.
>     https://lore.kernel.org/netdev/CALx6S35Q_pc-vBNpA0hipn5K-vSmPDsV-V=D13sODESNO7Htdw@mail.gmail.com/
> 
> Considering these, it seems fair to only notify an error
> via epoll or blocked syscall and let the application to
> close() sockets.
> 
> In a thread, the author mentioned this :
> 
> ---8<---
> Its purpose is to interrupt
> blocked userspace socket calls, not to release resources.
> ---8<---
> https://lore.kernel.org/netdev/CAKD1Yr0Gpr6Ex2i1TXEWyETn48P4g5vBvFZ9=_g+Lz_7-PqBDQ@mail.gmail.com/
> 
> Also, there was an assumption that a sane application will
> close() sockets after receiving errors.
> https://lore.kernel.org/netdev/1447972381.22599.278.camel@edumazet-glaptop2.roam.corp.google.com/
> 
> And yes, this does not apply to UDP, and sadly UDP patch's
> thread has no such context nor discussion.
> https://lore.kernel.org/netdev/?q=udp_abort&o=-1
> 

Interesting, thanks for taking the time to dig into this.

> But anyway, now I think we can solve your problem more
> easily, see below.
> 
> 
> > 
> > > > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > > > its state as TCP_CLOSE, and preserve the state of the other fields.
> > > > 
> > > > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > > > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > > > Signed-off-by: Jordan Rife <jrife@google.com>
> > > > ---
> > > >  net/ipv4/udp.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index b96e47f1c8a2..01a0a0fbcd56 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> > > >  
> > > >  	sk->sk_err = err;
> > > >  	sk_error_report(sk);
> > > > -	__udp_disconnect(sk, 0);
> > > 
> > > How about passing another flag and avoid unwanted init only for
> > > bpf iterator ?  This will need another layer of __udp_disconnect()
> > > like ____udp_disconnect() (I'm not good at naming), or must make
> > > sure the flag is never used in uAPI.
> > 
> > To make sure we're on the same page about what you're describing, I'm
> > imagining:
> > 
> > static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
> > {                                                 
> > 	same as before, but if reinit == false then avoid reinitializing
> > 	fields like daddr/dport while holding onto reserved ports
> > }
> > 
> > int __udp_disconnect(struct sock *sk, int flags)
> > {
> > 	___udp_disconnect(sk, flags, true);
> > }
> > 
> > ...
> > 
> > int udp_abort(struct sock *sk, int err)
> > {
> > 	...
> > 	___udp_disconnect(sk, flags, false); <--- set reinit arg to false
> > 	...
> > }
> > 
> > This would have the same behavior with regards to reserved ports
> > following udp_abort() while preserving fields like daddr. I was
> > considering something like this initially, but it wasn't clear to me why
> > you'd want to hold onto reserved ports if the socket was destroyed as I
> > mentioned above. If it's important to maintain that aspect of the
> > behavior, then I'm OK with this approach.
> 
> This was what's in my mind, but probably we could just avoid
> the initialisation:
> 
> ---8<---
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ce49ed2bd36f..d44be0020446 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2213,21 +2213,21 @@ int __udp_disconnect(struct sock *sk, int flags)
>  	 */
>  
>  	sk->sk_state = TCP_CLOSE;
> -	inet->inet_daddr = 0;
> -	inet->inet_dport = 0;
>  	sock_rps_reset_rxhash(sk);
>  	sk->sk_bound_dev_if = 0;
> -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> -		inet_reset_saddr(sk);
> -		if (sk->sk_prot->rehash &&
> -		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
> -			sk->sk_prot->rehash(sk);
> -	}
>  
>  	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
>  		sk->sk_prot->unhash(sk);
> -		inet->inet_sport = 0;
> +	} else if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> +		if (sk->sk_prot->rehash &&
> +		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
> +			inet_reset_saddr(sk);
> +			inet->inet_daddr = 0;
> +			inet->inet_dport = 0;
> +			sk->sk_prot->rehash(sk);
> +		}
>  	}
> +
>  	sk_dst_reset(sk);
>  	return 0;
>  }
> ---8<---
> 
> When we need to rehash(), of course we have to clear
> saddr/daddr/dport, but otherwise, we don't, even sport.

What about the case where you bind to a port and use connect?

1. socket bind() to 0.0.0.0:10000
2. socket connect() to 1.2.3.4:20000
3. diag_destroy -> udp_abort()
4. socket close()
5. sock_release hook runs and daddr/dport have been cleared

compared to just connect:

1. socket connect() to 1.2.3.4:20000
2. diag_destroy -> udp_abort()
3. socket close()
4. sock_release hook runs and daddr/dport remain intact

It seems confusing that these two sequences would behave differently.
 
> When ->unhash()ed, only thing we need to do is to clear
> inet_sk(sk)->inet_num.
> 
> Then, even if bind() and connect() are called later, it
> works because each syscall does not care about the old
> saddr/sport/daddr/dport and just assign a new value if
> inet->inet_num == 0.
> 
> This seems a bit fragile, but TCP already relies on it and
> it's unlikely that we change the behaviour in __inet_bind(),
> inet_autobind(), and inet_dgram_connect().
> 
> And all the other users of __udp_disconnect() are also
> AF_INET or AF_INET6, l2tp, ping, raw.

Taking another look at tcp_abort() to compare how it behaves, I saw the
following:

tcp_abort(sk, err)
  maybe send active rst
  tcp_done_with_error(sk, err)
    tcp_done(sk)
      tcp_set_state(sk, TCP_CLOSE)
        sk->sk_prot->unhash(sk) if state == TCP_CLOSE
	  inet_unhash(sk)
	    removes sk from lhash/ehash

I don't see a scenario where diag_destroy (tcp_abort) doesn't lead to
the socket getting immediately unhashed, even if bound to a specific
port (SOCK_BINDPORT_LOCK). This seems consistent with my observations
in [1] where I used diag_destroy to immediately remove TCP sockets from
the ehash while testing socket iterators. In contrast, I remember in [2]
encountering this rehash behavior with UDP sockets leading to seeing the
same socket twice during iteration if you bpf_sock_destroy()ed the
socket. This wasn't inherently a problem at the time, but it was a bit
surprising.

> This patch makes another inconsistency with TCP.

All this is to say, I'm still not sure why preserving how rehash works
in the case of udp_abort() is necessary, and doing so can result in some
unexpected behavior both in the case of socket iteration and
sock_release as in the example above. To keep things consistent with
TCP, doesn't it make more sense to just unhash instead of sometimes
keeping the socket in the UDP hash under certain conditions? Am I just
not seeing the place where a TCP socket would hold onto a port after you
destroy it?

[1]: https://lore.kernel.org/netdev/20250714180919.127192-13-jordan@jrife.io/
[2]: https://lore.kernel.org/bpf/CADKFtnQyiz_r_vfyYfTvzi3MvNpRt62mDrNyEvp9tm82UcSFjQ@mail.gmail.com/

> 
> 
> > 
> > > 
> > > 
> > > 
> > > > +	sk->sk_state = TCP_CLOSE;
> > > > +	sk->sk_prot->unhash(sk);
> > > > +	sk_dst_reset(sk);
> > > >  
> > > >  out:
> > > >  	if (!has_current_bpf_ctx())
> > > > -- 
> > > > 2.53.0.473.g4a7958ca14-goog
> > > > 
> > 
> > Thanks!
> > 
> > Jordan

Thanks,
Jordan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-05 18:03         ` Jordan Rife
@ 2026-03-05 19:31           ` Kuniyuki Iwashima
  2026-03-05 23:18             ` Jordan Rife
  0 siblings, 1 reply; 10+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-05 19:31 UTC (permalink / raw)
  To: jrife
  Cc: andrii, bpf, daniel, edumazet, kuba, kuniyu, martin.lau, netdev,
	sdf, willemdebruijn.kernel, yusuke.suzuki

From: Jordan Rife <jrife@google.com>
Date: Thu, 5 Mar 2026 18:03:09 +0000
> On Thu, Mar 05, 2026 at 03:25:21AM +0000, Kuniyuki Iwashima wrote:
> > From: Jordan Rife <jrife@google.com>
> > Date: Wed, 4 Mar 2026 23:52:22 +0000
> > > On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> > > > From: Jordan Rife <jrife@google.com>
> > > > Date: Tue,  3 Mar 2026 17:01:02 +0000
> > > > > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > > > > socket fields like inet_daddr and inet_dport. This makes fields like
> > > > > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > > > > hooks following an abort on a connected UDP socket, since they'll always
> > > > > equal zero. This differs from the behavior for TCP sockets where a
> > > > > cgroup/sock_release hook will see the address and port that the socket
> > > > > was connected to at the time it was aborted. This causes issues in
> > > > > Cilium where a sock_release hook is used to perform map maintenance when
> > > > > sockets are released, since Cilium actively destroys connected UDP
> > > > > sockets in some cases.
> > > > 
> > > > Just curious why Cilium does not unregister the socket just
> > > > after aborting it ?
> > > > 
> > > > At least Cilium should know the key at that moment.
> > > 
> > > Originally, socket aborts in Cilium were handled with sock_diag (and
> > > still are on older kernels), so you'd need to do some map maintenance in
> > > userspace directly before or after aborting the socket. I suppose it was
> > > just cleaner to put common cleanup code in a sock_release hook.
> > > 
> > > > > 
> > > > > Make the behavior consistent between TCP and UDP sockets by not clearing
> > > > 
> > > > This patch makes another inconsistency with TCP.
> > > > 
> > > > As you mentioned in the cover letter, before the patch, even if
> > > > aborted, a bound socket still reserves a port, but it does not now.
> > > > 
> > > > This might be surprising for non-reuseaddr/port sockets.
> > > 
> > > Yeah, true. I'm curious if this behavior is something that is useful at
> > > all or simply an accident of the current implementation? If I'm
> > > aborting/destroying a socket, why should it hold onto bound ports until
> > > the socket is closed? I wasn't quite sure about this.
> > 
> > Digging into the first (long) thread of SOCK_DESTROY (TCP),
> > the motivation was to kill sockets bound to an IP address
> > that Android/laptop removes when it switches networks.
> > https://lore.kernel.org/netdev/1447811024-8553-1-git-send-email-lorenzo@google.com/
> > 
> > In that sense, holding a port seems unnecessary.
> > 
> > But there are a few concerns like
> > 
> >   * someone could use the feature for server and reuse port
> >     too quickly.
> >     https://lore.kernel.org/netdev/1447879416.562854.443622857.62708268@webmail.messagingengine.com/
> > 
> >   * it's a bit worrisome if this feature is viewed as a way for
> >     resouce management.
> >     https://lore.kernel.org/netdev/CALx6S35Q_pc-vBNpA0hipn5K-vSmPDsV-V=D13sODESNO7Htdw@mail.gmail.com/
> > 
> > Considering these, it seems fair to only notify an error
> > via epoll or blocked syscall and let the application to
> > close() sockets.
> > 
> > In a thread, the author mentioned this :
> > 
> > ---8<---
> > Its purpose is to interrupt
> > blocked userspace socket calls, not to release resources.
> > ---8<---
> > https://lore.kernel.org/netdev/CAKD1Yr0Gpr6Ex2i1TXEWyETn48P4g5vBvFZ9=_g+Lz_7-PqBDQ@mail.gmail.com/
> > 
> > Also, there was an assumption that a sane application will
> > close() sockets after receiving errors.
> > https://lore.kernel.org/netdev/1447972381.22599.278.camel@edumazet-glaptop2.roam.corp.google.com/
> > 
> > And yes, this does not apply to UDP, and sadly UDP patch's
> > thread has no such context nor discussion.
> > https://lore.kernel.org/netdev/?q=udp_abort&o=-1
> > 
> 
> Interesting, thanks for taking the time to dig into this.
> 
> > But anyway, now I think we can solve your problem more
> > easily, see below.
> > 
> > 
> > > 
> > > > > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > > > > its state as TCP_CLOSE, and preserve the state of the other fields.
> > > > > 
> > > > > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > > > > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > > > > Signed-off-by: Jordan Rife <jrife@google.com>
> > > > > ---
> > > > >  net/ipv4/udp.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > index b96e47f1c8a2..01a0a0fbcd56 100644
> > > > > --- a/net/ipv4/udp.c
> > > > > +++ b/net/ipv4/udp.c
> > > > > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> > > > >  
> > > > >  	sk->sk_err = err;
> > > > >  	sk_error_report(sk);
> > > > > -	__udp_disconnect(sk, 0);
> > > > 
> > > > How about passing another flag and avoid unwanted init only for
> > > > bpf iterator ?  This will need another layer of __udp_disconnect()
> > > > like ____udp_disconnect() (I'm not good at naming), or must make
> > > > sure the flag is never used in uAPI.
> > > 
> > > To make sure we're on the same page about what you're describing, I'm
> > > imagining:
> > > 
> > > static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
> > > {                                                 
> > > 	same as before, but if reinit == false then avoid reinitializing
> > > 	fields like daddr/dport while holding onto reserved ports
> > > }
> > > 
> > > int __udp_disconnect(struct sock *sk, int flags)
> > > {
> > > 	___udp_disconnect(sk, flags, true);
> > > }
> > > 
> > > ...
> > > 
> > > int udp_abort(struct sock *sk, int err)
> > > {
> > > 	...
> > > 	___udp_disconnect(sk, flags, false); <--- set reinit arg to false
> > > 	...
> > > }
> > > 
> > > This would have the same behavior with regards to reserved ports
> > > following udp_abort() while preserving fields like daddr. I was
> > > considering something like this initially, but it wasn't clear to me why
> > > you'd want to hold onto reserved ports if the socket was destroyed as I
> > > mentioned above. If it's important to maintain that aspect of the
> > > behavior, then I'm OK with this approach.
> > 
> > This was what's in my mind, but probably we could just avoid
> > the initialisation:
> > 
> > ---8<---
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index ce49ed2bd36f..d44be0020446 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2213,21 +2213,21 @@ int __udp_disconnect(struct sock *sk, int flags)
> >  	 */
> >  
> >  	sk->sk_state = TCP_CLOSE;
> > -	inet->inet_daddr = 0;
> > -	inet->inet_dport = 0;
> >  	sock_rps_reset_rxhash(sk);
> >  	sk->sk_bound_dev_if = 0;
> > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> > -		inet_reset_saddr(sk);
> > -		if (sk->sk_prot->rehash &&
> > -		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
> > -			sk->sk_prot->rehash(sk);
> > -	}
> >  
> >  	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
> >  		sk->sk_prot->unhash(sk);
> > -		inet->inet_sport = 0;
> > +	} else if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> > +		if (sk->sk_prot->rehash &&
> > +		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
> > +			inet_reset_saddr(sk);
> > +			inet->inet_daddr = 0;
> > +			inet->inet_dport = 0;
> > +			sk->sk_prot->rehash(sk);
> > +		}
> >  	}
> > +
> >  	sk_dst_reset(sk);
> >  	return 0;
> >  }
> > ---8<---
> > 
> > When we need to rehash(), of course we have to clear
> > saddr/daddr/dport, but otherwise, we don't, even sport.
> 
> What about the case where you bind to a port and use connect?
> 
> 1. socket bind() to 0.0.0.0:10000
> 2. socket connect() to 1.2.3.4:20000
> 3. diag_destroy -> udp_abort()
> 4. socket close()
> 5. sock_release hook runs and daddr/dport have been cleared
> 
> compared to just connect:
> 
> 1. socket connect() to 1.2.3.4:20000
> 2. diag_destroy -> udp_abort()
> 3. socket close()
> 4. sock_release hook runs and daddr/dport remain intact
> 
> It seems confusing that these two sequences would behave differently.

I assumed from "a connected UDP socket" that your use case
is a client without explicit bind(), but if it's not, yes,
this will not work.


>  
> > When ->unhash()ed, only thing we need to do is to clear
> > inet_sk(sk)->inet_num.
> > 
> > Then, even if bind() and connect() are called later, it
> > works because each syscall does not care about the old
> > saddr/sport/daddr/dport and just assign a new value if
> > inet->inet_num == 0.
> > 
> > This seems a bit fragile, but TCP already relies on it and
> > it's unlikely that we change the behaviour in __inet_bind(),
> > inet_autobind(), and inet_dgram_connect().
> > 
> > And all the other users of __udp_disconnect() are also
> > AF_INET or AF_INET6, l2tp, ping, raw.
> 
> Taking another look at tcp_abort() to compare how it behaves, I saw the
> following:
> 
> tcp_abort(sk, err)
>   maybe send active rst
>   tcp_done_with_error(sk, err)
>     tcp_done(sk)
>       tcp_set_state(sk, TCP_CLOSE)
>         sk->sk_prot->unhash(sk) if state == TCP_CLOSE
> 	  inet_unhash(sk)
> 	    removes sk from lhash/ehash
> 
> I don't see a scenario where diag_destroy (tcp_abort) doesn't lead to
> the socket getting immediately unhashed, even if bound to a specific
> port (SOCK_BINDPORT_LOCK).

TCP has 3 hash tables, lhash for TCP_LISTEN, ehash for other
states except for TCP_CLOSE, and bhash is for all sockets bound
to a port regardless of explicit/implicit bind().

If the socket is explicitly bind()ed and has SOCK_BINDPORT_LOCK,
inet_put_port() is not called, which does not unhash it from bhash,
meaning the socket still reserves a port.


> This seems consistent with my observations
> in [1] where I used diag_destroy to immediately remove TCP sockets from
> the ehash while testing socket iterators.

BPF iter only iterates over ehash and lhash that have sockets
with real work.  This behaviour derives from /proc/net/tcp, which
cannot catch TCP_CLOSE sockets.

Recently INET_DIAG dump (not destroy) started to support
bhash though, see 91051f0039484.


> In contrast, I remember in [2]
> encountering this rehash behavior with UDP sockets leading to seeing the
> same socket twice during iteration if you bpf_sock_destroy()ed the
> socket. This wasn't inherently a problem at the time, but it was a bit
> surprising.

Right, this happens because UDP combines bhash/ehash/lhash
into a single hash table (technically it has 1-tuple/2-tuple/4-tuple
hash tables but they are not separated in terms of resource
management).


> 
> > This patch makes another inconsistency with TCP.
> 
> All this is to say, I'm still not sure why preserving how rehash works
> in the case of udp_abort() is necessary, and doing so can result in some
> unexpected behavior both in the case of socket iteration and
> sock_release as in the example above. To keep things consistent with
> TCP, doesn't it make more sense to just unhash instead of sometimes
> keeping the socket in the UDP hash under certain conditions? Am I just
> not seeing the place where a TCP socket would hold onto a port after you
> destroy it?

As mentioned, that's bhash, and you can see TCP does
not do "just unhash" behaviour even with tcp_abort().

---8<---
# python3
>>> from socket import *
>>> 
>>> s1 = socket()
>>> s1.bind(('', 80))
>>> s1.listen()
>>> 
>>> import subprocess
>>> subprocess.run("ss -K -t state listening sport :80".split(' '))
Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port
0                     128                                         0.0.0.0:http                                      0.0.0.0:*
CompletedProcess(args=['ss', '-K', '-t', 'state', 'listening', 'sport', ':80'], returncode=0)
>>> subprocess.run("ss -tB".split(' '))
Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port
0                     0                                           0.0.0.0:http                                      0.0.0.0:*
CompletedProcess(args=['ss', '-tB'], returncode=0)
>>> 
>>> s2 = socket()
>>> s2.bind(s1.getsockname())
Traceback (most recent call last):
  File "<python-input-12>", line 1, in <module>
    s2.bind(s1.getsockname())
    ~~~~~~~^^^^^^^^^^^^^^^^^^
OSError: [Errno 98] Address already in use
---8<---



> 
> [1]: https://lore.kernel.org/netdev/20250714180919.127192-13-jordan@jrife.io/
> [2]: https://lore.kernel.org/bpf/CADKFtnQyiz_r_vfyYfTvzi3MvNpRt62mDrNyEvp9tm82UcSFjQ@mail.gmail.com/
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
  2026-03-05 19:31           ` Kuniyuki Iwashima
@ 2026-03-05 23:18             ` Jordan Rife
  0 siblings, 0 replies; 10+ messages in thread
From: Jordan Rife @ 2026-03-05 23:18 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, bpf, daniel, edumazet, kuba, martin.lau, netdev, sdf,
	willemdebruijn.kernel, yusuke.suzuki

On Thu, Mar 05, 2026 at 07:31:45PM +0000, Kuniyuki Iwashima wrote:
> From: Jordan Rife <jrife@google.com>
> Date: Thu, 5 Mar 2026 18:03:09 +0000
> > On Thu, Mar 05, 2026 at 03:25:21AM +0000, Kuniyuki Iwashima wrote:
> > > From: Jordan Rife <jrife@google.com>
> > > Date: Wed, 4 Mar 2026 23:52:22 +0000
> > > > On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> > > > > From: Jordan Rife <jrife@google.com>
> > > > > Date: Tue,  3 Mar 2026 17:01:02 +0000
> > > > > > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > > > > > socket fields like inet_daddr and inet_dport. This makes fields like
> > > > > > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > > > > > hooks following an abort on a connected UDP socket, since they'll always
> > > > > > equal zero. This differs from the behavior for TCP sockets where a
> > > > > > cgroup/sock_release hook will see the address and port that the socket
> > > > > > was connected to at the time it was aborted. This causes issues in
> > > > > > Cilium where a sock_release hook is used to perform map maintenance when
> > > > > > sockets are released, since Cilium actively destroys connected UDP
> > > > > > sockets in some cases.
> > > > > 
> > > > > Just curious why Cilium does not unregister the socket just
> > > > > after aborting it ?
> > > > > 
> > > > > At least Cilium should know the key at that moment.
> > > > 
> > > > Originally, socket aborts in Cilium were handled with sock_diag (and
> > > > still are on older kernels), so you'd need to do some map maintenance in
> > > > userspace directly before or after aborting the socket. I suppose it was
> > > > just cleaner to put common cleanup code in a sock_release hook.
> > > > 
> > > > > > 
> > > > > > Make the behavior consistent between TCP and UDP sockets by not clearing
> > > > > 
> > > > > This patch makes another inconsistency with TCP.
> > > > > 
> > > > > As you mentioned in the cover letter, before the patch, even if
> > > > > aborted, a bound socket still reserves a port, but it does not now.
> > > > > 
> > > > > This might be surprising for non-reuseaddr/port sockets.
> > > > 
> > > > Yeah, true. I'm curious if this behavior is something that is useful at
> > > > all or simply an accident of the current implementation? If I'm
> > > > aborting/destroying a socket, why should it hold onto bound ports until
> > > > the socket is closed? I wasn't quite sure about this.
> > > 
> > > Digging into the first (long) thread of SOCK_DESTROY (TCP),
> > > the motivation was to kill sockets bound to an IP address
> > > that Android/laptop removes when it switches networks.
> > > https://lore.kernel.org/netdev/1447811024-8553-1-git-send-email-lorenzo@google.com/
> > > 
> > > In that sense, holding a port seems unnecessary.
> > > 
> > > But there are a few concerns like
> > > 
> > >   * someone could use the feature for server and reuse port
> > >     too quickly.
> > >     https://lore.kernel.org/netdev/1447879416.562854.443622857.62708268@webmail.messagingengine.com/
> > > 
> > >   * it's a bit worrisome if this feature is viewed as a way for
> > >     resouce management.
> > >     https://lore.kernel.org/netdev/CALx6S35Q_pc-vBNpA0hipn5K-vSmPDsV-V=D13sODESNO7Htdw@mail.gmail.com/
> > > 
> > > Considering these, it seems fair to only notify an error
> > > via epoll or blocked syscall and let the application to
> > > close() sockets.
> > > 
> > > In a thread, the author mentioned this :
> > > 
> > > ---8<---
> > > Its purpose is to interrupt
> > > blocked userspace socket calls, not to release resources.
> > > ---8<---
> > > https://lore.kernel.org/netdev/CAKD1Yr0Gpr6Ex2i1TXEWyETn48P4g5vBvFZ9=_g+Lz_7-PqBDQ@mail.gmail.com/
> > > 
> > > Also, there was an assumption that a sane application will
> > > close() sockets after receiving errors.
> > > https://lore.kernel.org/netdev/1447972381.22599.278.camel@edumazet-glaptop2.roam.corp.google.com/
> > > 
> > > And yes, this does not apply to UDP, and sadly UDP patch's
> > > thread has no such context nor discussion.
> > > https://lore.kernel.org/netdev/?q=udp_abort&o=-1
> > > 
> > 
> > Interesting, thanks for taking the time to dig into this.
> > 
> > > But anyway, now I think we can solve your problem more
> > > easily, see below.
> > > 
> > > 
> > > > 
> > > > > > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > > > > > its state as TCP_CLOSE, and preserve the state of the other fields.
> > > > > > 
> > > > > > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > > > > > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > > > > > Signed-off-by: Jordan Rife <jrife@google.com>
> > > > > > ---
> > > > > >  net/ipv4/udp.c | 4 +++-
> > > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > > index b96e47f1c8a2..01a0a0fbcd56 100644
> > > > > > --- a/net/ipv4/udp.c
> > > > > > +++ b/net/ipv4/udp.c
> > > > > > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> > > > > >  
> > > > > >  	sk->sk_err = err;
> > > > > >  	sk_error_report(sk);
> > > > > > -	__udp_disconnect(sk, 0);
> > > > > 
> > > > > How about passing another flag and avoid unwanted init only for
> > > > > bpf iterator ?  This will need another layer of __udp_disconnect()
> > > > > like ____udp_disconnect() (I'm not good at naming), or must make
> > > > > sure the flag is never used in uAPI.
> > > > 
> > > > To make sure we're on the same page about what you're describing, I'm
> > > > imagining:
> > > > 
> > > > static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
> > > > {                                                 
> > > > 	same as before, but if reinit == false then avoid reinitializing
> > > > 	fields like daddr/dport while holding onto reserved ports
> > > > }
> > > > 
> > > > int __udp_disconnect(struct sock *sk, int flags)
> > > > {
> > > > 	___udp_disconnect(sk, flags, true);
> > > > }
> > > > 
> > > > ...
> > > > 
> > > > int udp_abort(struct sock *sk, int err)
> > > > {
> > > > 	...
> > > > 	___udp_disconnect(sk, flags, false); <--- set reinit arg to false
> > > > 	...
> > > > }
> > > > 
> > > > This would have the same behavior with regards to reserved ports
> > > > following udp_abort() while preserving fields like daddr. I was
> > > > considering something like this initially, but it wasn't clear to me why
> > > > you'd want to hold onto reserved ports if the socket was destroyed as I
> > > > mentioned above. If it's important to maintain that aspect of the
> > > > behavior, then I'm OK with this approach.
> > > 
> > > This was what's in my mind, but probably we could just avoid
> > > the initialisation:
> > > 
> > > ---8<---
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index ce49ed2bd36f..d44be0020446 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -2213,21 +2213,21 @@ int __udp_disconnect(struct sock *sk, int flags)
> > >  	 */
> > >  
> > >  	sk->sk_state = TCP_CLOSE;
> > > -	inet->inet_daddr = 0;
> > > -	inet->inet_dport = 0;
> > >  	sock_rps_reset_rxhash(sk);
> > >  	sk->sk_bound_dev_if = 0;
> > > -	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> > > -		inet_reset_saddr(sk);
> > > -		if (sk->sk_prot->rehash &&
> > > -		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
> > > -			sk->sk_prot->rehash(sk);
> > > -	}
> > >  
> > >  	if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
> > >  		sk->sk_prot->unhash(sk);
> > > -		inet->inet_sport = 0;
> > > +	} else if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
> > > +		if (sk->sk_prot->rehash &&
> > > +		    (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
> > > +			inet_reset_saddr(sk);
> > > +			inet->inet_daddr = 0;
> > > +			inet->inet_dport = 0;
> > > +			sk->sk_prot->rehash(sk);
> > > +		}
> > >  	}
> > > +
> > >  	sk_dst_reset(sk);
> > >  	return 0;
> > >  }
> > > ---8<---
> > > 
> > > When we need to rehash(), of course we have to clear
> > > saddr/daddr/dport, but otherwise, we don't, even sport.
> > 
> > What about the case where you bind to a port and use connect?
> > 
> > 1. socket bind() to 0.0.0.0:10000
> > 2. socket connect() to 1.2.3.4:20000
> > 3. diag_destroy -> udp_abort()
> > 4. socket close()
> > 5. sock_release hook runs and daddr/dport have been cleared
> > 
> > compared to just connect:
> > 
> > 1. socket connect() to 1.2.3.4:20000
> > 2. diag_destroy -> udp_abort()
> > 3. socket close()
> > 4. sock_release hook runs and daddr/dport remain intact
> > 
> > It seems confusing that these two sequences would behave differently.
> 
> I assumed from "a connected UDP socket" that your use case
> is a client without explicit bind(), but if it's not, yes,
> this will not work.
> 
> 
> >  
> > > When ->unhash()ed, only thing we need to do is to clear
> > > inet_sk(sk)->inet_num.
> > > 
> > > Then, even if bind() and connect() are called later, it
> > > works because each syscall does not care about the old
> > > saddr/sport/daddr/dport and just assign a new value if
> > > inet->inet_num == 0.
> > > 
> > > This seems a bit fragile, but TCP already relies on it and
> > > it's unlikely that we change the behaviour in __inet_bind(),
> > > inet_autobind(), and inet_dgram_connect().
> > > 
> > > And all the other users of __udp_disconnect() are also
> > > AF_INET or AF_INET6, l2tp, ping, raw.
> > 
> > Taking another look at tcp_abort() to compare how it behaves, I saw the
> > following:
> > 
> > tcp_abort(sk, err)
> >   maybe send active rst
> >   tcp_done_with_error(sk, err)
> >     tcp_done(sk)
> >       tcp_set_state(sk, TCP_CLOSE)
> >         sk->sk_prot->unhash(sk) if state == TCP_CLOSE
> > 	  inet_unhash(sk)
> > 	    removes sk from lhash/ehash
> > 
> > I don't see a scenario where diag_destroy (tcp_abort) doesn't lead to
> > the socket getting immediately unhashed, even if bound to a specific
> > port (SOCK_BINDPORT_LOCK).
> 
> TCP has 3 hash tables, lhash for TCP_LISTEN, ehash for other
> states except for TCP_CLOSE, and bhash is for all sockets bound
> to a port regardless of explicit/implicit bind().
> 
> If the socket is explicitly bind()ed and has SOCK_BINDPORT_LOCK,
> inet_put_port() is not called, which does not unhash it from bhash,
> meaning the socket still reserves a port.

Ah, I see now. Thanks for the explanation and sorry for the
misunderstanding.

> 
> 
> > This seems consistent with my observations
> > in [1] where I used diag_destroy to immediately remove TCP sockets from
> > the ehash while testing socket iterators.
> 
> BPF iter only iterates over ehash and lhash that have sockets
> with real work.  This behaviour derives from /proc/net/tcp, which
> cannot catch TCP_CLOSE sockets.
> 
> Recently INET_DIAG dump (not destroy) started to support
> bhash though, see 91051f0039484.
> 
> 
> > In contrast, I remember in [2]
> > encountering this rehash behavior with UDP sockets leading to seeing the
> > same socket twice during iteration if you bpf_sock_destroy()ed the
> > socket. This wasn't inherently a problem at the time, but it was a bit
> > surprising.
> 
> Right, this happens because UDP combines bhash/ehash/lhash
> into a single hash table (technically it has 1-tuple/2-tuple/4-tuple
> hash tables but they are not separated in terms of resource
> management).
> 
> 
> > 
> > > This patch makes another inconsistency with TCP.
> > 
> > All this is to say, I'm still not sure why preserving how rehash works
> > in the case of udp_abort() is necessary, and doing so can result in some
> > unexpected behavior both in the case of socket iteration and
> > sock_release as in the example above. To keep things consistent with
> > TCP, doesn't it make more sense to just unhash instead of sometimes
> > keeping the socket in the UDP hash under certain conditions? Am I just
> > not seeing the place where a TCP socket would hold onto a port after you
> > destroy it?
> 
> As mentioned, that's bhash, and you can see TCP does
> not do "just unhash" behaviour even with tcp_abort().

Yeah, I see now what you meant by "this patch makes another
inconsistency with TCP". With that in mind, your previous suggestion
indeed seems like the best option. It would at least avoid clearing
daddr/dport with connect()ed UDP, and I don't see a way around it where
SOCK_BINDPORT_LOCK is concerned.
 
> ---8<---
> # python3
> >>> from socket import *
> >>> 
> >>> s1 = socket()
> >>> s1.bind(('', 80))
> >>> s1.listen()
> >>> 
> >>> import subprocess
> >>> subprocess.run("ss -K -t state listening sport :80".split(' '))
> Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port
> 0                     128                                         0.0.0.0:http                                      0.0.0.0:*
> CompletedProcess(args=['ss', '-K', '-t', 'state', 'listening', 'sport', ':80'], returncode=0)
> >>> subprocess.run("ss -tB".split(' '))
> Recv-Q                Send-Q                                Local Address:Port                                 Peer Address:Port
> 0                     0                                           0.0.0.0:http                                      0.0.0.0:*
> CompletedProcess(args=['ss', '-tB'], returncode=0)
> >>> 
> >>> s2 = socket()
> >>> s2.bind(s1.getsockname())
> Traceback (most recent call last):
>   File "<python-input-12>", line 1, in <module>
>     s2.bind(s1.getsockname())
>     ~~~~~~~^^^^^^^^^^^^^^^^^^
> OSError: [Errno 98] Address already in use
> ---8<---
> 
> 
> 
> > 
> > [1]: https://lore.kernel.org/netdev/20250714180919.127192-13-jordan@jrife.io/
> > [2]: https://lore.kernel.org/bpf/CADKFtnQyiz_r_vfyYfTvzi3MvNpRt62mDrNyEvp9tm82UcSFjQ@mail.gmail.com/
> > 

Jordan

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-03-05 23:18 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-03 17:01 [PATCH net-next v2 0/2] Preserve UDP socket addresses on abort Jordan Rife
2026-03-03 17:01 ` [PATCH net-next v2 1/2] udp: " Jordan Rife
2026-03-04 21:21   ` Kuniyuki Iwashima
2026-03-04 23:38     ` Willem de Bruijn
2026-03-04 23:52     ` Jordan Rife
2026-03-05  3:25       ` Kuniyuki Iwashima
2026-03-05 18:03         ` Jordan Rife
2026-03-05 19:31           ` Kuniyuki Iwashima
2026-03-05 23:18             ` Jordan Rife
2026-03-03 17:01 ` [PATCH net-next v2 2/2] selftests/bpf: Ensure dst addr/port are preserved after socket abort Jordan Rife

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox