From: Kuniyuki Iwashima <kuniyu@google.com>
To: jrife@google.com
Cc: andrii@kernel.org, bpf@vger.kernel.org, daniel@iogearbox.net,
edumazet@google.com, kuba@kernel.org, kuniyu@google.com,
martin.lau@linux.dev, netdev@vger.kernel.org, sdf@fomichev.me,
willemdebruijn.kernel@gmail.com, yusuke.suzuki@isovalent.com
Subject: Re: [PATCH net-next v2 1/2] udp: Preserve UDP socket addresses on abort
Date: Thu, 5 Mar 2026 03:25:21 +0000 [thread overview]
Message-ID: <20260305032524.2176479-1-kuniyu@google.com> (raw)
In-Reply-To: <skxd6uimyxzi4x5nfmz24ez3d45qiubdnwkqawtbomjfzy5qrs@yb724ykvvkec>
From: Jordan Rife <jrife@google.com>
Date: Wed, 4 Mar 2026 23:52:22 +0000
> On Wed, Mar 04, 2026 at 09:21:03PM +0000, Kuniyuki Iwashima wrote:
> > From: Jordan Rife <jrife@google.com>
> > Date: Tue, 3 Mar 2026 17:01:02 +0000
> > > Currently, udp_abort() invokes __udp_disconnect() which clears out
> > > socket fields like inet_daddr and inet_dport. This makes fields like
> > > dst_ip4, dst_ip6, and dst_port useless inside cgroup/sock_release BPF
> > > hooks following an abort on a connected UDP socket, since they'll always
> > > equal zero. This differs from the behavior for TCP sockets where a
> > > cgroup/sock_release hook will see the address and port that the socket
> > > was connected to at the time it was aborted. This causes issues in
> > > Cilium where a sock_release hook is used to perform map maintenance when
> > > sockets are released, since Cilium actively destroys connected UDP
> > > sockets in some cases.
> >
> > Just curious why Cilium does not unregister the socket just
> > after aborting it ?
> >
> > At least Cilium should know the key at that moment.
>
> Originally, socket aborts in Cilium were handled with sock_diag (and
> still are on older kernels), so you'd need to do some map maintenance in
> userspace directly before or after aborting the socket. I suppose it was
> just cleaner to put common cleanup code in a sock_release hook.
>
> > >
> > > Make the behavior consistent between TCP and UDP sockets by not clearing
> >
> > This patch makes another inconsistency with TCP.
> >
> > As you mentioned in the cover letter, before the patch, even if
> > aborted, a bound socket still reserves a port, but it does not now.
> >
> > This might be surprising for non-reuseaddr/port sockets.
>
> Yeah, true. I'm curious if this behavior is something that is useful at
> all or simply an accident of the current implementation? If I'm
> aborting/destroying a socket, why should it hold onto bound ports until
> the socket is closed? I wasn't quite sure about this.
Digging into the first (long) thread of SOCK_DESTROY (TCP),
the motivation was to kill sockets bound to an IP address
that Android/laptop removes when it switches networks.
https://lore.kernel.org/netdev/1447811024-8553-1-git-send-email-lorenzo@google.com/
In that sense, holding a port seems unnecessary.
But there are a few concerns like
* someone could use the feature for server and reuse port
too quickly.
https://lore.kernel.org/netdev/1447879416.562854.443622857.62708268@webmail.messagingengine.com/
* it's a bit worrisome if this feature is viewed as a way for
resouce management.
https://lore.kernel.org/netdev/CALx6S35Q_pc-vBNpA0hipn5K-vSmPDsV-V=D13sODESNO7Htdw@mail.gmail.com/
Considering these, it seems fair to only notify an error
via epoll or blocked syscall and let the application to
close() sockets.
In a thread, the author mentioned this :
---8<---
Its purpose is to interrupt
blocked userspace socket calls, not to release resources.
---8<---
https://lore.kernel.org/netdev/CAKD1Yr0Gpr6Ex2i1TXEWyETn48P4g5vBvFZ9=_g+Lz_7-PqBDQ@mail.gmail.com/
Also, there was an assumption that a sane application will
close() sockets after receiving errors.
https://lore.kernel.org/netdev/1447972381.22599.278.camel@edumazet-glaptop2.roam.corp.google.com/
And yes, this does not apply to UDP, and sadly UDP patch's
thread has no such context nor discussion.
https://lore.kernel.org/netdev/?q=udp_abort&o=-1
But anyway, now I think we can solve your problem more
easily, see below.
>
> > > out these fields in udp_abort(). Instead, just unhash the socket, mark
> > > its state as TCP_CLOSE, and preserve the state of the other fields.
> > >
> > > Fixes: f5836749c9c0 ("bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook")
> > > Reported-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
> > > Signed-off-by: Jordan Rife <jrife@google.com>
> > > ---
> > > net/ipv4/udp.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index b96e47f1c8a2..01a0a0fbcd56 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -3247,7 +3247,9 @@ int udp_abort(struct sock *sk, int err)
> > >
> > > sk->sk_err = err;
> > > sk_error_report(sk);
> > > - __udp_disconnect(sk, 0);
> >
> > How about passing another flag and avoid unwanted init only for
> > bpf iterator ? This will need another layer of __udp_disconnect()
> > like ____udp_disconnect() (I'm not good at naming), or must make
> > sure the flag is never used in uAPI.
>
> To make sure we're on the same page about what you're describing, I'm
> imagining:
>
> static int ___udp_disconnect(struct sock *sk, int flags, bool reinit)
> {
> same as before, but if reinit == false then avoid reinitializing
> fields like daddr/dport while holding onto reserved ports
> }
>
> int __udp_disconnect(struct sock *sk, int flags)
> {
> ___udp_disconnect(sk, flags, true);
> }
>
> ...
>
> int udp_abort(struct sock *sk, int err)
> {
> ...
> ___udp_disconnect(sk, flags, false); <--- set reinit arg to false
> ...
> }
>
> This would have the same behavior with regards to reserved ports
> following udp_abort() while preserving fields like daddr. I was
> considering something like this initially, but it wasn't clear to me why
> you'd want to hold onto reserved ports if the socket was destroyed as I
> mentioned above. If it's important to maintain that aspect of the
> behavior, then I'm OK with this approach.
This was what's in my mind, but probably we could just avoid
the initialisation:
---8<---
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ce49ed2bd36f..d44be0020446 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2213,21 +2213,21 @@ int __udp_disconnect(struct sock *sk, int flags)
*/
sk->sk_state = TCP_CLOSE;
- inet->inet_daddr = 0;
- inet->inet_dport = 0;
sock_rps_reset_rxhash(sk);
sk->sk_bound_dev_if = 0;
- if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
- inet_reset_saddr(sk);
- if (sk->sk_prot->rehash &&
- (sk->sk_userlocks & SOCK_BINDPORT_LOCK))
- sk->sk_prot->rehash(sk);
- }
if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
sk->sk_prot->unhash(sk);
- inet->inet_sport = 0;
+ } else if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) {
+ if (sk->sk_prot->rehash &&
+ (sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
+ inet_reset_saddr(sk);
+ inet->inet_daddr = 0;
+ inet->inet_dport = 0;
+ sk->sk_prot->rehash(sk);
+ }
}
+
sk_dst_reset(sk);
return 0;
}
---8<---
When we need to rehash(), of course we have to clear
saddr/daddr/dport, but otherwise, we don't, even sport.
When ->unhash()ed, only thing we need to do is to clear
inet_sk(sk)->inet_num.
Then, even if bind() and connect() are called later, it
works because each syscall does not care about the old
saddr/sport/daddr/dport and just assign a new value if
inet->inet_num == 0.
This seems a bit fragile, but TCP already relies on it and
it's unlikely that we change the behaviour in __inet_bind(),
inet_autobind(), and inet_dgram_connect().
And all the other users of __udp_disconnect() are also
AF_INET or AF_INET6, l2tp, ping, raw.
>
> >
> >
> >
> > > + sk->sk_state = TCP_CLOSE;
> > > + sk->sk_prot->unhash(sk);
> > > + sk_dst_reset(sk);
> > >
> > > out:
> > > if (!has_current_bpf_ctx())
> > > --
> > > 2.53.0.473.g4a7958ca14-goog
> > >
>
> Thanks!
>
> Jordan
next prev parent reply other threads:[~2026-03-05 3:25 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-03 17:01 [PATCH net-next v2 0/2] Preserve UDP socket addresses on abort Jordan Rife
2026-03-03 17:01 ` [PATCH net-next v2 1/2] udp: " Jordan Rife
2026-03-04 21:21 ` Kuniyuki Iwashima
2026-03-04 23:38 ` Willem de Bruijn
2026-03-04 23:52 ` Jordan Rife
2026-03-05 3:25 ` Kuniyuki Iwashima [this message]
2026-03-05 18:03 ` Jordan Rife
2026-03-05 19:31 ` Kuniyuki Iwashima
2026-03-05 23:18 ` Jordan Rife
2026-03-03 17:01 ` [PATCH net-next v2 2/2] selftests/bpf: Ensure dst addr/port are preserved after socket abort Jordan Rife
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260305032524.2176479-1-kuniyu@google.com \
--to=kuniyu@google.com \
--cc=andrii@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=edumazet@google.com \
--cc=jrife@google.com \
--cc=kuba@kernel.org \
--cc=martin.lau@linux.dev \
--cc=netdev@vger.kernel.org \
--cc=sdf@fomichev.me \
--cc=willemdebruijn.kernel@gmail.com \
--cc=yusuke.suzuki@isovalent.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox