From: Martin KaFai Lau <martin.lau@linux.dev>
To: Michal Luczaj <mhal@rbox.co>, Jiayuan Chen <mrpre@163.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
Jakub Sitnicki <jakub@cloudflare.com>,
Eric Dumazet <edumazet@google.com>,
Kuniyuki Iwashima <kuniyu@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Willem de Bruijn <willemb@google.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Simon Horman <horms@kernel.org>,
Yonghong Song <yhs@fb.com>, Andrii Nakryiko <andrii@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Shuah Khan <shuah@kernel.org>,
Cong Wang <cong.wang@bytedance.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
Date: Tue, 10 Mar 2026 15:20:34 -0700 [thread overview]
Message-ID: <b21fac4e-891b-481c-9734-e95290e49097@linux.dev> (raw)
In-Reply-To: <b9a956dd-ebb2-47ca-8b49-ccbba6ff9b26@rbox.co>
On 3/6/26 6:09 AM, Michal Luczaj wrote:
> On 3/6/26 06:01, Jiayuan Chen wrote:
>>
>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>> socket is properly set up, which would include having a defined peer. IOW,
>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>> socket which still has unix_peer(sk) == NULL.
>>>
>>> T0 bpf T1 connect
>>> ------ ----------
>>>
>>> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>> sock_map_sk_state_allowed(sk)
>>> ...
>>> sk_pair = unix_peer(sk)
>>> sock_hold(sk_pair)
>>> sock_hold(newsk)
>>> smp_mb__after_atomic()
>>> unix_peer(sk) = newsk
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>> Call Trace:
>>> sock_map_link+0x564/0x8b0
>>> sock_map_update_common+0x6e/0x340
>>> sock_map_update_elem_sys+0x17d/0x240
>>> __sys_bpf+0x26db/0x3250
>>> __x64_sys_bpf+0x21/0x30
>>> do_syscall_64+0x6b/0x3a0
>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>> but that involved an additional memory barrier, and changing the hot path
>>> was rejected. Then a check during proto update was considered[2], but a
>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>> lock. Or, more specifically, an insufficient lock[4].
>>>
>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>
>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>
>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>> ---
>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/list.h>
>>> #include <linux/jhash.h>
>>> #include <linux/sock_diag.h>
>>> +#include <net/af_unix.h>
>>> #include <net/udp.h>
>>>
>>> struct bpf_stab {
>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>> }
>>>
>>> static void sock_map_sk_acquire(struct sock *sk)
>>> - __acquires(&sk->sk_lock.slock)
>>> {
>>> lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +
>>> rcu_read_lock();
>>> }
>>>
>>
>> This introduces a new ordering constraint: lock_sock() before
>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>> lock_sock() inside unix_state_lock() in the future would create an
>> ABBA deadlock, which is exactly why the order was settled this way. However,
>> the thread did not reach a conclusion on whether that constraint should be
>> documented in the code.
>>
>> Since there is nothing enforcing this ordering mechanically, a brief comment
>> at sock_map_sk_acquire() would help future readers avoid introducing the
>> inverse ordering.
>
> Sure, will do.
>
>>> static void sock_map_sk_release(struct sock *sk)
>>> - __releases(&sk->sk_lock.slock)
>>> {
>>> rcu_read_unlock();
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_unlock(sk);
>>> +
>>> release_sock(sk);
>>> }
>>>
>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>> +{
>>> + local_bh_disable();
>>> + bh_lock_sock(sk);
>>> +
>>> + if (sk_is_unix(sk))
>>> + unix_state_lock(sk);
>>> +}
>>> +
>>
>>
>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>> sockets took only unix_state_lock() in the fast path, skipping
>> bh_lock_sock() entirely:
>>
>> /* v2 */
>> if (sk_is_unix(sk))
>> unix_state_lock(sk);
>> else
>> bh_lock_sock(sk);
>>
>> v3 takes both for AF_UNIX sockets.
>>
>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>> actually needs protection here — sk_state and unix_peer() — lives under
>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>> anything useful for AF_UNIX sockets in this path?
>
> Yeah, good point. I think I was just trying to keep the _fast variant
> aligned with the sleepy one. Which I agree might be unnecessary.
I hope the common use case should not be calling
bpf_map_update_elem(sockmap) only.
Beside, from looking at the may_update_sockmap(), I don't know if it is
even doable (or useful) to bpf_map_update_elem(unix_sk) in
tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
when the bpf_map_update_elem(sockmap) support was added iirc.
The only path left is bpf_iter, which I believe was the primary use case
when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
where lock_sock() has already been done. It is more for
reading-correctness though. This just came to my mind.
has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
been using it to conditionally take lock_sock() or not. [ I would still
keep patch 3 though. ]
[1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>
> In a parallel thread I've asked Kuniyuki if it might be better to
> completely drop patch 2/5, which would change how we interact with
> sock_map_close(). Lets see how it goes.
>
If patch 2 is dropped, lock_sock() is always needed for unix_sk?
next prev parent reply other threads:[~2026-03-10 22:20 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
2026-03-06 5:30 ` Kuniyuki Iwashima
2026-03-06 6:24 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock^sk_state data-races Jiayuan Chen
2026-03-18 17:05 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
2026-03-06 5:44 ` Kuniyuki Iwashima
2026-03-06 14:05 ` Michal Luczaj
2026-03-11 4:17 ` Kuniyuki Iwashima
2026-03-11 4:57 ` Kuniyuki Iwashima
2026-03-15 23:58 ` Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
2026-03-06 5:47 ` Kuniyuki Iwashima
2026-03-06 6:04 ` Jiayuan Chen
2026-03-06 6:15 ` Jiayuan Chen
2026-03-06 14:06 ` Michal Luczaj
2026-03-06 14:31 ` Jiayuan Chen
2026-03-06 14:33 ` Jiayuan Chen
2026-03-05 23:30 ` [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
2026-03-06 14:34 ` Jiayuan Chen
2026-03-05 23:30 ` [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock Michal Luczaj
2026-03-06 5:01 ` Jiayuan Chen
2026-03-06 14:09 ` Michal Luczaj
2026-03-10 22:20 ` Martin KaFai Lau [this message]
2026-03-15 23:58 ` Michal Luczaj
2026-03-26 6:26 ` Martin KaFai Lau
2026-03-30 23:03 ` Michal Luczaj
2026-03-30 23:27 ` Kuniyuki Iwashima
2026-03-31 22:43 ` Michal Luczaj
2026-03-31 23:18 ` Kuniyuki Iwashima
2026-04-01 19:18 ` Michal Luczaj
2026-03-31 0:20 ` Martin KaFai Lau
2026-03-31 22:43 ` Michal Luczaj
2026-04-02 1:34 ` Martin KaFai Lau
2026-04-14 14:19 ` Michal Luczaj
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b21fac4e-891b-481c-9734-e95290e49097@linux.dev \
--to=martin.lau@linux.dev \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cong.wang@bytedance.com \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=eddyz87@gmail.com \
--cc=edumazet@google.com \
--cc=haoluo@google.com \
--cc=horms@kernel.org \
--cc=jakub@cloudflare.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=mhal@rbox.co \
--cc=mrpre@163.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sdf@fomichev.me \
--cc=shuah@kernel.org \
--cc=song@kernel.org \
--cc=willemb@google.com \
--cc=yhs@fb.com \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.