* [PATCH bpf v2 1/4] bpf, sockmap: Annotate af_unix sock::sk_state data-races
2026-02-07 14:34 [PATCH bpf v2 0/4] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
@ 2026-02-07 14:34 ` Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 2/4] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
sock_map_sk_state_allowed() and sock_map_redirect_allowed() read the
af_unix socket's sk_state locklessly.
Use READ_ONCE(). Note that for sock_map_redirect_allowed() the change
affects not only af_unix, but all non-TCP sockets (UDP, af_vsock).
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/core/sock_map.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 5947b38e4f8b..d4f15b846ad4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -530,7 +530,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return sk->sk_state != TCP_LISTEN;
else
- return sk->sk_state == TCP_ESTABLISHED;
+ return READ_ONCE(sk->sk_state) == TCP_ESTABLISHED;
}
static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
if (sk_is_stream_unix(sk))
- return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+ return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
if (sk_is_vsock(sk) &&
(sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
return (1 << sk->sk_state) & TCPF_ESTABLISHED;
--
2.52.0
^ permalink raw reply [flat|nested] 14+ messages in thread

* [PATCH bpf v2 2/4] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
2026-02-07 14:34 [PATCH bpf v2 0/4] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 1/4] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
@ 2026-02-07 14:34 ` Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 4/4] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
3 siblings, 0 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
Instead of repeating the same (un)locking pattern, reuse
sock_map_sk_{acquire,release}(). This centralizes the code and makes it
easier to adapt sockmap to af_unix-specific locking.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/core/sock_map.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d4f15b846ad4..b6586d9590b7 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
sk = xchg(psk, NULL);
if (sk) {
sock_hold(sk);
- lock_sock(sk);
- rcu_read_lock();
+ sock_map_sk_acquire(sk);
sock_map_unref(sk, psk);
- rcu_read_unlock();
- release_sock(sk);
+ sock_map_sk_release(sk);
sock_put(sk);
}
}
@@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
*/
hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
hlist_del(&elem->node);
- lock_sock(elem->sk);
- rcu_read_lock();
+ sock_map_sk_acquire(elem->sk);
sock_map_unref(elem->sk, elem);
- rcu_read_unlock();
- release_sock(elem->sk);
+ sock_map_sk_release(elem->sk);
sock_put(elem->sk);
sock_hash_free_elem(htab, elem);
}
@@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
void (*saved_close)(struct sock *sk, long timeout);
struct sk_psock *psock;
- lock_sock(sk);
- rcu_read_lock();
+ sock_map_sk_acquire(sk);
psock = sk_psock(sk);
if (likely(psock)) {
saved_close = psock->saved_close;
@@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
psock = sk_psock_get(sk);
if (unlikely(!psock))
goto no_psock;
- rcu_read_unlock();
sk_psock_stop(psock);
- release_sock(sk);
+ sock_map_sk_release(sk);
cancel_delayed_work_sync(&psock->work);
sk_psock_put(sk, psock);
} else {
saved_close = READ_ONCE(sk->sk_prot)->close;
no_psock:
- rcu_read_unlock();
- release_sock(sk);
+ sock_map_sk_release(sk);
}
/* Make sure we do not recurse. This is a bug.
--
2.52.0
* [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-07 14:34 [PATCH bpf v2 0/4] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 1/4] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
2026-02-07 14:34 ` [PATCH bpf v2 2/4] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
@ 2026-02-07 14:34 ` Michal Luczaj
2026-02-07 22:00 ` Kuniyuki Iwashima
2026-02-07 14:34 ` [PATCH bpf v2 4/4] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
3 siblings, 1 reply; 14+ messages in thread
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
the socket is properly set up, which would include having a defined peer.
IOW, there's a window when unix_stream_bpf_update_proto() can be called on
a socket which still has unix_peer(sk) == NULL.
T0 bpf T1 connect
------ ----------
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
sock_hold(newsk)
smp_mb__after_atomic()
unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The initial idea was to move the peer assignment _before_ the sk_state
update[1], but that involved an additional memory barrier, and changing the
hot path was rejected. Then a check during the proto update was
considered[2], but a follow-up discussion[3] concluded that the root cause
is sockmap taking the wrong lock.
Thus, teach sockmap about the af_unix-specific locking: instead of the
usual lock_sock() involving sock::sk_lock, af_unix protects critical
sections under unix_state_lock() operating on unix_sock::lock.
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
This patch also happens to fix a deadlock that may occur when
bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
iter prog attempts to update a sockmap, which then ends up spinning at
sock_map_update_elem()'s bh_lock_sock():
WARNING: possible recursive locking detected
--------------------------------------------
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(slock-AF_UNIX);
lock(slock-AF_UNIX);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by test_progs/1393:
#0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
#1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
#2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
#3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
Call Trace:
dump_stack_lvl+0x5d/0x80
print_deadlock_bug.cold+0xc0/0xce
__lock_acquire+0x130f/0x2590
lock_acquire+0x14e/0x2b0
_raw_spin_lock+0x30/0x40
sock_map_update_elem+0xdb/0x1f0
bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
bpf_iter_run_prog+0x5b9/0xb00
bpf_iter_unix_seq_show+0x1f7/0x2e0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Keeping sparse annotations in sock_map_sk_{acquire,release}() required some
hackery I'm not proud of. Is there a better way?
---
net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 39 insertions(+), 8 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b6586d9590b7..0c638b1f363a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
#include <linux/list.h>
#include <linux/jhash.h>
#include <linux/sock_diag.h>
+#include <net/af_unix.h>
#include <net/udp.h>
struct bpf_stab {
@@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
}
static void sock_map_sk_acquire(struct sock *sk)
- __acquires(&sk->sk_lock.slock)
+ __acquires(sock_or_unix_lock)
{
- lock_sock(sk);
+ if (sk_is_unix(sk)) {
+ unix_state_lock(sk);
+ __release(sk); /* Silence sparse. */
+ } else {
+ lock_sock(sk);
+ }
+
rcu_read_lock();
}
static void sock_map_sk_release(struct sock *sk)
- __releases(&sk->sk_lock.slock)
+ __releases(sock_or_unix_lock)
{
rcu_read_unlock();
- release_sock(sk);
+
+ if (sk_is_unix(sk)) {
+ unix_state_unlock(sk);
+ __acquire(sk); /* Silence sparse. */
+ } else {
+ release_sock(sk);
+ }
+}
+
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+ local_bh_disable();
+
+ if (sk_is_unix(sk))
+ unix_state_lock(sk);
+ else
+ bh_lock_sock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+ if (sk_is_unix(sk))
+ unix_state_unlock(sk);
+ else
+ bh_unlock_sock(sk);
+
+ local_bh_enable();
}
static void sock_map_add_link(struct sk_psock *psock,
@@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
if (!sock_map_sk_is_suitable(sk))
return -EOPNOTSUPP;
- local_bh_disable();
- bh_lock_sock(sk);
+ sock_map_sk_acquire_fast(sk);
if (!sock_map_sk_state_allowed(sk))
ret = -EOPNOTSUPP;
else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
else
ret = sock_hash_update_common(map, key, sk, flags);
- bh_unlock_sock(sk);
- local_bh_enable();
+ sock_map_sk_release_fast(sk);
return ret;
}
--
2.52.0
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-07 14:34 ` [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock Michal Luczaj
@ 2026-02-07 22:00 ` Kuniyuki Iwashima
2026-02-08 17:14 ` Michal Luczaj
0 siblings, 1 reply; 14+ messages in thread
From: Kuniyuki Iwashima @ 2026-02-07 22:00 UTC (permalink / raw)
To: Michal Luczaj
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, netdev, bpf, linux-kernel, linux-kselftest
On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
> socket is properly set up, which would include having a defined peer. IOW,
> there's a window when unix_stream_bpf_update_proto() can be called on
> socket which still has unix_peer(sk) == NULL.
>
> T0 bpf T1 connect
> ------ ----------
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Initial idea was to move peer assignment _before_ the sk_state update[1],
> but that involved an additional memory barrier, and changing the hot path
> was rejected. Then a check during proto update was considered[2], but a
> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
> lock.
>
> Thus, teach sockmap about the af_unix-specific locking: instead of the
> usual lock_sock() involving sock::sk_lock, af_unix protects critical
> sections under unix_state_lock() operating on unix_sock::lock.
>
> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>
> This patch also happens to fix a deadlock that may occur when
> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
> iter prog attempts to update a sockmap. Which ends up spinning at
> sock_map_update_elem()'s bh_lock_sock():
Hmm.. this seems to be a more general problem for
bpf iter vs sockmap. bpf_iter_{tcp,udp}_seq_show() also
hold lock_sock(), where this patch's solution does not help.
We need to resolve this regardless of socket family.
Also, I feel lock_sock() should be outside of unix_state_lock()
since the former is usually sleepable. If we used such locking
order in the future, it would trigger ABBA deadlock and we would
have to revisit this problem.
>
> WARNING: possible recursive locking detected
> --------------------------------------------
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(slock-AF_UNIX);
> lock(slock-AF_UNIX);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
> #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
> #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
> #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
> #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> print_deadlock_bug.cold+0xc0/0xce
> __lock_acquire+0x130f/0x2590
> lock_acquire+0x14e/0x2b0
> _raw_spin_lock+0x30/0x40
> sock_map_update_elem+0xdb/0x1f0
> bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
> bpf_iter_run_prog+0x5b9/0xb00
> bpf_iter_unix_seq_show+0x1f7/0x2e0
> bpf_seq_read+0x42c/0x10d0
> vfs_read+0x171/0xb20
> ksys_read+0xff/0x200
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
> Keeping sparse annotations in sock_map_sk_{acquire,release}() required some
> hackery I'm not proud of. Is there a better way?
> ---
> net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index b6586d9590b7..0c638b1f363a 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
> #include <net/udp.h>
>
> struct bpf_stab {
> @@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
> }
>
> static void sock_map_sk_acquire(struct sock *sk)
> - __acquires(&sk->sk_lock.slock)
> + __acquires(sock_or_unix_lock)
> {
> - lock_sock(sk);
> + if (sk_is_unix(sk)) {
> + unix_state_lock(sk);
> + __release(sk); /* Silence sparse. */
> + } else {
> + lock_sock(sk);
> + }
> +
> rcu_read_lock();
> }
>
> static void sock_map_sk_release(struct sock *sk)
> - __releases(&sk->sk_lock.slock)
> + __releases(sock_or_unix_lock)
> {
> rcu_read_unlock();
> - release_sock(sk);
> +
> + if (sk_is_unix(sk)) {
> + unix_state_unlock(sk);
> + __acquire(sk); /* Silence sparse. */
> + } else {
> + release_sock(sk);
> + }
> +}
> +
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> + local_bh_disable();
> +
> + if (sk_is_unix(sk))
> + unix_state_lock(sk);
> + else
> + bh_lock_sock(sk);
> +}
> +
> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> + if (sk_is_unix(sk))
> + unix_state_unlock(sk);
> + else
> + bh_unlock_sock(sk);
> +
> + local_bh_enable();
> }
>
> static void sock_map_add_link(struct sk_psock *psock,
> @@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
> if (!sock_map_sk_is_suitable(sk))
> return -EOPNOTSUPP;
>
> - local_bh_disable();
> - bh_lock_sock(sk);
> + sock_map_sk_acquire_fast(sk);
> if (!sock_map_sk_state_allowed(sk))
> ret = -EOPNOTSUPP;
> else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
> ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
> else
> ret = sock_hash_update_common(map, key, sk, flags);
> - bh_unlock_sock(sk);
> - local_bh_enable();
> + sock_map_sk_release_fast(sk);
> return ret;
> }
>
>
> --
> 2.52.0
>
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-07 22:00 ` Kuniyuki Iwashima
@ 2026-02-08 17:14 ` Michal Luczaj
2026-02-09 20:17 ` Martin KaFai Lau
0 siblings, 1 reply; 14+ messages in thread
From: Michal Luczaj @ 2026-02-08 17:14 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, netdev, bpf, linux-kernel, linux-kselftest
On 2/7/26 23:00, Kuniyuki Iwashima wrote:
> On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <mhal@rbox.co> wrote:
>> This patch also happens to fix a deadlock that may occur when
>> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
>> iter prog attempts to update a sockmap. Which ends up spinning at
>> sock_map_update_elem()'s bh_lock_sock():
>
> Hmm.. this seems to be a more general problem for
> bpf iter vs sockmap. bpf_iter_{tcp,udp}_seq_show() also
> hold lock_sock(), where this patch's solution does not help.
> We need to resolve this regardless of socket family.
I don't see any deadlocks there. Note that I've mentioned the
lock_sock_fast() fast path was a problem, not lock_sock().
> Also, I feel lock_sock() should be outside of unix_state_lock()
> since the former is usually sleepable. If we used such locking
> order in the future, it would trigger ABBA deadlock and we would
> have to revisit this problem.
Do I understand correctly you're suggesting taking lock_sock() for af_unix
just as well?
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-08 17:14 ` Michal Luczaj
@ 2026-02-09 20:17 ` Martin KaFai Lau
2026-02-11 10:02 ` Michal Luczaj
0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2026-02-09 20:17 UTC (permalink / raw)
To: Michal Luczaj, Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
netdev, bpf, linux-kernel, linux-kselftest
On 2/8/26 9:14 AM, Michal Luczaj wrote:
> On 2/7/26 23:00, Kuniyuki Iwashima wrote:
>> On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <mhal@rbox.co> wrote:
>>> This patch also happens to fix a deadlock that may occur when
>>> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
>>> iter prog attempts to update a sockmap. Which ends up spinning at
>>> sock_map_update_elem()'s bh_lock_sock():
>>
>> Hmm.. this seems to be a more general problem for
>> bpf iter vs sockmap. bpf_iter_{tcp,udp}_seq_show() also
>> hold lock_sock(), where this patch's solution does not help.
>> We need to resolve this regardless of socket family.
>
> I don't see any deadlocks there. Note that I've mentioned lock_sock_fast()
> fast path was a problem, not lock_sock().
For the tcp/udp, I think the bpf_iter should be fine: lock_sock() in
seq_show and bh_lock_sock() in map_update. It seems redundant though.
From looking at may_update_sockmap(), other bpf progs (e.g., tc) can do
map_update also. On those paths, I am not sure why
sock_map_update_elem() does not need to check "!sock_owned_by_user(sk)".
If it is indeed an issue, it probably needs to be addressed separately.
It should also be helpful to be consistent with tcp/udp iter and use
lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-09 20:17 ` Martin KaFai Lau
@ 2026-02-11 10:02 ` Michal Luczaj
2026-02-11 13:24 ` Michal Luczaj
2026-02-23 21:43 ` Martin KaFai Lau
0 siblings, 2 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-02-11 10:02 UTC (permalink / raw)
To: Martin KaFai Lau, Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
netdev, bpf, linux-kernel, linux-kselftest
On 2/9/26 21:17, Martin KaFai Lau wrote:
> On 2/8/26 9:14 AM, Michal Luczaj wrote:
>> On 2/7/26 23:00, Kuniyuki Iwashima wrote:
>>> On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <mhal@rbox.co> wrote:
>>>> This patch also happens to fix a deadlock that may occur when
>>>> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
>>>> iter prog attempts to update a sockmap. Which ends up spinning at
>>>> sock_map_update_elem()'s bh_lock_sock():
>>>
>>> Hmm.. this seems to be a more general problem for
>>> bpf iter vs sockmap. bpf_iter_{tcp,udp}_seq_show() also
>>> hold lock_sock(), where this patch's solution does not help.
>>> We need to resolve this regardless of socket family.
>>
>> I don't see any deadlocks there. Note that I've mentioned lock_sock_fast()
>> fast path was a problem, not lock_sock().
>
> For the tcp/udp, I think the bpf_iter should be fine: lock_sock() in
> seq_show and bh_lock_sock() in map_update. It seems redundant though.
I wasn't sure what exactly you suspect of being redundant, so I did some
digging:
lock_sock() in tcp/udp iter is expected (among others?) by kfunc
bpf_sock_destroy(). Relevant commit 4ddbcb886268 ("bpf: Add
bpf_sock_destroy kfunc"),
https://lore.kernel.org/all/20230519225157.760788-8-aditi.ghag@isovalent.com/
In short: lock must be taken for synchronization of proto::diag_destroy().
Reasons for bh_lock_sock() during bpf sockmap update are explained in
commit 0126240f448d ("bpf: sockmap: Allow update from BPF"),
https://lore.kernel.org/bpf/20200821102948.21918-6-lmb@cloudflare.com/
In short: socket shouldn't be allowed to change its state during the
update. BH lock because bpf can't sleep.
> From looking at may_update_sockmap(), other bpf progs (e.g., tc) can do
> map_update also. On those paths, I am not sure why
> sock_map_update_elem() does not need to check "!sock_owned_by_user(sk)".
> If it is indeed an issue, it probably needs to be addressed separately.
Since sockmap update can happen in a tracing prog, can you really expect a
socket to be always owned?
> It should also be helpful to be consistent with tcp/udp iter and use
> lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
OK, I'll tweak that in v3.
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-11 10:02 ` Michal Luczaj
@ 2026-02-11 13:24 ` Michal Luczaj
2026-02-23 21:43 ` Martin KaFai Lau
1 sibling, 0 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-02-11 13:24 UTC (permalink / raw)
To: Martin KaFai Lau, Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
netdev, bpf, linux-kernel, linux-kselftest
Replying to correct myself:
> Since sockmap update can happen in a tracing prog, can you really expect a
> socket to be always owned?
Ha, but I'm wrong! Of the tracers, only BPF_TRACE_ITER allows calling
sock_map_update_elem(). And those are supposed to take lock_sock(). Except
some don't, e.g. sock_map_seq_show(). Or is there something else I'm missing?
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-11 10:02 ` Michal Luczaj
2026-02-11 13:24 ` Michal Luczaj
@ 2026-02-23 21:43 ` Martin KaFai Lau
2026-02-24 15:28 ` Michal Luczaj
1 sibling, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2026-02-23 21:43 UTC (permalink / raw)
To: Michal Luczaj, Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
netdev, bpf, linux-kernel, linux-kselftest
On 2/11/26 2:02 AM, Michal Luczaj wrote:
>> It should also be helpful to be consistent with tcp/udp iter and use
>> lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
> OK, I'll tweak that in v3.
Hi, Michal. Are you planning to send v3 soon? I don't think the
sock_owned_by_user for the non-tracing prog needs to be addressed in the
same set.
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-23 21:43 ` Martin KaFai Lau
@ 2026-02-24 15:28 ` Michal Luczaj
2026-03-03 1:51 ` Martin KaFai Lau
0 siblings, 1 reply; 14+ messages in thread
From: Michal Luczaj @ 2026-02-24 15:28 UTC (permalink / raw)
To: Martin KaFai Lau, Kuniyuki Iwashima
Cc: John Fastabend, Jakub Sitnicki, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Daniel Borkmann,
Willem de Bruijn, Cong Wang, Alexei Starovoitov, Yonghong Song,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
netdev, bpf, linux-kernel, linux-kselftest
On 2/23/26 22:43, Martin KaFai Lau wrote:
>
> On 2/11/26 2:02 AM, Michal Luczaj wrote:
>>> It should also be helpful to be consistent with tcp/udp iter and use
>>> lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
>> OK, I'll tweak that in v3.
>
> Hi, Michal. Are you planning to send v3 soon? I don't think the
> sock_owned_by_user for the non-tracing prog needs to be addressed in the
> same set.
Yes, I'm working on it. Sorry for the delay, I'm taking my sweet time to
come up with a selftest.
I think I can neatly fit the sock_owned_by_user()-related stuff in this
series, but let me know if you'd rather have it separately. Whichever way,
I don't mind.
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-02-24 15:28 ` Michal Luczaj
@ 2026-03-03 1:51 ` Martin KaFai Lau
2026-03-05 23:35 ` Michal Luczaj
0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2026-03-03 1:51 UTC (permalink / raw)
To: Michal Luczaj
Cc: Kuniyuki Iwashima, John Fastabend, Jakub Sitnicki,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf,
linux-kernel, linux-kselftest
On 2/24/26 7:28 AM, Michal Luczaj wrote:
> On 2/23/26 22:43, Martin KaFai Lau wrote:
>>
>> On 2/11/26 2:02 AM, Michal Luczaj wrote:
>>>> It should also be helpful to be consistent with tcp/udp iter and use
>>>> lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
>>> OK, I'll tweak that in v3.
>>
>> Hi, Michal. Are you planning to send v3 soon? I don't think the
>> sock_owned_by_user for the non-tracing prog needs to be addressed in the
>> same set.
>
> Yes, I'm working on it. Sorry for the delay, I'm taking my sweet time to
> come up with a selftest.
>
> I think I can neatly fit the sock_owned_by_user()-related stuff in this
> series, but let me know if you'd rather have it separately. Whichever way,
> I don't mind.
I think sock_owned_by_user() is not related to the AF_UNIX's sockmap
fix. If it is not, it's better to separate it. I think one thing at a
time is easier to review.
If they are related somewhat or it's easier to review them together, see
if it makes sense to structure them in a way such that the AF_UNIX's
sockmap-related fixes and tests are at the beginning of the patch set so
that they can be applied first if others need more discussion.
Regarding the "Keeping sparse annotations in sock_map_sk_{acquire,release}()
required some hackery..." note: maybe just remove the annotations?
[ btw, from commit 5b63d0ae94cc it looks like sparse support is removed
and the checking depends solely on clang now (?). If that is the case,
what does clang do with the "sock_or_unix_lock"? ]
* Re: [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock
2026-03-03 1:51 ` Martin KaFai Lau
@ 2026-03-05 23:35 ` Michal Luczaj
0 siblings, 0 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:35 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Kuniyuki Iwashima, John Fastabend, Jakub Sitnicki,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf,
linux-kernel, linux-kselftest
On 3/3/26 02:51, Martin KaFai Lau wrote:
> On 2/24/26 7:28 AM, Michal Luczaj wrote:
>> On 2/23/26 22:43, Martin KaFai Lau wrote:
>>>
>>> On 2/11/26 2:02 AM, Michal Luczaj wrote:
>>>>> It would also be helpful to be consistent with the tcp/udp iters and
>>>>> use lock_sock() instead of lock_sock_fast() in bpf_iter_unix_seq_show().
>>>> OK, I'll tweak that in v3.
>>>
>>> Hi, Michal. Are you planning to send v3 soon? I don't think the
>>> sock_owned_by_user for the non-tracing prog needs to be addressed in the
>>> same set.
>>
>> Yes, I'm working on it. Sorry for the delay, I'm taking my sweet time to
>> come up with a selftest.
>>
>> I think I can neatly fit the sock_owned_by_user()-related stuff in this
>> series, but let me know if you'd rather have it separately. Whichever way,
>> I don't mind.
>
> I think sock_owned_by_user() is not related to the AF_UNIX sockmap
> fix. If it isn't, it's better to handle it separately; one thing at a
> time is easier to review.
All right, sending out v3 to reduce the scope:
https://lore.kernel.org/bpf/20260306-unix-proto-update-null-ptr-deref-v3-0-2f0c7410c523@rbox.co/
> If they are related somewhat or it's easier to review them together, see
> if it makes sense to structure them in a way such that the AF_UNIX's
> sockmap-related fixes and tests are at the beginning of the patch set so
> that they can be applied first if others need more discussion.
I'd say those changes are related, but since things are getting unclear, I
won't insist on keeping them together.
So. Simply dropping the BH locks and introducing sock_owned_by_user() in
sock_map_update_elem() is not enough. For example, sock_map_seq_show() and
sock_hash_seq_show() must be adapted before they can take the sock lock
(they currently run under RCU).
But there's a deeper problem: bpf iters let bpf code call a sockmap update
while the sk is halfway through sock_map_close(). This breaks the sockmap
contract, as the update can happen after sock_map_remove_links() but before
saved_close(). The sockmap then ends up holding a closed sk that would
normally have been removed.
For sockmap/sockhash iters this condition can be handled: before running
the bpf tracer prog, check whether the sk is still linked to the map. Since
those are sockmap/sockhash iters, no link means the sk should be skipped.
But I'm not sure about other iters, plus there's the case you mentioned of
bpf tc progs that can call a sockmap update with the sk unowned. So I'll
stop here, worried this may be a doomed approach :)
Long story short: it feels like taking the spinlocks in
sock_map_update_elem() (without unconditionally checking
sock_owned_by_user()) is the best approach, just as it is currently
implemented (plus the af_unix special casing). Some sks will come owned,
some won't. What needs to be added after taking the spinlock(s) is some
kind of is-sk-being-closed check. Or am I missing something?
> Regarding the "Keeping sparse annotations in
> sock_map_sk_{acquire,release}() required some
> hackery...". Maybe just remove the annotations?
> [ btw, from commit 5b63d0ae94cc it looks like sparse support is removed
> and the checking depends solely on clang now (?). If that is the case,
> what does clang do with the "sock_or_unix_lock"? ]
Oh, I haven't seen that one, thanks. Will take a look. For now, I've just
dropped the annotations.
* [PATCH bpf v2 4/4] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
2026-02-07 14:34 [PATCH bpf v2 0/4] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
` (2 preceding siblings ...)
2026-02-07 14:34 ` [PATCH bpf v2 3/4] bpf, sockmap: Adapt for the af_unix-specific lock Michal Luczaj
@ 2026-02-07 14:34 ` Michal Luczaj
3 siblings, 0 replies; 14+ messages in thread
From: Michal Luczaj @ 2026-02-07 14:34 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Kuniyuki Iwashima,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Willem de Bruijn, Cong Wang,
Alexei Starovoitov, Yonghong Song, Andrii Nakryiko,
Eduard Zingerman, Martin KaFai Lau, Song Liu, Yonghong Song,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
Updating a sockmap from a unix iterator prog may lead to a deadlock.
Piggyback on the original selftest.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
index fea275df9e22..a2652c8c3616 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
@@ -7,6 +7,13 @@
char _license[] SEC("license") = "GPL";
+SEC(".maps") struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, __u64);
+} sockmap;
+
static long sock_i_ino(const struct sock *sk)
{
const struct socket *sk_socket = sk->sk_socket;
@@ -76,5 +83,8 @@ int dump_unix(struct bpf_iter__unix *ctx)
BPF_SEQ_PRINTF(seq, "\n");
+ /* Test for deadlock. */
+ bpf_map_update_elem(&sockmap, &(int){0}, sk, 0);
+
return 0;
}
--
2.52.0