public inbox for netdev@vger.kernel.org
* [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
@ 2026-03-05 23:30 Michal Luczaj
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

BPF_MAP_UPDATE_ELEM races with unix_stream_connect(): by the time
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
  sock_map_link+0x564/0x8b0
  sock_map_update_common+0x6e/0x340
  sock_map_update_elem_sys+0x17d/0x240
  __sys_bpf+0x26db/0x3250
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0x6b/0x3a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This series fixes the null-ptr-deref by teaching sockmap about the
af_unix-specific locking. As a side effect, it also fixes a deadlock.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v3:
- Drop sparse annotations [Martin]
- Keep lock_sock() along with unix_state_lock() [Kuniyuki]
- Unify BPF iter af_unix locking [Kuniyuki, Martin]
- Link to v2: https://lore.kernel.org/r/20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co

Changes in v2:
- Instead of probing for unix peer, make sockmap take the right lock [Martin]
- Annotate data races [Kuniyuki, Martin]
- Extend bpf unix iter selftest to attempt a deadlock
- Link to v1: https://lore.kernel.org/r/20260129-unix-proto-update-null-ptr-deref-v1-1-e1daeb7012fd@rbox.co

---
Michal Luczaj (5):
      bpf, sockmap: Annotate af_unix sock::sk_state data-races
      bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
      bpf, sockmap: Fix af_unix iter deadlock
      selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
      bpf, sockmap: Adapt for af_unix-specific lock

 net/core/sock_map.c                               | 60 ++++++++++++++---------
 net/unix/af_unix.c                                |  7 ++-
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++
 3 files changed, 51 insertions(+), 26 deletions(-)
---
base-commit: 56145d237385ca0e7ca9ff7b226aaf2eb8ef368b
change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8

Best regards,
-- 
Michal Luczaj <mhal@rbox.co>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
  2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
@ 2026-03-05 23:30 ` Michal Luczaj
  2026-03-06  5:30   ` Kuniyuki Iwashima
                     ` (2 more replies)
  2026-03-05 23:30 ` [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
socket sk_state locklessly.

Use READ_ONCE(). Note that the sock_map_redirect_allowed() change affects
not only af_unix, but all non-TCP sockets (UDP, af_vsock).

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b0e96337a269..02a68be3002a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -530,7 +530,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return sk->sk_state != TCP_LISTEN;
 	else
-		return sk->sk_state == TCP_ESTABLISHED;
+		return READ_ONCE(sk->sk_state) == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	if (sk_is_vsock(sk) &&
 	    (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
 		return (1 << sk->sk_state) & TCPF_ESTABLISHED;

-- 
2.52.0



* [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
@ 2026-03-05 23:30 ` Michal Luczaj
  2026-03-06  5:44   ` Kuniyuki Iwashima
  2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

Instead of repeating the same (un)locking pattern, reuse
sock_map_sk_{acquire,release}(). This centralizes the code and makes it
easier to adapt sockmap to af_unix-specific locking.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 02a68be3002a..7ba6a7f24ccd 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
 		sk = xchg(psk, NULL);
 		if (sk) {
 			sock_hold(sk);
-			lock_sock(sk);
-			rcu_read_lock();
+			sock_map_sk_acquire(sk);
 			sock_map_unref(sk, psk);
-			rcu_read_unlock();
-			release_sock(sk);
+			sock_map_sk_release(sk);
 			sock_put(sk);
 		}
 	}
@@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
 		 */
 		hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
 			hlist_del(&elem->node);
-			lock_sock(elem->sk);
-			rcu_read_lock();
+			sock_map_sk_acquire(elem->sk);
 			sock_map_unref(elem->sk, elem);
-			rcu_read_unlock();
-			release_sock(elem->sk);
+			sock_map_sk_release(elem->sk);
 			sock_put(elem->sk);
 			sock_hash_free_elem(htab, elem);
 		}
@@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
 	void (*saved_close)(struct sock *sk, long timeout);
 	struct sk_psock *psock;
 
-	lock_sock(sk);
-	rcu_read_lock();
+	sock_map_sk_acquire(sk);
 	psock = sk_psock(sk);
 	if (likely(psock)) {
 		saved_close = psock->saved_close;
@@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
 		psock = sk_psock_get(sk);
 		if (unlikely(!psock))
 			goto no_psock;
-		rcu_read_unlock();
 		sk_psock_stop(psock);
-		release_sock(sk);
+		sock_map_sk_release(sk);
 		cancel_delayed_work_sync(&psock->work);
 		sk_psock_put(sk, psock);
 	} else {
 		saved_close = READ_ONCE(sk->sk_prot)->close;
 no_psock:
-		rcu_read_unlock();
-		release_sock(sk);
+		sock_map_sk_release(sk);
 	}
 
 	/* Make sure we do not recurse. This is a bug.

-- 
2.52.0



* [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
  2026-03-05 23:30 ` [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
@ 2026-03-05 23:30 ` Michal Luczaj
  2026-03-06  5:47   ` Kuniyuki Iwashima
                     ` (2 more replies)
  2026-03-05 23:30 ` [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
  2026-03-05 23:30 ` [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock Michal Luczaj
  4 siblings, 3 replies; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
path and the iter prog attempts to update a sockmap, which then ends up
spinning on sock_map_update_elem()'s bh_lock_sock():

WARNING: possible recursive locking detected
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0

but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_UNIX);
  lock(slock-AF_UNIX);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by test_progs/1393:
 #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
 #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
 #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
 #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00

Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_deadlock_bug.cold+0xc0/0xce
 __lock_acquire+0x130f/0x2590
 lock_acquire+0x14e/0x2b0
 _raw_spin_lock+0x30/0x40
 sock_map_update_elem+0xdb/0x1f0
 bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
 bpf_iter_run_prog+0x5b9/0xb00
 bpf_iter_unix_seq_show+0x1f7/0x2e0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/af_unix.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3756a93dc63a..3d2cfb4ecbcd 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3729,15 +3729,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
-	bool slow;
 	int ret;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
-	slow = lock_sock_fast(sk);
+	lock_sock(sk);
 
-	if (unlikely(sk_unhashed(sk))) {
+	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
 		ret = SEQ_SKIP;
 		goto unlock;
 	}
@@ -3747,7 +3746,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 	prog = bpf_iter_get_info(&meta, false);
 	ret = unix_prog_seq_show(prog, &meta, v, uid);
 unlock:
-	unlock_sock_fast(sk, slow);
+	release_sock(sk);
 	return ret;
 }
 

-- 
2.52.0



* [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
  2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
                   ` (2 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
@ 2026-03-05 23:30 ` Michal Luczaj
  2026-03-06 14:34   ` Jiayuan Chen
  2026-03-05 23:30 ` [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock Michal Luczaj
  4 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

Updating a sockmap from an af_unix iterator prog may lead to a deadlock.
Piggyback on the original selftest to cover this case.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
index fea275df9e22..a2652c8c3616 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
@@ -7,6 +7,13 @@
 
 char _license[] SEC("license") = "GPL";
 
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} sockmap SEC(".maps");
+
 static long sock_i_ino(const struct sock *sk)
 {
 	const struct socket *sk_socket = sk->sk_socket;
@@ -76,5 +83,8 @@ int dump_unix(struct bpf_iter__unix *ctx)
 
 	BPF_SEQ_PRINTF(seq, "\n");
 
+	/* Test for deadlock. */
+	bpf_map_update_elem(&sockmap, &(int){0}, sk, 0);
+
 	return 0;
 }

-- 
2.52.0



* [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
                   ` (3 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
@ 2026-03-05 23:30 ` Michal Luczaj
  2026-03-06  5:01   ` Jiayuan Chen
  4 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-05 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe the
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on a
socket which still has unix_peer(sk) == NULL.

          T0 bpf                            T1 connect
          ------                            ----------

                                WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                sock_hold(newsk)
                                smp_mb__after_atomic()
                                unix_peer(sk) = newsk

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
  sock_map_link+0x564/0x8b0
  sock_map_update_common+0x6e/0x340
  sock_map_update_elem_sys+0x17d/0x240
  __sys_bpf+0x26db/0x3250
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0x6b/0x3a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The initial idea was to move the peer assignment _before_ the sk_state
update[1], but that involved an additional memory barrier, and changing the
hot path was rejected. Then a check during the proto update was
considered[2], but a follow-up discussion[3] concluded that the root cause
is sockmap taking the wrong lock. Or, more specifically, an insufficient
lock[4].

Thus, teach sockmap about the af_unix-specific locking: af_unix protects
its critical sections with unix_state_lock(), which operates on
unix_sock::lock.

[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 7ba6a7f24ccd..6109fbe6f99c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/jhash.h>
 #include <linux/sock_diag.h>
+#include <net/af_unix.h>
 #include <net/udp.h>
 
 struct bpf_stab {
@@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 }
 
 static void sock_map_sk_acquire(struct sock *sk)
-	__acquires(&sk->sk_lock.slock)
 {
 	lock_sock(sk);
+
+	if (sk_is_unix(sk))
+		unix_state_lock(sk);
+
 	rcu_read_lock();
 }
 
 static void sock_map_sk_release(struct sock *sk)
-	__releases(&sk->sk_lock.slock)
 {
 	rcu_read_unlock();
+
+	if (sk_is_unix(sk))
+		unix_state_unlock(sk);
+
 	release_sock(sk);
 }
 
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+	local_bh_disable();
+	bh_lock_sock(sk);
+
+	if (sk_is_unix(sk))
+		unix_state_lock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+	if (sk_is_unix(sk))
+		unix_state_unlock(sk);
+
+	bh_unlock_sock(sk);
+	local_bh_enable();
+}
+
 static void sock_map_add_link(struct sk_psock *psock,
 			      struct sk_psock_link *link,
 			      struct bpf_map *map, void *link_raw)
@@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
 	if (!sock_map_sk_is_suitable(sk))
 		return -EOPNOTSUPP;
 
-	local_bh_disable();
-	bh_lock_sock(sk);
+	sock_map_sk_acquire_fast(sk);
 	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
 		ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
 	else
 		ret = sock_hash_update_common(map, key, sk, flags);
-	bh_unlock_sock(sk);
-	local_bh_enable();
+	sock_map_sk_release_fast(sk);
 	return ret;
 }
 

-- 
2.52.0



* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-05 23:30 ` [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock Michal Luczaj
@ 2026-03-06  5:01   ` Jiayuan Chen
  2026-03-06 14:09     ` Michal Luczaj
  0 siblings, 1 reply; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06  5:01 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest


On 3/6/26 7:30 AM, Michal Luczaj wrote:
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
> socket is properly set up, which would include having a defined peer. IOW,
> there's a window when unix_stream_bpf_update_proto() can be called on
> socket which still has unix_peer(sk) == NULL.
>
>            T0 bpf                            T1 connect
>            ------                            ----------
>
>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
>                                  sock_hold(newsk)
>                                  smp_mb__after_atomic()
>                                  unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
>    sock_map_link+0x564/0x8b0
>    sock_map_update_common+0x6e/0x340
>    sock_map_update_elem_sys+0x17d/0x240
>    __sys_bpf+0x26db/0x3250
>    __x64_sys_bpf+0x21/0x30
>    do_syscall_64+0x6b/0x3a0
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Initial idea was to move peer assignment _before_ the sk_state update[1],
> but that involved an additional memory barrier, and changing the hot path
> was rejected. Then a check during proto update was considered[2], but a
> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
> lock. Or, more specifically, an insufficient lock[4].
>
> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
> critical sections under unix_state_lock() operating on unix_sock::lock.
>
> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>   net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>   1 file changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 7ba6a7f24ccd..6109fbe6f99c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
>   #include <linux/list.h>
>   #include <linux/jhash.h>
>   #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
>   #include <net/udp.h>
>   
>   struct bpf_stab {
> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>   }
>   
>   static void sock_map_sk_acquire(struct sock *sk)
> -	__acquires(&sk->sk_lock.slock)
>   {
>   	lock_sock(sk);
> +
> +	if (sk_is_unix(sk))
> +		unix_state_lock(sk);
> +
>   	rcu_read_lock();
>   }
>   

This introduces a new ordering constraint: lock_sock() before
unix_state_lock(). Kuniyuki flagged in the v2 review that taking
lock_sock() inside unix_state_lock() in the future would create an
ABBA deadlock, which is exactly why the order was settled this way. However,
the thread did not reach a conclusion on whether that constraint should be
documented in the code.

Since there is nothing enforcing this ordering mechanically, a brief comment
at sock_map_sk_acquire() would help future readers avoid introducing the
inverse ordering.


>   static void sock_map_sk_release(struct sock *sk)
> -	__releases(&sk->sk_lock.slock)
>   {
>   	rcu_read_unlock();
> +
> +	if (sk_is_unix(sk))
> +		unix_state_unlock(sk);
> +
>   	release_sock(sk);
>   }
>   
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> +	local_bh_disable();
> +	bh_lock_sock(sk);
> +
> +	if (sk_is_unix(sk))
> +		unix_state_lock(sk);
> +}
> +


v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
sockets took only unix_state_lock() in the fast path, skipping
bh_lock_sock() entirely:

/* v2 */
if (sk_is_unix(sk))
   unix_state_lock(sk);
else
   bh_lock_sock(sk);

v3 takes both for AF_UNIX sockets.

bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
actually needs protection here — sk_state and unix_peer() — lives under
unix_sock::lock. Since unix_state_lock() is already sufficient to close
the race against unix_stream_connect(), is bh_lock_sock() still doing
anything useful for AF_UNIX sockets in this path?

The v3 changelog attributes "Keep lock_sock() along with unix_state_lock()"
to Kuniyuki, but that comment appears to concern sock_map_sk_acquire()
(the slow path); the fast path behaviour change was not explicitly discussed
in the v2 thread.

If bh_lock_sock() is redundant for AF_UNIX here, reverting to the v2
approach would reduce unnecessary lock contention. If it is still needed,
a comment explaining what it protects for AF_UNIX sockets would help.

> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> +	if (sk_is_unix(sk))
> +		unix_state_unlock(sk);
> +
> +	bh_unlock_sock(sk);
> +	local_bh_enable();
> +}
> +
>   static void sock_map_add_link(struct sk_psock *psock,
>   			      struct sk_psock_link *link,
>   			      struct bpf_map *map, void *link_raw)
> @@ -604,16 +629,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
>   	if (!sock_map_sk_is_suitable(sk))
>   		return -EOPNOTSUPP;
>   
> -	local_bh_disable();
> -	bh_lock_sock(sk);
> +	sock_map_sk_acquire_fast(sk);
>   	if (!sock_map_sk_state_allowed(sk))
>   		ret = -EOPNOTSUPP;
>   	else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
>   		ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
>   	else
>   		ret = sock_hash_update_common(map, key, sk, flags);
> -	bh_unlock_sock(sk);
> -	local_bh_enable();
> +	sock_map_sk_release_fast(sk);
>   	return ret;
>   }
>   
>



* Re: [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
@ 2026-03-06  5:30   ` Kuniyuki Iwashima
  2026-03-06  6:24   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock^sk_state data-races Jiayuan Chen
  2026-03-18 17:05   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
  2 siblings, 0 replies; 26+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-06  5:30 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On Thu, Mar 5, 2026 at 3:31 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
> socket sk_state locklessly.
>
> Use READ_ONCE(). Note that for sock_map_redirect_allowed() change affects
> not only af_unix, but all non-TCP sockets (UDP, af_vsock).
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Actually the TCP path also needs READ_ONCE(), but I think
it's okay for now since this series focuses on AF_UNIX.


* Re: [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-05 23:30 ` [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
@ 2026-03-06  5:44   ` Kuniyuki Iwashima
  2026-03-06 14:05     ` Michal Luczaj
  0 siblings, 1 reply; 26+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-06  5:44 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> Instead of repeating the same (un)locking pattern, reuse
> sock_map_sk_{acquire,release}(). This centralizes the code and makes it
> easier to adapt sockmap to af_unix-specific locking.
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/core/sock_map.c | 21 +++++++--------------
>  1 file changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 02a68be3002a..7ba6a7f24ccd 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
>                 sk = xchg(psk, NULL);
>                 if (sk) {
>                         sock_hold(sk);
> -                       lock_sock(sk);
> -                       rcu_read_lock();
> +                       sock_map_sk_acquire(sk);
>                         sock_map_unref(sk, psk);
> -                       rcu_read_unlock();
> -                       release_sock(sk);
> +                       sock_map_sk_release(sk);
>                         sock_put(sk);
>                 }
>         }
> @@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
>                  */
>                 hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
>                         hlist_del(&elem->node);
> -                       lock_sock(elem->sk);
> -                       rcu_read_lock();
> +                       sock_map_sk_acquire(elem->sk);
>                         sock_map_unref(elem->sk, elem);
> -                       rcu_read_unlock();
> -                       release_sock(elem->sk);
> +                       sock_map_sk_release(elem->sk);
>                         sock_put(elem->sk);
>                         sock_hash_free_elem(htab, elem);
>                 }
> @@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
>         void (*saved_close)(struct sock *sk, long timeout);
>         struct sk_psock *psock;
>
> -       lock_sock(sk);
> -       rcu_read_lock();
> +       sock_map_sk_acquire(sk);
>         psock = sk_psock(sk);
>         if (likely(psock)) {
>                 saved_close = psock->saved_close;
> @@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
>                 psock = sk_psock_get(sk);
>                 if (unlikely(!psock))
>                         goto no_psock;
> -               rcu_read_unlock();
>                 sk_psock_stop(psock);
> -               release_sock(sk);
> +               sock_map_sk_release(sk);

I think sk_psock_stop() was intentionally put outside
of rcu_read_lock() to avoid extending the RCU critical
section unnecessarily, e.g. over the while loop that
calls __sk_msg_free().

Maybe add a __sock_map_sk_release() without the
rcu_read_unlock() ?
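
E.g. something like this (untested sketch, names are just a strawman):

	/* Like sock_map_sk_release(), but for callers that have
	 * already left the RCU read-side critical section.
	 */
	static void __sock_map_sk_release(struct sock *sk)
	{
		release_sock(sk);
	}

and in sock_map_close():

		rcu_read_unlock();
		sk_psock_stop(psock);
		__sock_map_sk_release(sk);

That would keep sk_psock_stop() outside the RCU section as before.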



>                 cancel_delayed_work_sync(&psock->work);
>                 sk_psock_put(sk, psock);
>         } else {
>                 saved_close = READ_ONCE(sk->sk_prot)->close;
>  no_psock:
> -               rcu_read_unlock();
> -               release_sock(sk);
> +               sock_map_sk_release(sk);
>         }
>
>         /* Make sure we do not recurse. This is a bug.
>
> --
> 2.52.0
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
@ 2026-03-06  5:47   ` Kuniyuki Iwashima
  2026-03-06  6:04   ` Jiayuan Chen
  2026-03-06 14:33   ` Jiayuan Chen
  2 siblings, 0 replies; 26+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-06  5:47 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
> path and the iter prog attempts to update a sockmap. Which ends up spinning
> at sock_map_update_elem()'s bh_lock_sock():
>
> WARNING: possible recursive locking detected
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(slock-AF_UNIX);
>   lock(slock-AF_UNIX);
>
>  *** DEADLOCK ***
>
>  May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
>  #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
>  #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
>  #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>  #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
>  dump_stack_lvl+0x5d/0x80
>  print_deadlock_bug.cold+0xc0/0xce
>  __lock_acquire+0x130f/0x2590
>  lock_acquire+0x14e/0x2b0
>  _raw_spin_lock+0x30/0x40
>  sock_map_update_elem+0xdb/0x1f0
>  bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
>  bpf_iter_run_prog+0x5b9/0xb00
>  bpf_iter_unix_seq_show+0x1f7/0x2e0
>  bpf_seq_read+0x42c/0x10d0
>  vfs_read+0x171/0xb20
>  ksys_read+0xff/0x200
>  do_syscall_64+0x6b/0x3a0
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
  2026-03-06  5:47   ` Kuniyuki Iwashima
@ 2026-03-06  6:04   ` Jiayuan Chen
  2026-03-06  6:15     ` Jiayuan Chen
  2026-03-06 14:06     ` Michal Luczaj
  2026-03-06 14:33   ` Jiayuan Chen
  2 siblings, 2 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06  6:04 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest


On 3/6/26 7:30 AM, Michal Luczaj wrote:
> bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
> path and the iter prog attempts to update a sockmap. Which ends up spinning
> at sock_map_update_elem()'s bh_lock_sock():
>
> WARNING: possible recursive locking detected
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
>   Possible unsafe locking scenario:
>
>         CPU0
>         ----
>    lock(slock-AF_UNIX);
>    lock(slock-AF_UNIX);
>
>   *** DEADLOCK ***
>
>   May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
>   #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
>   #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
>   #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>   #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
>   dump_stack_lvl+0x5d/0x80
>   print_deadlock_bug.cold+0xc0/0xce
>   __lock_acquire+0x130f/0x2590
>   lock_acquire+0x14e/0x2b0
>   _raw_spin_lock+0x30/0x40
>   sock_map_update_elem+0xdb/0x1f0
>   bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
>   bpf_iter_run_prog+0x5b9/0xb00
>   bpf_iter_unix_seq_show+0x1f7/0x2e0
>   bpf_seq_read+0x42c/0x10d0
>   vfs_read+0x171/0xb20
>   ksys_read+0xff/0x200
>   do_syscall_64+0x6b/0x3a0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>   net/unix/af_unix.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 3756a93dc63a..3d2cfb4ecbcd 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -3729,15 +3729,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> -	bool slow;
>   	int ret;
>   
>   	if (v == SEQ_START_TOKEN)
>   		return 0;
>   
> -	slow = lock_sock_fast(sk);
> +	lock_sock(sk);
>   
> -	if (unlikely(sk_unhashed(sk))) {
> +	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
>   		ret = SEQ_SKIP;
>   		goto unlock;
>   	}


Switching to lock_sock() fixes the deadlock, but it does not provide mutual
exclusion with unix_release_sock(), which uses unix_state_lock() exclusively
and does not touch lock_sock() at all. So a dying socket can still reach the
BPF prog concurrently with unix_release_sock() running on another CPU.

Both SOCK_DEAD and the clearing of unix_peer(sk) happen under
unix_state_lock() in unix_release_sock(). Without taking unix_state_lock()
before the SOCK_DEAD check, there is a window:

iter                              unix_release_sock()
----                              -------------------
lock_sock(sk)
SOCK_DEAD == 0  (check passes)
                                  unix_state_lock(sk)
                                  unix_peer(sk) = NULL
                                  sock_set_flag(sk, SOCK_DEAD)
                                  unix_state_unlock(sk)
BPF prog runs
→ accesses unix_peer(sk) == NULL → crash

This was not raised in the v2 discussion.

The natural fix is to check SOCK_DEAD under unix_state_lock(). However,
holding unix_state_lock() throughout BPF prog execution would conflict with
patch 5: sock_map_sk_acquire_fast() also takes unix_state_lock() for AF_UNIX
sockets, resulting in a recursive spinlock deadlock.
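
For reference, the naive version I had in mind (untested sketch) takes the
state lock only around the check, which avoids the patch 5 conflict but
still leaves the prog itself unserialized against unix_release_sock():

	lock_sock(sk);

	unix_state_lock(sk);
	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
		unix_state_unlock(sk);
		ret = SEQ_SKIP;
		goto unlock;
	}
	unix_state_unlock(sk);

	/* BPF prog runs here without unix_state_lock held */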


Kuniyuki, Martin — what is the right approach here?

> @@ -3747,7 +3746,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
>   	prog = bpf_iter_get_info(&meta, false);
>   	ret = unix_prog_seq_show(prog, &meta, v, uid);
>   unlock:
> -	unlock_sock_fast(sk, slow);
> +	release_sock(sk);
>   	return ret;
>   }
>   
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-06  6:04   ` Jiayuan Chen
@ 2026-03-06  6:15     ` Jiayuan Chen
  2026-03-06 14:06     ` Michal Luczaj
  1 sibling, 0 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06  6:15 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest

March 6, 2026 at 14:04, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote:


> 
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
> 

[...]
> >  diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >  index 3756a93dc63a..3d2cfb4ecbcd 100644
> >  --- a/net/unix/af_unix.c
> >  +++ b/net/unix/af_unix.c
> >  @@ -3729,15 +3729,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
> >  struct bpf_prog *prog;
> >  struct sock *sk = v;
> >  uid_t uid;
> >  - bool slow;
> >  int ret;
> >  > if (v == SEQ_START_TOKEN)
> >  return 0;
> >  > - slow = lock_sock_fast(sk);
> >  + lock_sock(sk);
> >  > - if (unlikely(sk_unhashed(sk))) {
> >  + if (unlikely(sock_flag(sk, SOCK_DEAD))) {
> >  ret = SEQ_SKIP;
> >  goto unlock;
> >  }
> > 
> Switching to lock_sock() fixes the deadlock, but it does not provide mutual
> exclusion with unix_release_sock(), which uses unix_state_lock() exclusively
> and does not touch lock_sock() at all. So a dying socket can still reach the
> BPF prog concurrently with unix_release_sock() running on another CPU.
> 
> Both SOCK_DEAD and the clearing of unix_peer(sk) happen under
> unix_state_lock() in unix_release_sock(). Without taking unix_state_lock()
> before the SOCK_DEAD check, there is a window:
> 
> iter unix_release_sock()
> ---  lock_sock(sk)
> SOCK_DEAD == 0(check passes)
>    unix_state_lock(sk)
>                                                    unix_peer(sk) = NULL
>    sock_set_flag(sk, SOCK_DEAD)
>    unix_state_unlock(sk)
> BPF prog runs
> → accesses unix_peer(sk) == NULL → crash


Sorry for malformed message.
Here is correct:

iter                              unix_release_sock()
---  lock_sock(sk)
SOCK_DEAD == 0  (check passes)
                                  unix_state_lock(sk)
                                  unix_peer(sk) = NULL
                                  sock_set_flag(sk, SOCK_DEAD)
                                  unix_state_unlock(sk)
BPF prog runs
→ accesses unix_peer(sk) == NULL → crash

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
  2026-03-06  5:30   ` Kuniyuki Iwashima
@ 2026-03-06  6:24   ` Jiayuan Chen
  2026-03-18 17:05   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
  2 siblings, 0 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06  6:24 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest


On 3/6/26 7:30 AM, Michal Luczaj wrote:
> sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
> socket sk_state locklessly.
>
> Use READ_ONCE(). Note that the sock_map_redirect_allowed() change affects
> not only af_unix, but all non-TCP sockets (UDP, af_vsock).
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-06  5:44   ` Kuniyuki Iwashima
@ 2026-03-06 14:05     ` Michal Luczaj
  2026-03-11  4:17       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-06 14:05 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On 3/6/26 06:44, Kuniyuki Iwashima wrote:
> On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
>>
>> Instead of repeating the same (un)locking pattern, reuse
>> sock_map_sk_{acquire,release}(). This centralizes the code and makes it
>> easier to adapt sockmap to af_unix-specific locking.
>>
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
>>  net/core/sock_map.c | 21 +++++++--------------
>>  1 file changed, 7 insertions(+), 14 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 02a68be3002a..7ba6a7f24ccd 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
>>                 sk = xchg(psk, NULL);
>>                 if (sk) {
>>                         sock_hold(sk);
>> -                       lock_sock(sk);
>> -                       rcu_read_lock();
>> +                       sock_map_sk_acquire(sk);
>>                         sock_map_unref(sk, psk);
>> -                       rcu_read_unlock();
>> -                       release_sock(sk);
>> +                       sock_map_sk_release(sk);
>>                         sock_put(sk);
>>                 }
>>         }
>> @@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
>>                  */
>>                 hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
>>                         hlist_del(&elem->node);
>> -                       lock_sock(elem->sk);
>> -                       rcu_read_lock();
>> +                       sock_map_sk_acquire(elem->sk);
>>                         sock_map_unref(elem->sk, elem);
>> -                       rcu_read_unlock();
>> -                       release_sock(elem->sk);
>> +                       sock_map_sk_release(elem->sk);
>>                         sock_put(elem->sk);
>>                         sock_hash_free_elem(htab, elem);
>>                 }
>> @@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
>>         void (*saved_close)(struct sock *sk, long timeout);
>>         struct sk_psock *psock;
>>
>> -       lock_sock(sk);
>> -       rcu_read_lock();
>> +       sock_map_sk_acquire(sk);
>>         psock = sk_psock(sk);
>>         if (likely(psock)) {
>>                 saved_close = psock->saved_close;
>> @@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
>>                 psock = sk_psock_get(sk);
>>                 if (unlikely(!psock))
>>                         goto no_psock;
>> -               rcu_read_unlock();
>>                 sk_psock_stop(psock);
>> -               release_sock(sk);
>> +               sock_map_sk_release(sk);
> 
> I think sk_psock_stop() was intentionally put outside
> of rcu_read_lock() to not extend the grace period
> unnecessarily. e.g. while + __sk_msg_free().
> 
> Maybe add __sock_map_sk_release() without
> rcu_read_unlock() ?

How about dropping this patch completely? The more I stare at it, the less
I see why the af_unix state lock would matter in any of these places.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-06  6:04   ` Jiayuan Chen
  2026-03-06  6:15     ` Jiayuan Chen
@ 2026-03-06 14:06     ` Michal Luczaj
  2026-03-06 14:31       ` Jiayuan Chen
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-06 14:06 UTC (permalink / raw)
  To: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest

On 3/6/26 07:04, Jiayuan Chen wrote:
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>> @@ -3729,15 +3729,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
>>   	struct bpf_prog *prog;
>>   	struct sock *sk = v;
>>   	uid_t uid;
>> -	bool slow;
>>   	int ret;
>>   
>>   	if (v == SEQ_START_TOKEN)
>>   		return 0;
>>   
>> -	slow = lock_sock_fast(sk);
>> +	lock_sock(sk);
>>   
>> -	if (unlikely(sk_unhashed(sk))) {
>> +	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
>>   		ret = SEQ_SKIP;
>>   		goto unlock;
>>   	}
> 
> 
> Switching to lock_sock() fixes the deadlock, but it does not provide mutual
> exclusion with unix_release_sock(), which uses unix_state_lock() exclusively
> and does not touch lock_sock() at all. So a dying socket can still reach the
> BPF prog concurrently with unix_release_sock() running on another CPU.

That's right. Note that although the socket is dying, iter holds a
reference to it, so the socket is far from being freed (as in: memory
released).

> Both SOCK_DEAD and the clearing of unix_peer(sk) happen under
> unix_state_lock() in unix_release_sock(). Without taking unix_state_lock()
> before the SOCK_DEAD check, there is a window:
>
> iter                              unix_release_sock()
> ---  lock_sock(sk)
> SOCK_DEAD == 0  (check passes)
>                                   unix_state_lock(sk)
>                                   unix_peer(sk) = NULL
>                                   sock_set_flag(sk, SOCK_DEAD)
>                                   unix_state_unlock(sk)
> BPF prog runs
> → accesses unix_peer(sk) == NULL → crash
>
> This was not raised in the v2 discussion.

It was raised in v1[1]. The conclusion was that bpf prog bytecode directly
accessing unix_peer(sk) is not an issue; the bpf machinery will handle any
faults. That said, should a "bad" value of unix_peer(sk) end up as a
parameter of a bpf helper, yes, that is a well-known[2] problem (one that
has a solution unrelated to this series).

[1]:
https://lore.kernel.org/bpf/6de6f1bf-c8ee-4dfb-9b8c-f89185946630@linux.dev/
[2]:
https://lore.kernel.org/bpf/CAADnVQK_93g_KkNFYXSr8ZvA1fYh4hoFRJCJFPS-zs4ox0HhAA@mail.gmail.com/


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-06  5:01   ` Jiayuan Chen
@ 2026-03-06 14:09     ` Michal Luczaj
  2026-03-10 22:20       ` Martin KaFai Lau
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-06 14:09 UTC (permalink / raw)
  To: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest

On 3/6/26 06:01, Jiayuan Chen wrote:
> 
> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>> socket is properly set up, which would include having a defined peer. IOW,
>> there's a window when unix_stream_bpf_update_proto() can be called on
>> socket which still has unix_peer(sk) == NULL.
>>
>>            T0 bpf                            T1 connect
>>            ------                            ----------
>>
>>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>> sock_map_sk_state_allowed(sk)
>> ...
>> sk_pair = unix_peer(sk)
>> sock_hold(sk_pair)
>>                                  sock_hold(newsk)
>>                                  smp_mb__after_atomic()
>>                                  unix_peer(sk) = newsk
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>> Call Trace:
>>    sock_map_link+0x564/0x8b0
>>    sock_map_update_common+0x6e/0x340
>>    sock_map_update_elem_sys+0x17d/0x240
>>    __sys_bpf+0x26db/0x3250
>>    __x64_sys_bpf+0x21/0x30
>>    do_syscall_64+0x6b/0x3a0
>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>> but that involved an additional memory barrier, and changing the hot path
>> was rejected. Then a check during proto update was considered[2], but a
>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>> lock. Or, more specifically, an insufficient lock[4].
>>
>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>
>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>
>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
>>   net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>   1 file changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/list.h>
>>   #include <linux/jhash.h>
>>   #include <linux/sock_diag.h>
>> +#include <net/af_unix.h>
>>   #include <net/udp.h>
>>   
>>   struct bpf_stab {
>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>   }
>>   
>>   static void sock_map_sk_acquire(struct sock *sk)
>> -	__acquires(&sk->sk_lock.slock)
>>   {
>>   	lock_sock(sk);
>> +
>> +	if (sk_is_unix(sk))
>> +		unix_state_lock(sk);
>> +
>>   	rcu_read_lock();
>>   }
>>   
> 
> This introduces a new ordering constraint: lock_sock() before
> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
> lock_sock() inside unix_state_lock() in the future would create an
> ABBA deadlock, which is exactly why the order was settled this way. However,
> the thread did not reach a conclusion on whether that constraint should be
> documented in the code.
> 
> Since there is nothing enforcing this ordering mechanically, a brief comment
> at sock_map_sk_acquire() would help future readers avoid introducing the
> inverse ordering.

Sure, will do.
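
Roughly along these lines, I suppose:

	/* Lock ordering: lock_sock(sk), then unix_state_lock(sk).
	 * Never call lock_sock() with unix_state_lock() held, or we
	 * risk an ABBA deadlock against this path.
	 */
	static void sock_map_sk_acquire(struct sock *sk)
	{
		lock_sock(sk);

		if (sk_is_unix(sk))
			unix_state_lock(sk);

		rcu_read_lock();
	}

Exact wording open to bikeshedding.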

>>   static void sock_map_sk_release(struct sock *sk)
>> -	__releases(&sk->sk_lock.slock)
>>   {
>>   	rcu_read_unlock();
>> +
>> +	if (sk_is_unix(sk))
>> +		unix_state_unlock(sk);
>> +
>>   	release_sock(sk);
>>   }
>>   
>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>> +{
>> +	local_bh_disable();
>> +	bh_lock_sock(sk);
>> +
>> +	if (sk_is_unix(sk))
>> +		unix_state_lock(sk);
>> +}
>> +
> 
> 
> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
> sockets took only unix_state_lock() in the fast path, skipping
> bh_lock_sock() entirely:
> 
> /* v2 */
> if (sk_is_unix(sk))
>    unix_state_lock(sk);
> else
>    bh_lock_sock(sk);
> 
> v3 takes both for AF_UNIX sockets.
> 
> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
> actually needs protection here — sk_state and unix_peer() — lives under
> unix_sock::lock. Since unix_state_lock() is already sufficient to close
> the race against unix_stream_connect(), is bh_lock_sock() still doing
> anything useful for AF_UNIX sockets in this path?

Yeah, good point. I think I was just trying to keep the _fast variant
aligned with the sleepy one. Which I agree might be unnecessary.

In a parallel thread I've asked Kuniyuki whether it might be better to
completely drop patch 2/5, which would change how we interact with
sock_map_close(). Let's see how it goes.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-06 14:06     ` Michal Luczaj
@ 2026-03-06 14:31       ` Jiayuan Chen
  0 siblings, 0 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06 14:31 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest

March 6, 2026 at 22:06, "Michal Luczaj" <mhal@rbox.co> wrote:


> 
> On 3/6/26 07:04, Jiayuan Chen wrote:
> 
> > 
> > On 3/6/26 7:30 AM, Michal Luczaj wrote:
> > 
> > > 
> > > @@ -3729,15 +3729,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
> > >  struct bpf_prog *prog;
> > >  struct sock *sk = v;
> > >  uid_t uid;
> > >  - bool slow;
> > >  int ret;
> > >  
> > >  if (v == SEQ_START_TOKEN)
> > >  return 0;
> > >  
> > >  - slow = lock_sock_fast(sk);
> > >  + lock_sock(sk);
> > >  
> > >  - if (unlikely(sk_unhashed(sk))) {
> > >  + if (unlikely(sock_flag(sk, SOCK_DEAD))) {
> > >  ret = SEQ_SKIP;
> > >  goto unlock;
> > >  }
> > > 
> >  
> >  
> >  Switching to lock_sock() fixes the deadlock, but it does not provide mutual
> >  exclusion with unix_release_sock(), which uses unix_state_lock() exclusively
> >  and does not touch lock_sock() at all. So a dying socket can still reach the
> >  BPF prog concurrently with unix_release_sock() running on another CPU.
> > 
> That's right. Note that although the socket is dying, iter holds a
> reference to it, so the socket is far from being freed (as in: memory
> released).
> 
> > 
> > Both SOCK_DEAD and the clearing of unix_peer(sk) happen under
> >  unix_state_lock() in unix_release_sock(). Without taking unix_state_lock()
> >  before the SOCK_DEAD check, there is a window:
> > 
> >  iter unix_release_sock()
> >  --- lock_sock(sk)
> >  SOCK_DEAD == 0 (check passes)
> >  unix_state_lock(sk)
> >  unix_peer(sk) = NULL
> >  sock_set_flag(sk, SOCK_DEAD)
> >  unix_state_unlock(sk)
> >  BPF prog runs
> >  → accesses unix_peer(sk) == NULL → crash
> > 
> >  This was not raised in the v2 discussion.
> > 
> It was raised in v1[1]. Conclusion was that bpf prog bytecode directly
> accessing unix_peer(sk) is not an issue; bpf machinery will handle any
> faults. That said, should a "bad" value of unix_peer(sk) end up as a
> parameter of a bpf helper, yes, that is a well known[2] problem (that have
> a solution unrelated to this series).
> 
> [1]:
> https://lore.kernel.org/bpf/6de6f1bf-c8ee-4dfb-9b8c-f89185946630@linux.dev/
> [2]:
> https://lore.kernel.org/bpf/CAADnVQK_93g_KkNFYXSr8ZvA1fYh4hoFRJCJFPS-zs4ox0HhAA@mail.gmail.com/
>

Thanks for letting me know.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock
  2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
  2026-03-06  5:47   ` Kuniyuki Iwashima
  2026-03-06  6:04   ` Jiayuan Chen
@ 2026-03-06 14:33   ` Jiayuan Chen
  2 siblings, 0 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06 14:33 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest


On 3/6/26 7:30 AM, Michal Luczaj wrote:
> bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
> path and the iter prog attempts to update a sockmap. Which ends up spinning
> at sock_map_update_elem()'s bh_lock_sock():
>
> WARNING: possible recursive locking detected
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
>   Possible unsafe locking scenario:
>
>         CPU0
>         ----
>    lock(slock-AF_UNIX);
>    lock(slock-AF_UNIX);
>
>   *** DEADLOCK ***
>
>   May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
>   #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
>   #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
>   #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>   #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
>   dump_stack_lvl+0x5d/0x80
>   print_deadlock_bug.cold+0xc0/0xce
>   __lock_acquire+0x130f/0x2590
>   lock_acquire+0x14e/0x2b0
>   _raw_spin_lock+0x30/0x40
>   sock_map_update_elem+0xdb/0x1f0
>   bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
>   bpf_iter_run_prog+0x5b9/0xb00
>   bpf_iter_unix_seq_show+0x1f7/0x2e0
>   bpf_seq_read+0x42c/0x10d0
>   vfs_read+0x171/0xb20
>   ksys_read+0xff/0x200
>   do_syscall_64+0x6b/0x3a0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>


Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
  2026-03-05 23:30 ` [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
@ 2026-03-06 14:34   ` Jiayuan Chen
  0 siblings, 0 replies; 26+ messages in thread
From: Jiayuan Chen @ 2026-03-06 14:34 UTC (permalink / raw)
  To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest


On 3/6/26 7:30 AM, Michal Luczaj wrote:
> Updating a sockmap from a unix iterator prog may lead to a deadlock.
> Piggyback on the original selftest.
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>



* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-06 14:09     ` Michal Luczaj
@ 2026-03-10 22:20       ` Martin KaFai Lau
  2026-03-15 23:58         ` Michal Luczaj
  0 siblings, 1 reply; 26+ messages in thread
From: Martin KaFai Lau @ 2026-03-10 22:20 UTC (permalink / raw)
  To: Michal Luczaj, Jiayuan Chen
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Cong Wang, netdev, bpf, linux-kernel, linux-kselftest



On 3/6/26 6:09 AM, Michal Luczaj wrote:
> On 3/6/26 06:01, Jiayuan Chen wrote:
>>
>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>> socket is properly set up, which would include having a defined peer. IOW,
>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>> socket which still has unix_peer(sk) == NULL.
>>>
>>>             T0 bpf                            T1 connect
>>>             ------                            ----------
>>>
>>>                                   WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>> sock_map_sk_state_allowed(sk)
>>> ...
>>> sk_pair = unix_peer(sk)
>>> sock_hold(sk_pair)
>>>                                   sock_hold(newsk)
>>>                                   smp_mb__after_atomic()
>>>                                   unix_peer(sk) = newsk
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>> Call Trace:
>>>     sock_map_link+0x564/0x8b0
>>>     sock_map_update_common+0x6e/0x340
>>>     sock_map_update_elem_sys+0x17d/0x240
>>>     __sys_bpf+0x26db/0x3250
>>>     __x64_sys_bpf+0x21/0x30
>>>     do_syscall_64+0x6b/0x3a0
>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>> but that involved an additional memory barrier, and changing the hot path
>>> was rejected. Then a check during proto update was considered[2], but a
>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>> lock. Or, more specifically, an insufficient lock[4].
>>>
>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>
>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>
>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>> ---
>>>    net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>    1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -12,6 +12,7 @@
>>>    #include <linux/list.h>
>>>    #include <linux/jhash.h>
>>>    #include <linux/sock_diag.h>
>>> +#include <net/af_unix.h>
>>>    #include <net/udp.h>
>>>    
>>>    struct bpf_stab {
>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>>    }
>>>    
>>>    static void sock_map_sk_acquire(struct sock *sk)
>>> -	__acquires(&sk->sk_lock.slock)
>>>    {
>>>    	lock_sock(sk);
>>> +
>>> +	if (sk_is_unix(sk))
>>> +		unix_state_lock(sk);
>>> +
>>>    	rcu_read_lock();
>>>    }
>>>    
>>
>> This introduces a new ordering constraint: lock_sock() before
>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>> lock_sock() inside unix_state_lock() in the future would create an
>> ABBA deadlock, which is exactly why the order was settled this way. However,
>> the thread did not reach a conclusion on whether that constraint should be
>> documented in the code.
>>
>> Since there is nothing enforcing this ordering mechanically, a brief comment
>> at sock_map_sk_acquire() would help future readers avoid introducing the
>> inverse ordering.
> 
> Sure, will do.
> 
>>>    static void sock_map_sk_release(struct sock *sk)
>>> -	__releases(&sk->sk_lock.slock)
>>>    {
>>>    	rcu_read_unlock();
>>> +
>>> +	if (sk_is_unix(sk))
>>> +		unix_state_unlock(sk);
>>> +
>>>    	release_sock(sk);
>>>    }
>>>    
>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>> +{
>>> +	local_bh_disable();
>>> +	bh_lock_sock(sk);
>>> +
>>> +	if (sk_is_unix(sk))
>>> +		unix_state_lock(sk);
>>> +}
>>> +
>>
>>
>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>> sockets took only unix_state_lock() in the fast path, skipping
>> bh_lock_sock() entirely:
>>
>> /* v2 */
>> if (sk_is_unix(sk))
>>     unix_state_lock(sk);
>> else
>>     bh_lock_sock(sk);
>>
>> v3 takes both for AF_UNIX sockets.
>>
>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>> actually needs protection here — sk_state and unix_peer() — lives under
>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>> anything useful for AF_UNIX sockets in this path?
> 
> Yeah, good point. I think I was just trying to keep the _fast variant
> aligned with the sleepy one. Which I agree might be unnecessary.

I hope the common use case should not be calling 
bpf_map_update_elem(sockmap) only.

Besides, from looking at may_update_sockmap(), I don't know if it is 
even doable (or useful) to bpf_map_update_elem(unix_sk) in 
tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking 
at unix_dgram_sendmsg() => sk_filter(). It was not the original use case 
when the bpf_map_update_elem(sockmap) support was added, iirc.

The only path left is bpf_iter, which I believe was the primary use case 
when adding bpf_map_update_elem(sockmap) support [1]. It would be nice 
to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix) 
where lock_sock() has already been done. It is more for 
reading-correctness though. This just came to my mind. 
has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has 
been using it to conditionally take lock_sock() or not. [ I would still 
keep patch 3 though. ]

[1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/

> 
> In a parallel thread I've asked Kuniyuki if it might be better to
> completely drop patch 2/5, which would change how we interact with
> sock_map_close(). Let's see how it goes.
> 

If patch 2 is dropped, lock_sock() is always needed for unix_sk?


* Re: [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-06 14:05     ` Michal Luczaj
@ 2026-03-11  4:17       ` Kuniyuki Iwashima
  2026-03-11  4:57         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 26+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-11  4:17 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On Fri, Mar 6, 2026 at 6:05 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> On 3/6/26 06:44, Kuniyuki Iwashima wrote:
> > On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
> >>
> >> Instead of repeating the same (un)locking pattern, reuse
> >> sock_map_sk_{acquire,release}(). This centralizes the code and makes it
> >> easier to adapt sockmap to af_unix-specific locking.
> >>
> >> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> >> ---
> >>  net/core/sock_map.c | 21 +++++++--------------
> >>  1 file changed, 7 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> >> index 02a68be3002a..7ba6a7f24ccd 100644
> >> --- a/net/core/sock_map.c
> >> +++ b/net/core/sock_map.c
> >> @@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
> >>                 sk = xchg(psk, NULL);
> >>                 if (sk) {
> >>                         sock_hold(sk);
> >> -                       lock_sock(sk);
> >> -                       rcu_read_lock();
> >> +                       sock_map_sk_acquire(sk);
> >>                         sock_map_unref(sk, psk);
> >> -                       rcu_read_unlock();
> >> -                       release_sock(sk);
> >> +                       sock_map_sk_release(sk);
> >>                         sock_put(sk);
> >>                 }
> >>         }
> >> @@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
> >>                  */
> >>                 hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
> >>                         hlist_del(&elem->node);
> >> -                       lock_sock(elem->sk);
> >> -                       rcu_read_lock();
> >> +                       sock_map_sk_acquire(elem->sk);
> >>                         sock_map_unref(elem->sk, elem);
> >> -                       rcu_read_unlock();
> >> -                       release_sock(elem->sk);
> >> +                       sock_map_sk_release(elem->sk);
> >>                         sock_put(elem->sk);
> >>                         sock_hash_free_elem(htab, elem);
> >>                 }
> >> @@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
> >>         void (*saved_close)(struct sock *sk, long timeout);
> >>         struct sk_psock *psock;
> >>
> >> -       lock_sock(sk);
> >> -       rcu_read_lock();
> >> +       sock_map_sk_acquire(sk);
> >>         psock = sk_psock(sk);
> >>         if (likely(psock)) {
> >>                 saved_close = psock->saved_close;
> >> @@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
> >>                 psock = sk_psock_get(sk);
> >>                 if (unlikely(!psock))
> >>                         goto no_psock;
> >> -               rcu_read_unlock();
> >>                 sk_psock_stop(psock);
> >> -               release_sock(sk);
> >> +               sock_map_sk_release(sk);
> >
> > I think sk_psock_stop() was intentionally put outside
> > of rcu_read_lock() to not extend the grace period
> > unnecessarily. e.g. while + __sk_msg_free().
> >
> > Maybe add __sock_map_sk_release() without
> > rcu_read_unlock() ?
>
> How about dropping this patch completely? The more I stare at it, I see no
> reason why af_unix state lock would matter in any of these places.

I agree.  Actually, once it's held, it can be released right away.
The lock is only to ensure that peer is set after checking
TCP_ESTABLISHED, but it continues holding unix_state_lock()
unnecessarily long.

Honestly I prefer Martin's idea, using unix_peer_get() in
unix_stream_bpf_update_proto().


* Re: [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-11  4:17       ` Kuniyuki Iwashima
@ 2026-03-11  4:57         ` Kuniyuki Iwashima
  2026-03-15 23:58           ` Michal Luczaj
  0 siblings, 1 reply; 26+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-11  4:57 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On Tue, Mar 10, 2026 at 9:17 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Fri, Mar 6, 2026 at 6:05 AM Michal Luczaj <mhal@rbox.co> wrote:
> >
> > On 3/6/26 06:44, Kuniyuki Iwashima wrote:
> > > On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
> > >>
> > >> Instead of repeating the same (un)locking pattern, reuse
> > >> sock_map_sk_{acquire,release}(). This centralizes the code and makes it
> > >> easier to adapt sockmap to af_unix-specific locking.
> > >>
> > >> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> > >> ---
> > >>  net/core/sock_map.c | 21 +++++++--------------
> > >>  1 file changed, 7 insertions(+), 14 deletions(-)
> > >>
> > >> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > >> index 02a68be3002a..7ba6a7f24ccd 100644
> > >> --- a/net/core/sock_map.c
> > >> +++ b/net/core/sock_map.c
> > >> @@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
> > >>                 sk = xchg(psk, NULL);
> > >>                 if (sk) {
> > >>                         sock_hold(sk);
> > >> -                       lock_sock(sk);
> > >> -                       rcu_read_lock();
> > >> +                       sock_map_sk_acquire(sk);
> > >>                         sock_map_unref(sk, psk);
> > >> -                       rcu_read_unlock();
> > >> -                       release_sock(sk);
> > >> +                       sock_map_sk_release(sk);
> > >>                         sock_put(sk);
> > >>                 }
> > >>         }
> > >> @@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
> > >>                  */
> > >>                 hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
> > >>                         hlist_del(&elem->node);
> > >> -                       lock_sock(elem->sk);
> > >> -                       rcu_read_lock();
> > >> +                       sock_map_sk_acquire(elem->sk);
> > >>                         sock_map_unref(elem->sk, elem);
> > >> -                       rcu_read_unlock();
> > >> -                       release_sock(elem->sk);
> > >> +                       sock_map_sk_release(elem->sk);
> > >>                         sock_put(elem->sk);
> > >>                         sock_hash_free_elem(htab, elem);
> > >>                 }
> > >> @@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
> > >>         void (*saved_close)(struct sock *sk, long timeout);
> > >>         struct sk_psock *psock;
> > >>
> > >> -       lock_sock(sk);
> > >> -       rcu_read_lock();
> > >> +       sock_map_sk_acquire(sk);
> > >>         psock = sk_psock(sk);
> > >>         if (likely(psock)) {
> > >>                 saved_close = psock->saved_close;
> > >> @@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
> > >>                 psock = sk_psock_get(sk);
> > >>                 if (unlikely(!psock))
> > >>                         goto no_psock;
> > >> -               rcu_read_unlock();
> > >>                 sk_psock_stop(psock);
> > >> -               release_sock(sk);
> > >> +               sock_map_sk_release(sk);
> > >
> > > I think sk_psock_stop() was intentionally put outside
> > > of rcu_read_lock() to not extend the grace period
> > > unnecessarily. e.g. while + __sk_msg_free().
> > >
> > > Maybe add __sock_map_sk_release() without
> > > rcu_read_unlock() ?
> >
> > How about dropping this patch completely? The more I stare at it, I see no
> > reason why af_unix state lock would matter in any of these places.
>
> I agree.  Actually, once it's held, it can be released right away.
> The lock is only to ensure that peer is set after checking
> TCP_ESTABLISHED, but it continues holding unix_state_lock()
> unnecessarily long.
>
> Honestly I prefer Martin's idea, using unix_peer_get() in
> unix_stream_bpf_update_proto().

Pondering again, I'm leaning towards my initial approach:
just a NULL check for the peer, which allows bpf_iter to acquire
unix_state_lock() and make sure the socket is alive.
(still lock_sock() is needed for bpf_setsockopt())

The check is lightweight, SOCKMAP does not need to
hold the lock unnecessarily, and we can provide a stable
result to bpf_iter.

IOW, if we hold unix_state_lock() in the SOCKMAP path
(even unix_peer_get()), we cannot use unix_state_lock()
for bpf_iter and we lose stability, since it would trigger a deadlock.


* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-10 22:20       ` Martin KaFai Lau
@ 2026-03-15 23:58         ` Michal Luczaj
  2026-03-26  6:26           ` Martin KaFai Lau
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Luczaj @ 2026-03-15 23:58 UTC (permalink / raw)
  To: Martin KaFai Lau, Jiayuan Chen
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On 3/10/26 23:20, Martin KaFai Lau wrote:
> On 3/6/26 6:09 AM, Michal Luczaj wrote:
>> On 3/6/26 06:01, Jiayuan Chen wrote:
>>>
>>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>>> socket is properly set up, which would include having a defined peer. IOW,
>>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>>> socket which still has unix_peer(sk) == NULL.
>>>>
>>>>             T0 bpf                            T1 connect
>>>>             ------                            ----------
>>>>
>>>>                                   WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>>> sock_map_sk_state_allowed(sk)
>>>> ...
>>>> sk_pair = unix_peer(sk)
>>>> sock_hold(sk_pair)
>>>>                                   sock_hold(newsk)
>>>>                                   smp_mb__after_atomic()
>>>>                                   unix_peer(sk) = newsk
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>>> Call Trace:
>>>>     sock_map_link+0x564/0x8b0
>>>>     sock_map_update_common+0x6e/0x340
>>>>     sock_map_update_elem_sys+0x17d/0x240
>>>>     __sys_bpf+0x26db/0x3250
>>>>     __x64_sys_bpf+0x21/0x30
>>>>     do_syscall_64+0x6b/0x3a0
>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>>> but that involved an additional memory barrier, and changing the hot path
>>>> was rejected. Then a check during proto update was considered[2], but a
>>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>>> lock. Or, more specifically, an insufficient lock[4].
>>>>
>>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>>
>>>> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
>>>> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
>>>> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
>>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
>>>>
>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
>>>> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
>>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>>> ---
>>>>    net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>>    1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -12,6 +12,7 @@
>>>>    #include <linux/list.h>
>>>>    #include <linux/jhash.h>
>>>>    #include <linux/sock_diag.h>
>>>> +#include <net/af_unix.h>
>>>>    #include <net/udp.h>
>>>>    
>>>>    struct bpf_stab {
>>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>>>>    }
>>>>    
>>>>    static void sock_map_sk_acquire(struct sock *sk)
>>>> -	__acquires(&sk->sk_lock.slock)
>>>>    {
>>>>    	lock_sock(sk);
>>>> +
>>>> +	if (sk_is_unix(sk))
>>>> +		unix_state_lock(sk);
>>>> +
>>>>    	rcu_read_lock();
>>>>    }
>>>>    
>>>
>>> This introduces a new ordering constraint: lock_sock() before
>>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>>> lock_sock() inside unix_state_lock() in the future would create an
>>> ABBA deadlock, which is exactly why the order was settled this way. However,
>>> the thread did not reach a conclusion on whether that constraint should be
>>> documented in the code.
>>>
>>> Since there is nothing enforcing this ordering mechanically, a brief comment
>>> at sock_map_sk_acquire() would help future readers avoid introducing the
>>> inverse ordering.
>>
>> Sure, will do.
>>
>>>>    static void sock_map_sk_release(struct sock *sk)
>>>> -	__releases(&sk->sk_lock.slock)
>>>>    {
>>>>    	rcu_read_unlock();
>>>> +
>>>> +	if (sk_is_unix(sk))
>>>> +		unix_state_unlock(sk);
>>>> +
>>>>    	release_sock(sk);
>>>>    }
>>>>    
>>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>>> +{
>>>> +	local_bh_disable();
>>>> +	bh_lock_sock(sk);
>>>> +
>>>> +	if (sk_is_unix(sk))
>>>> +		unix_state_lock(sk);
>>>> +}
>>>> +
>>>
>>>
>>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>>> sockets took only unix_state_lock() in the fast path, skipping
>>> bh_lock_sock() entirely:
>>>
>>> /* v2 */
>>> if (sk_is_unix(sk))
>>>     unix_state_lock(sk);
>>> else
>>>     bh_lock_sock(sk);
>>>
>>> v3 takes both for AF_UNIX sockets.
>>>
>>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>>> actually needs protection here — sk_state and unix_peer() — lives under
>>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>>> anything useful for AF_UNIX sockets in this path?
>>
>> Yeah, good point. I think I was just trying to keep the _fast variant
>> aligned with the sleepy one. Which I agree might be unnecessary.
> 
> I hope the common use case should not be calling 
> bpf_map_update_elem(sockmap) only.
> 
> Beside, from looking at the may_update_sockmap(), I don't know if it is 
> even doable (or useful) to bpf_map_update_elem(unix_sk) in 
> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking 
> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case 
> when the bpf_map_update_elem(sockmap) support was added iirc.

What about a situation when a unix_sk is stored in a sockmap, and then a tc prog
looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
useful, but it seems doable.

> The only path left is bpf_iter, which I believe was the primary use case 
> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice 
> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix) 
> where lock_sock() has already been done. It is more for 
> reading-correctness though. This just came to my mind. 
> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has 
> been using it to conditionally take lock_sock() or not.

[ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
always `true` in sock_map_update_elem(), right? ]

Let me know if I'm correctly rephrasing your idea: assume all bpf-context
callers hold the socket locked or keep it "stable" (meaning: "sk won't
surprise sockmap update by some breaking state change coming from another
thread"). As you said, most bpf iters already take the sock_lock(), and I
have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
dropping that bh_lock_sock().

> [ I would still keep patch 3 though. ]

Right.

> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
> 
>>
>> In a parallel thread I've asked Kuniyuki if it might be better to
>> completely drop patch 2/5, which would change how we interact with
>> sock_map_close(). Let's see how it goes.
>>
> 
> If patch 2 is dropped, lock_sock() is always needed for unix_sk?

For sock_map_update_elem_sys() I wanted lock_sock()+unix_state_lock(),
following Kuniyuki's suggestion to keep the locking pattern/order (which
repeats when a unix bpf iter prog invokes bpf_map_update_elem() ->
sock_map_update_elem()). For sock_map_update_elem() that is not an option,
since we can't sleep there.


* Re: [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded
  2026-03-11  4:57         ` Kuniyuki Iwashima
@ 2026-03-15 23:58           ` Michal Luczaj
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Luczaj @ 2026-03-15 23:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
	Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On 3/11/26 05:57, Kuniyuki Iwashima wrote:
> On Tue, Mar 10, 2026 at 9:17 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Fri, Mar 6, 2026 at 6:05 AM Michal Luczaj <mhal@rbox.co> wrote:
>>>
>>> On 3/6/26 06:44, Kuniyuki Iwashima wrote:
>>>> On Thu, Mar 5, 2026 at 3:32 PM Michal Luczaj <mhal@rbox.co> wrote:
>>>>>
>>>>> Instead of repeating the same (un)locking pattern, reuse
>>>>> sock_map_sk_{acquire,release}(). This centralizes the code and makes it
>>>>> easier to adapt sockmap to af_unix-specific locking.
>>>>>
>>>>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>>>>> ---
>>>>>  net/core/sock_map.c | 21 +++++++--------------
>>>>>  1 file changed, 7 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>>> index 02a68be3002a..7ba6a7f24ccd 100644
>>>>> --- a/net/core/sock_map.c
>>>>> +++ b/net/core/sock_map.c
>>>>> @@ -353,11 +353,9 @@ static void sock_map_free(struct bpf_map *map)
>>>>>                 sk = xchg(psk, NULL);
>>>>>                 if (sk) {
>>>>>                         sock_hold(sk);
>>>>> -                       lock_sock(sk);
>>>>> -                       rcu_read_lock();
>>>>> +                       sock_map_sk_acquire(sk);
>>>>>                         sock_map_unref(sk, psk);
>>>>> -                       rcu_read_unlock();
>>>>> -                       release_sock(sk);
>>>>> +                       sock_map_sk_release(sk);
>>>>>                         sock_put(sk);
>>>>>                 }
>>>>>         }
>>>>> @@ -1176,11 +1174,9 @@ static void sock_hash_free(struct bpf_map *map)
>>>>>                  */
>>>>>                 hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
>>>>>                         hlist_del(&elem->node);
>>>>> -                       lock_sock(elem->sk);
>>>>> -                       rcu_read_lock();
>>>>> +                       sock_map_sk_acquire(elem->sk);
>>>>>                         sock_map_unref(elem->sk, elem);
>>>>> -                       rcu_read_unlock();
>>>>> -                       release_sock(elem->sk);
>>>>> +                       sock_map_sk_release(elem->sk);
>>>>>                         sock_put(elem->sk);
>>>>>                         sock_hash_free_elem(htab, elem);
>>>>>                 }
>>>>> @@ -1676,8 +1672,7 @@ void sock_map_close(struct sock *sk, long timeout)
>>>>>         void (*saved_close)(struct sock *sk, long timeout);
>>>>>         struct sk_psock *psock;
>>>>>
>>>>> -       lock_sock(sk);
>>>>> -       rcu_read_lock();
>>>>> +       sock_map_sk_acquire(sk);
>>>>>         psock = sk_psock(sk);
>>>>>         if (likely(psock)) {
>>>>>                 saved_close = psock->saved_close;
>>>>> @@ -1685,16 +1680,14 @@ void sock_map_close(struct sock *sk, long timeout)
>>>>>                 psock = sk_psock_get(sk);
>>>>>                 if (unlikely(!psock))
>>>>>                         goto no_psock;
>>>>> -               rcu_read_unlock();
>>>>>                 sk_psock_stop(psock);
>>>>> -               release_sock(sk);
>>>>> +               sock_map_sk_release(sk);
>>>>
>>>> I think sk_psock_stop() was intentionally put outside
>>>> of rcu_read_lock() to not extend the grace period
>>>> unnecessarily. e.g. while + __sk_msg_free().
>>>>
>>>> Maybe add __sock_map_sk_release() without
>>>> rcu_read_unlock() ?
>>>
>>> How about dropping this patch completely? The more I stare at it, I see no
>>> reason why af_unix state lock would matter in any of these places.
>>
>> I agree.  Actually, once it's held, it can be released right away.
>> The lock is only to ensure that peer is set after checking
>> TCP_ESTABLISHED, but it continues holding unix_state_lock()
>> unnecessarily long.
>>
>> Honestly I prefer Martin's idea, using unix_peer_get() in
>> unix_stream_bpf_update_proto().
> 
> Pondering again, I'm leaning to towards my initial approach,
> just null check for peer, which allows bpf_iter to acquire
> unix_state_lock() and make sure the socket is alive.
> (still lock_sock() is needed for bpf_setsockopt())
> 
> The check is lightweight and SOCKMAP does not need to
> hold the lock unnecessarily, and we can provide stable
> result to bpf_iter.

Right, unix_state_lock() would keep the unix sock state stable. But is that
stability really important in the context of bpf iter?

> IOW, if we hold unix_state_lock() in the SOCKMAP path
> (even unix_peer_get()), we cannot use unix_state_lock()
> for bpf_iter and lose stability since it will trigger dead lock.

Assuming the answer to the question above is "not really", I'd say taking
unix_state_lock() during sockmap update (both bpf and non-bpf context)
makes sense for 2 reasons:

1. It fulfils sockmap's locking expectation, as Martin pointed out; once the
state lock is taken, there is no need to (additionally) protect against a
null-ptr-deref during sock_map_update_elem_sys().

2. It slightly better handles the case of a unix sock changing state
unexpectedly _while_ it is being placed in a sockmap by
sock_map_update_elem(). At least for now; in a follow-up series we may be
able to get rid of (or simplify?) sock_map_sk_{acquire,release}_fast().

So, what I'm saying is: let's drop the changes touching sock_map_free(),
sock_hash_free() and sock_map_close() (this patch), and stick to the
lock_sock()-only unix iter. For the non-bpf-context update, take the unix
state lock (in series with lock_sock(), per your suggestion), which tackles
the null-ptr-deref. For the bpf-context update, follow what currently
happens for non-af_unix socks, but with the unix state spinlock instead.

This would boil down to:

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 02a68be3002a..61d782db614c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/jhash.h>
 #include <linux/sock_diag.h>
+#include <net/af_unix.h>
 #include <net/udp.h>

 struct bpf_stab {
@@ -115,19 +116,45 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 }

 static void sock_map_sk_acquire(struct sock *sk)
-	__acquires(&sk->sk_lock.slock)
 {
 	lock_sock(sk);
+
+	if (sk_is_unix(sk))
+		unix_state_lock(sk);
+
 	rcu_read_lock();
 }

 static void sock_map_sk_release(struct sock *sk)
-	__releases(&sk->sk_lock.slock)
 {
 	rcu_read_unlock();
+
+	if (sk_is_unix(sk))
+		unix_state_unlock(sk);
+
 	release_sock(sk);
 }

+static void sock_map_sk_acquire_fast(struct sock *sk)
+{
+	if (sk_is_unix(sk)) {
+		unix_state_lock(sk);
+	} else {
+		local_bh_disable();
+		bh_lock_sock(sk);
+	}
+}
+
+static void sock_map_sk_release_fast(struct sock *sk)
+{
+	if (sk_is_unix(sk)) {
+		unix_state_unlock(sk);
+	} else {
+		bh_unlock_sock(sk);
+		local_bh_enable();
+	}
+}
+
 static void sock_map_add_link(struct sk_psock *psock,
 			      struct sk_psock_link *link,
 			      struct bpf_map *map, void *link_raw)
@@ -606,16 +633,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
 	if (!sock_map_sk_is_suitable(sk))
 		return -EOPNOTSUPP;

-	local_bh_disable();
-	bh_lock_sock(sk);
+	sock_map_sk_acquire_fast(sk);
 	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
 		ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
 	else
 		ret = sock_hash_update_common(map, key, sk, flags);
-	bh_unlock_sock(sk);
-	local_bh_enable();
+	sock_map_sk_release_fast(sk);
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
  2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
  2026-03-06  5:30   ` Kuniyuki Iwashima
  2026-03-06  6:24   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Jiayuan Chen
@ 2026-03-18 17:05   ` Michal Luczaj
  2 siblings, 0 replies; 26+ messages in thread
From: Michal Luczaj @ 2026-03-18 17:05 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest

On 3/6/26 00:30, Michal Luczaj wrote:
> sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
> socket sk_state locklessly.
> 
> Use READ_ONCE(). Note that for sock_map_redirect_allowed() the change
> affects not only af_unix, but all non-TCP sockets (UDP, af_vsock).
> 
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/core/sock_map.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index b0e96337a269..02a68be3002a 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c

...

> @@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
>  	if (sk_is_tcp(sk))
>  		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
>  	if (sk_is_stream_unix(sk))
> -		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
> +		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
>  	if (sk_is_vsock(sk) &&
>  	    (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
>  		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
> 

Another reason to (conditionally) take unix_state_lock() in
sock_map_update_elem{,_sys}(): the chunk above could then be dropped.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
  2026-03-15 23:58         ` Michal Luczaj
@ 2026-03-26  6:26           ` Martin KaFai Lau
  0 siblings, 0 replies; 26+ messages in thread
From: Martin KaFai Lau @ 2026-03-26  6:26 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest

On 3/15/26 4:58 PM, Michal Luczaj wrote:
>> Besides, from looking at may_update_sockmap(), I don't know if it is
>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>> when the bpf_map_update_elem(sockmap) support was added iirc.
> 
> What about a situation when unix_sk is stored in a sockmap, then tc prog
> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
> useful, but seems doable.

[ Sorry for the late reply ]

It is a bummer that the bpf_map_update_elem(unix_sk) path is possible 
from tc :(

Then unix_state_lock() in its current form cannot be safely acquired in 
sock_map_update_elem(). It is currently a spin_lock() instead of 
spin_lock_bh().
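To spell out the hazard, here is a hypothetical single-CPU interleaving
(unix_state_lock() is a plain spin_lock() on unix_sk(s)->lock, which does
not disable bottom halves):

```
process context                     softirq (tc ingress) on the same CPU
---------------                     -------------------------------------
unix_state_lock(sk)
  = spin_lock(&u->lock)
                                    tc prog runs, preempting the holder
                                    bpf_map_update_elem(map, key, unix_sk)
                                      sock_map_update_elem()
                                        unix_state_lock(sk)
                                          -> spins forever; the lock
                                             owner can never run again
```

spin_lock_bh() (or only taking the lock where BH is already disabled)
would rule this interleaving out.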

> 
>> The only path left is bpf_iter, which I believe was the primary use case
>> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
>> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
>> where lock_sock() has already been done. It is more for
>> reading-correctness though. This just came to my mind.
>> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
>> been using it to conditionally take lock_sock() or not.
> 
> [ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
> which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
> always `true` in sock_map_update_elem(), right? ]

For all the bpf prog types allowed by may_update_sockmap() to do 
bpf_map_update_elem(sockmap), only BPF_TRACE_ITER should have 
has_current_bpf_ctx() == true. The tc prog (and others allowed in 
may_update_sockmap()) will have has_current_bpf_ctx() == false when 
calling sock_map_update_elem().

The tc case of bpf_map_update_elem(unix_sk) is unfortunate and requires 
going back to the drawing board. I think checking unix_peer(sk) for NULL 
without acquiring unix_state_lock() is needed for the 
sock_map_update_elem() path, since changing unix_state_lock() for this 
unknown use case seems overkill.

Whether sock_map_update_elem_"sys"() needs unix_state_lock() is up for 
debate.

For bpf_iter_unix_seq_show(), one thought is to add unix_state_lock() 
there before running the bpf iter prog. iiuc, it is what Kuniyuki has in 
mind also to allow bpf iter prog having a stable view of unix_sock. This 
could be a followup.
[fwiw, this was why I first thought of has_current_bpf_ctx(): to avoid 
sock_map_update_elem() taking unix_state_lock() again if 
bpf_iter_unix_seq_show() had acquired unix_state_lock() earlier. I later 
concluded (incorrectly, as it turned out) that tc cannot call 
bpf_map_update_elem(unix_sk).]

> 
> Let me know if I'm correctly rephrasing your idea: assume all bpf-context
> callers hold the socket locked or keep it "stable" (meaning: "sk won't
> surprise sockmap update by some breaking state change coming from another
> thread"). As you said, most bpf iters already take the sock_lock(), and I

Right, all bpf iters (udp, tcp, unix) have acquired lock_sock() before 
running the bpf iter prog. The only exception, afaik, is the netlink bpf 
iter, but it cannot be added to a sockmap.

> have a patch that fixes sock_{map,hash}_seq_show(). Then we could try
> dropping that bh_lock_sock().
> 
>> [ I would still keep patch 3 though. ]
> 
> Right.
> 
>> [1]: https://lore.kernel.org/bpf/20200821102948.21918-1-lmb@cloudflare.com/
>>
>>>
>>> In a parallel thread I've asked Kuniyuki if it might be better to
>>> completely drop patch 2/5, which would change how we interact with
>>> sock_map_close(). Lets see how it goes.
>>>
>>
>> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
> 
> For sock_map_update_elem_sys() I wanted lock_sock()+unix_state_lock(),
> following Kuniyuki's suggestion, to keep the locking pattern/order (which
> repeats when a unix bpf iter prog invokes bpf_map_update_elem() ->
> sock_map_update_elem()). For sock_map_update_elem(), no, since we can't
> sleep there.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-03-26  6:26 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-05 23:30 [PATCH bpf v3 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
2026-03-06  5:30   ` Kuniyuki Iwashima
2026-03-06  6:24   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Jiayuan Chen
2026-03-18 17:05   ` [PATCH bpf v3 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 2/5] bpf, sockmap: Use sock_map_sk_{acquire,release}() where open-coded Michal Luczaj
2026-03-06  5:44   ` Kuniyuki Iwashima
2026-03-06 14:05     ` Michal Luczaj
2026-03-11  4:17       ` Kuniyuki Iwashima
2026-03-11  4:57         ` Kuniyuki Iwashima
2026-03-15 23:58           ` Michal Luczaj
2026-03-05 23:30 ` [PATCH bpf v3 3/5] bpf, sockmap: Fix af_unix iter deadlock Michal Luczaj
2026-03-06  5:47   ` Kuniyuki Iwashima
2026-03-06  6:04   ` Jiayuan Chen
2026-03-06  6:15     ` Jiayuan Chen
2026-03-06 14:06     ` Michal Luczaj
2026-03-06 14:31       ` Jiayuan Chen
2026-03-06 14:33   ` Jiayuan Chen
2026-03-05 23:30 ` [PATCH bpf v3 4/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking Michal Luczaj
2026-03-06 14:34   ` Jiayuan Chen
2026-03-05 23:30 ` [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock Michal Luczaj
2026-03-06  5:01   ` Jiayuan Chen
2026-03-06 14:09     ` Michal Luczaj
2026-03-10 22:20       ` Martin KaFai Lau
2026-03-15 23:58         ` Michal Luczaj
2026-03-26  6:26           ` Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox