netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v3 0/3] bpf: Fix use-after-free of sockmap
@ 2025-03-17  9:22 Jiayuan Chen
  2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-17  9:22 UTC (permalink / raw)
  To: xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, mhal, jiayuan.chen, sgarzare, netdev, bpf,
	linux-kernel, linux-kselftest

Hi all, this is the v3 version.
===
Syzkaller reported this issue [1].

The current sockmap code depends on sk_socket in both the read and write
paths, but sk->sk_socket can be released concurrently, leading to a
panic. For detailed reproduction steps, please refer to the description
in v2:
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/

The corresponding fix approaches are described in the commit messages of
each patch.

By the way, the current sockmap lacks statistics, especially global
ones, such as the number of successful or failed rx and tx operations.
These statistics cannot be obtained from the socket interface itself.

Such data would be of great help when troubleshooting issues and
observing sockmap behavior.

If the maintainers/reviewers do not object, I think we can provide this
statistical information in the future, through proc, tracepoints, or
bpftool.

[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072

---
v2 -> v3:
1. Michal Luczaj reported a similar race in the sockmap send path.
2. The RCU read lock conflicts with mutex_lock in the Unix socket read
   implementation.
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/

v1 -> v2:
1. Add a Fixes tag.
2. Extend the selftests with edge cases for TCP/UDP sockets.
3. Add Reviewed-by and Acked-by tags.
https://lore.kernel.org/bpf/20250226132242.52663-1-jiayuan.chen@linux.dev/T/#t

Jiayuan Chen (3):
  bpf, sockmap: avoid using sk_socket after free when sending
  bpf, sockmap: avoid using sk_socket after free when reading
  selftests/bpf: Add edge case tests for sockmap

 net/core/skmsg.c                              | 22 ++++++-
 .../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 60 +++++++++++++++++++
 3 files changed, 91 insertions(+), 4 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-17  9:22 [PATCH bpf-next v3 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
@ 2025-03-17  9:22 ` Jiayuan Chen
  2025-03-19 23:02   ` Cong Wang
  2025-03-20 12:32   ` Michal Luczaj
  2025-03-17  9:22 ` [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
  2025-03-17  9:22 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add edge case tests for sockmap Jiayuan Chen
  2 siblings, 2 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-17  9:22 UTC (permalink / raw)
  To: xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, mhal, jiayuan.chen, sgarzare, netdev, bpf,
	linux-kernel, linux-kselftest

The sk->sk_socket is neither locked nor referenced, so during the call
to skb_send_sock() it races with the release of sk_socket. All socket
types (TCP/UDP/Unix/vsock) are affected.

Race conditions:
'''
CPU0                               CPU1
skb_send_sock
  sendmsg_unlocked
    sock_sendmsg
      sock_sendmsg_nosec
                                   close(fd):
                                     ...
                                   ops->release()
                                     sock_map_close()
                                   sk_socket->ops = NULL
                                   free(socket)
      sock->ops->sendmsg
            ^
            panic here
'''

Since sock_map_close() already waits for the workqueue to finish when a
psock reference is held, we simply take a psock reference in the
backlog work to close the race:
'''
void sock_map_close()
{
    ...
    if (likely(psock)) {
    ...
    psock = sk_psock_get(sk);
    if (unlikely(!psock))
        goto no_psock; <=== Control usually jumps here via goto
        ...
        cancel_delayed_work_sync(&psock->work); <=== not executed
        sk_psock_put(sk, psock);
        ...
}
'''
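(Aside for reviewers unfamiliar with the psock lifetime rules:
sk_psock_get() follows the usual refcount_inc_not_zero() pattern, where
a get succeeds only while at least one reference is still outstanding
and a dead object is never revived. Below is a minimal userspace sketch
of that semantics, using a hypothetical `ref` type and C11 atomics
rather than the kernel's refcount_t.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the refcount_inc_not_zero() semantics
 * that sk_psock_get() relies on; the kernel uses refcount_t instead. */
struct ref {
	atomic_int count;
};

/* Take a reference only if the object is still alive (count > 0). */
static bool ref_get_unless_zero(struct ref *r)
{
	int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure the CAS reloads 'old', so we simply retry. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;	/* object already hit zero: do not revive it */
}

/* Drop a reference; the caller that sees 'true' must free the object. */
static bool ref_put(struct ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

In this model, a failed get in the backlog work simply means the close
path already dropped the last reference, so the work must return without
touching the socket.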

The panic I caught:
'''
Workqueue: events sk_psock_backlog
RIP: 0010:sock_sendmsg+0x21d/0x440
RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
...
Call Trace:
 <TASK>
 ? die_addr+0x40/0xa0
 ? exc_general_protection+0x14c/0x230
 ? asm_exc_general_protection+0x26/0x30
 ? sock_sendmsg+0x21d/0x440
 ? sock_sendmsg+0x3e0/0x440
 ? __pfx_sock_sendmsg+0x10/0x10
 __skb_send_sock+0x543/0xb70
 sk_psock_backlog+0x247/0xb80
...
'''

Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

---
Some approaches I tried:
1. Adding RCU:
   - RCU conflicts with mutex_lock in the Unix socket send path.
   - A race still remains when reading sk->sk_socket->ops in the
     current sock_sendmsg() implementation.

2. Increasing the reference count of sk_socket->file:
   - If the user calls close(fd), nothing happens because the reference
     count has not dropped to 0, which is unexpected.

3. Taking the sock lock when calling skb_send_sock():
   - skb_send_sock() itself already does the locking.
   - If we called skb_send_sock_locked() instead, we would have to
     implement sendmsg_locked for each protocol, which is not easy for
     UDP or Unix sockets, as the send path involves frequent locking
     and unlocking, making it hard to isolate the locking logic.
---
 net/core/skmsg.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 0ddc4c718833..6101c1bb279a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -655,6 +655,14 @@ static void sk_psock_backlog(struct work_struct *work)
 	bool ingress;
 	int ret;
 
+	/* Increment the psock refcnt to synchronize with the close(fd) path
+	 * in sock_map_close(), ensuring we wait for backlog work completion
+	 * before sk_socket is freed. If the refcnt increment fails, it means
+	 * sock_map_close() has completed and sk_socket may already be freed.
+	 */
+	if (!sk_psock_get(psock->sk))
+		return;
+
 	mutex_lock(&psock->work_mutex);
 	if (unlikely(state->len)) {
 		len = state->len;
@@ -702,6 +710,7 @@ static void sk_psock_backlog(struct work_struct *work)
 	}
 end:
 	mutex_unlock(&psock->work_mutex);
+	sk_psock_put(psock->sk, psock);
 }
 
 struct sk_psock *sk_psock_init(struct sock *sk, int node)
-- 
2.47.1



* [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading
  2025-03-17  9:22 [PATCH bpf-next v3 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
  2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
@ 2025-03-17  9:22 ` Jiayuan Chen
  2025-03-20  0:34   ` Cong Wang
  2025-03-17  9:22 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add edge case tests for sockmap Jiayuan Chen
  2 siblings, 1 reply; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-17  9:22 UTC (permalink / raw)
  To: xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, mhal, jiayuan.chen, sgarzare, netdev, bpf,
	linux-kernel, linux-kselftest, syzbot+dd90a702f518e0eac072

There are potential concurrency issues, as shown below.
'''
CPU0                               CPU1
sk_psock_verdict_data_ready:
  socket *sock = sk->sk_socket
  if (!sock) return
                                   close(fd):
                                     ...
                                     ops->release()
  if (!sock->ops) return
                                     sock->ops = NULL
                                     rcu_call(sock)
                                     free(sock)
  READ_ONCE(sock->ops)
  ^
  use 'sock' after free
'''

RCU is not applicable to the Unix socket read path, because the Unix
socket implementation assumes it always runs in process context and
makes heavy use of mutex_lock, so we cannot call read_skb() inside an
RCU read-side critical section.

Incrementing the psock reference count would not help either, since
sock_map_close() does not wait for data_ready() to complete its
execution.

While we do not use sk_socket itself here, implementing read_skb at the
sock layer instead of the socket layer might be architecturally
preferable. However, that refactoring is deferred, as the current fix
adequately addresses the immediate issue.

Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Reported-by: syzbot+dd90a702f518e0eac072@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6734c033.050a0220.2a2fcc.0015.GAE@google.com/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/core/skmsg.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6101c1bb279a..5e913b62929e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1231,17 +1231,24 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 
 static void sk_psock_verdict_data_ready(struct sock *sk)
 {
-	struct socket *sock = sk->sk_socket;
+	struct socket *sock;
 	const struct proto_ops *ops;
 	int copied;
 
 	trace_sk_data_ready(sk);
 
-	if (unlikely(!sock))
+	rcu_read_lock();
+	sock = sk->sk_socket;
+	if (unlikely(!sock)) {
+		rcu_read_unlock();
 		return;
+	}
 	ops = READ_ONCE(sock->ops);
-	if (!ops || !ops->read_skb)
+	if (!ops || !ops->read_skb) {
+		rcu_read_unlock();
 		return;
+	}
+	rcu_read_unlock();
 	copied = ops->read_skb(sk, sk_psock_verdict_recv);
 	if (copied >= 0) {
 		struct sk_psock *psock;
-- 
2.47.1



* [PATCH bpf-next v3 3/3] selftests/bpf: Add edge case tests for sockmap
  2025-03-17  9:22 [PATCH bpf-next v3 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
  2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
  2025-03-17  9:22 ` [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
@ 2025-03-17  9:22 ` Jiayuan Chen
  2 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-17  9:22 UTC (permalink / raw)
  To: xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, mhal, jiayuan.chen, sgarzare, netdev, bpf,
	linux-kernel, linux-kselftest

Add edge case tests for sockmap.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 .../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 60 +++++++++++++++++++
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/socket_helpers.h b/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
index 1bdfb79ef009..a805143dd84f 100644
--- a/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
@@ -313,11 +313,22 @@ static inline int recv_timeout(int fd, void *buf, size_t len, int flags,
 
 static inline int create_pair(int family, int sotype, int *p0, int *p1)
 {
-	__close_fd int s, c = -1, p = -1;
+	__close_fd int s = -1, c = -1, p = -1;
 	struct sockaddr_storage addr;
 	socklen_t len = sizeof(addr);
 	int err;
 
+	if (family == AF_UNIX) {
+		int fds[2];
+
+		err = socketpair(family, sotype, 0, fds);
+		if (!err) {
+			*p0 = fds[0];
+			*p1 = fds[1];
+		}
+		return err;
+	}
+
 	s = socket_loopback(family, sotype);
 	if (s < 0)
 		return s;
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1e3e4392dcca..c72357f41035 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1042,6 +1042,59 @@ static void test_sockmap_vsock_unconnected(void)
 	xclose(map);
 }
 
+void *close_thread(void *arg)
+{
+	int *fd = (int *)arg;
+
+	sleep(1);
+	close(*fd);
+	*fd = -1;
+	return NULL;
+}
+
+void test_sockmap_with_close_on_write(int family, int sotype)
+{
+	struct test_sockmap_pass_prog *skel;
+	int err, map, verdict;
+	pthread_t tid;
+	int zero = 0;
+	int c = -1, p = -1;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto out;
+
+	err = create_pair(family, sotype, &c, &p);
+	if (!ASSERT_OK(err, "create_pair"))
+		goto out;
+
+	err = bpf_map_update_elem(map, &zero, &p, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto out;
+
+	err = pthread_create(&tid, 0, close_thread, &p);
+	if (!ASSERT_OK(err, "pthread_create"))
+		goto out;
+
+	while (p > 0)
+		send(c, "a", 1, MSG_NOSIGNAL);
+
+	pthread_join(tid, NULL);
+out:
+	if (c > 0)
+		close(c);
+	if (p > 0)
+		close(p);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
 void test_sockmap_basic(void)
 {
 	if (test__start_subtest("sockmap create_update_free"))
@@ -1108,4 +1161,11 @@ void test_sockmap_basic(void)
 		test_sockmap_skb_verdict_vsock_poll();
 	if (test__start_subtest("sockmap vsock unconnected"))
 		test_sockmap_vsock_unconnected();
+	if (test__start_subtest("sockmap with write on close")) {
+		test_sockmap_with_close_on_write(AF_UNIX, SOCK_STREAM);
+		test_sockmap_with_close_on_write(AF_UNIX, SOCK_DGRAM);
+		test_sockmap_with_close_on_write(AF_INET, SOCK_STREAM);
+		test_sockmap_with_close_on_write(AF_INET, SOCK_DGRAM);
+		test_sockmap_with_close_on_write(AF_VSOCK, SOCK_STREAM);
+	}
 }
-- 
2.47.1



* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
@ 2025-03-19 23:02   ` Cong Wang
  2025-03-19 23:36     ` Jiayuan Chen
  2025-03-20 12:32   ` Michal Luczaj
  1 sibling, 1 reply; 12+ messages in thread
From: Cong Wang @ 2025-03-19 23:02 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest

On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> The sk->sk_socket is not locked or referenced, and during the call to

Hm? We should have a reference in socket map, whether directly or
indirectly, right? When we add a socket to a socket map, we do call
sock_map_psock_get_checked() to obtain a reference.

> skb_send_sock(), there is a race condition with the release of sk_socket.
> All types of sockets(tcp/udp/unix/vsock) will be affected.
> 
> Race conditions:
> '''
> CPU0                               CPU1
> skb_send_sock
>   sendmsg_unlocked
>     sock_sendmsg
>       sock_sendmsg_nosec
>                                    close(fd):
>                                      ...
>                                    ops->release()

IIRC, ->release() is only called when the refcnt of the fd becomes zero,
so I wonder how we reach here while we still hold a psock reference?

>                                      sock_map_close()
>                                    sk_socket->ops = NULL
>                                    free(socket)
>       sock->ops->sendmsg
>             ^
>             panic here

Thanks.


* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-19 23:02   ` Cong Wang
@ 2025-03-19 23:36     ` Jiayuan Chen
  2025-03-20  0:06       ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-19 23:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest

2025/3/20 07:02, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:

> On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> > The sk->sk_socket is not locked or referenced, and during the call to
>
> Hm? We should have a reference in socket map, whether directly or
> indirectly, right? When we add a socket to a socket map, we do call
> sock_map_psock_get_checked() to obtain a reference.

Yes, but we remove the psock from the sockmap when sock_map_close() is
called:
'''
sock_map_close
	lock_sock(sk);
	rcu_read_lock();
	psock = sk_psock(sk);
	// here we remove the psock and its reference count drops to 0
	sock_map_remove_links(sk, psock)
	psock = sk_psock_get(sk);
	if (unlikely(!psock))
		goto no_psock;     <=== jumps to no_psock
	rcu_read_unlock();
	release_sock(sk);
	cancel_delayed_work_sync(&psock->work); <=== never gets to run
'''

So I think we should hold the psock while the backlog work is running.

Thanks,


* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-19 23:36     ` Jiayuan Chen
@ 2025-03-20  0:06       ` Cong Wang
  2025-03-20  0:27         ` Jiayuan Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2025-03-20  0:06 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest

On Wed, Mar 19, 2025 at 11:36:13PM +0000, Jiayuan Chen wrote:
> 2025/3/20 07:02, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:
> > On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> > > The sk->sk_socket is not locked or referenced, and during the call to
> >
> > Hm? We should have a reference in socket map, whether directly or
> > indirectly, right? When we add a socket to a socket map, we do call
> > sock_map_psock_get_checked() to obtain a reference.
>
> Yes, but we remove psock from sockmap when sock_map_close() was called
> '''
> sock_map_close
> 	lock_sock(sk);
> 	rcu_read_lock();
> 	psock = sk_psock(sk);
> 	// here we remove psock and the reference of psock becomes 0
> 	sock_map_remove_links(sk, psock)

sk_psock_drop() also calls cancel_delayed_work_sync(&psock->work),
although in yet another work. Does this also contribute to this bug?

> 	psock = sk_psock_get(sk);
> 	if (unlikely(!psock))
> 		goto no_psock;     <=== jmp to no_psock
> 	rcu_read_unlock();
> 	release_sock(sk);
> 	cancel_delayed_work_sync(&psock->work); <== no chance to run cancel
> '''

I have to say sock_map_close() becomes harder and harder to understand
now. And I am feeling we may have more bugs since we have two in-flight
works here: psock->rwork and psock->work.

Thanks.


* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-20  0:06       ` Cong Wang
@ 2025-03-20  0:27         ` Jiayuan Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-20  0:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest

March 20, 2025 at 08:06, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:

> On Wed, Mar 19, 2025 at 11:36:13PM +0000, Jiayuan Chen wrote:
> > 2025/3/20 07:02, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:
> > > On Mon, Mar 17, 2025 at 05:22:54PM +0800, Jiayuan Chen wrote:
> > > > The sk->sk_socket is not locked or referenced, and during the call to
> > >
> > > Hm? We should have a reference in socket map, whether directly or
> > > indirectly, right? When we add a socket to a socket map, we do call
> > > sock_map_psock_get_checked() to obtain a reference.
> >
> > Yes, but we remove psock from sockmap when sock_map_close() was called
> > '''
> > sock_map_close
> > 	lock_sock(sk);
> > 	rcu_read_lock();
> > 	psock = sk_psock(sk);
> > 	// here we remove psock and the reference of psock becomes 0
> > 	sock_map_remove_links(sk, psock)
>
> sk_psock_drop() also calls cancel_delayed_work_sync(&psock->work),
> although in yet another work. Does this also contribute to this bug?

Maybe it's related. Calling cancel_delayed_work_sync() in sk_psock_drop()
is too late for our scenario.

To be more precise, the core goal of this patch is to prevent
sock_map_close() from finishing until the backlog work completes. This
is because sock_map_close() sits in the close(fd) path; once it
finishes, the subsequent steps release the sk_socket. Therefore,
performing the cancellation in sk_psock_drop() is too late.

Reviewing the historical commits, I found that the backlog work
originally held lock_sock, which naturally synchronized it with the
lock_sock in sock_map_close(). When the backlog work later dropped
lock_sock, an alternative synchronization mechanism (holding a psock
reference, as this patch does) became necessary.
> > 	psock = sk_psock_get(sk);
> > 	if (unlikely(!psock))
> > 		goto no_psock;     <=== jmp to no_psock
> > 	rcu_read_unlock();
> > 	release_sock(sk);
> > 	cancel_delayed_work_sync(&psock->work); <== no chance to run cancel
> > '''
>
> I have to say sock_map_close() becomes harder and harder to understand
> now. And I am feeling we may have more bugs since we have two in-flight
> works here: psock->rwork and psock->work.
>
> Thanks.

Yes, this patch prevents sock_map_close() from finishing until the
backlog work completes. This likely makes the cancel_delayed_work() in
sk_psock_destroy() redundant.

The code has undergone too many iterations. While sk_psock_destroy()
certainly contains redundant operations, we should retain it for now,
as there may be hidden dependencies we haven't fully untangled.

Thanks.
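The synchronization described above — the worker pins the object with a
reference, and whichever side performs the final put frees it, so the
object stays alive for the worker's whole run regardless of when
close() happens — can be sketched in userspace. This is a hypothetical
model (the `obj`/`run_demo` names are invented here, using C11 atomics
and pthreads), not the kernel code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct obj {
	atomic_int refs;
	char payload[16];
};

/* Whoever drops the last reference frees the object, analogous to what
 * sk_psock_put() ultimately does for the psock. */
static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		free(o);
}

/* The "backlog work": it already holds a reference when it starts. */
static void *worker(void *arg)
{
	struct obj *o = arg;
	long ok = (strcmp(o->payload, "alive") == 0);

	obj_put(o);	/* may be the final put if "close" ran first */
	return (void *)ok;
}

static long run_demo(void)
{
	struct obj *o = calloc(1, sizeof(*o));
	pthread_t tid;
	void *ret;

	if (!o)
		return -1;
	atomic_store(&o->refs, 1);	/* the "sockmap" reference */
	strcpy(o->payload, "alive");

	atomic_fetch_add(&o->refs, 1);	/* worker takes its reference first... */
	if (pthread_create(&tid, NULL, worker, o)) {
		obj_put(o);
		obj_put(o);
		return -1;
	}

	obj_put(o);			/* ...so this "close" cannot free the object */
	pthread_join(tid, &ret);
	return (long)ret;
}
```

Because the worker's reference is taken before the "close" put runs,
that put can never be the final one, and the worker always observes the
payload intact.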


* Re: [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading
  2025-03-17  9:22 ` [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
@ 2025-03-20  0:34   ` Cong Wang
  2025-03-20 12:36     ` Jiayuan Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2025-03-20  0:34 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest,
	syzbot+dd90a702f518e0eac072

On Mon, Mar 17, 2025 at 05:22:55PM +0800, Jiayuan Chen wrote:
> There are potential concurrency issues, as shown below.
> '''
> CPU0                               CPU1
> sk_psock_verdict_data_ready:
>   socket *sock = sk->sk_socket
>   if (!sock) return
>                                    close(fd):
>                                      ...
>                                      ops->release()
>   if (!sock->ops) return
>                                      sock->ops = NULL
>                                      rcu_call(sock)
>                                      free(sock)
>   READ_ONCE(sock->ops)
>   ^
>   use 'sock' after free
> '''
> 
> RCU is not applicable to Unix sockets read path, because the Unix socket
> implementation itself assumes it's always in process context and heavily
> uses mutex_lock, so, we can't call read_skb within rcu lock.

Hm, I guess the RCU work in sk_psock_drop() does not work for Unix
domain sockets either?

Thanks.


* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
  2025-03-19 23:02   ` Cong Wang
@ 2025-03-20 12:32   ` Michal Luczaj
  2025-03-20 14:48     ` Jiayuan Chen
  1 sibling, 1 reply; 12+ messages in thread
From: Michal Luczaj @ 2025-03-20 12:32 UTC (permalink / raw)
  To: Jiayuan Chen, xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, sgarzare, netdev, bpf, linux-kernel,
	linux-kselftest

On 3/17/25 10:22, Jiayuan Chen wrote:
> The sk->sk_socket is not locked or referenced, and during the call to
> skb_send_sock(), there is a race condition with the release of sk_socket.
> All types of sockets(tcp/udp/unix/vsock) will be affected.
> ...
> Some approach I tried
> ...
> 2. Increased the reference of sk_socket->file:
>    - If the user calls close(fd), we will do nothing because the reference
>      count is not set to 0. It's unexpected.

Have you considered bumping file's refcnt only for the time of
send/callback? Along the lines of:

static struct file *sock_get_file(struct sock *sk)
{
	struct file *file = NULL;
	struct socket *sock;

	rcu_read_lock();
	sock = sk->sk_socket;
	if (sock)
		file = get_file_active(&sock->file);
	rcu_read_unlock();

	return file;
}

static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
			       u32 off, u32 len, bool ingress)
{
	int err;

	if (!ingress) {
		struct sock *sk = psock->sk;
		struct file *file;
		...

		file = sock_get_file(sk);
		if (!file)
			return -EIO;

		err = skb_send_sock(sk, skb, off, len);
		fput(file);
		return err;
	}
	...
}

static void sk_psock_verdict_data_ready(struct sock *sk)
{
	struct file *file;
	...

	file = sock_get_file(sk);
	if (!file)
		return;

	copied = sk->sk_socket->ops->read_skb(sk, sk_psock_verdict_recv);
	fput(file);
	...
}



* Re: [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading
  2025-03-20  0:34   ` Cong Wang
@ 2025-03-20 12:36     ` Jiayuan Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-20 12:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: john.fastabend, jakub, davem, edumazet, kuba, pabeni, horms,
	andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, mhal, sgarzare,
	netdev, bpf, linux-kernel, linux-kselftest,
	syzbot+dd90a702f518e0eac072

March 20, 2025 at 08:34, "Cong Wang" <xiyou.wangcong@gmail.com> wrote:

> On Mon, Mar 17, 2025 at 05:22:55PM +0800, Jiayuan Chen wrote:
> > There are potential concurrency issues, as shown below.
> > '''
> > CPU0                               CPU1
> > sk_psock_verdict_data_ready:
> >   socket *sock = sk->sk_socket
> >   if (!sock) return
> >                                    close(fd):
> >                                      ...
> >                                      ops->release()
> >   if (!sock->ops) return
> >                                      sock->ops = NULL
> >                                      rcu_call(sock)
> >                                      free(sock)
> >   READ_ONCE(sock->ops)
> >   ^
> >   use 'sock' after free
> > '''
> >
> > RCU is not applicable to Unix sockets read path, because the Unix socket
> > implementation itself assumes it's always in process context and heavily
> > uses mutex_lock, so, we can't call read_skb within rcu lock.
>
> Hm, I guess the RCU work in sk_psock_drop() does not work for Unix
> domain sockets either?
>
> Thanks.

Although the Unix domain socket framework itself does not use RCU, the
sockmap code consistently protects access to the psock with RCU:
'''
rcu_read_lock();
psock = sk_psock(sk_other);
if (psock) {
 ...
}
rcu_read_unlock(); // `sk_psock_drop` will not execute until the unlock
'''

Therefore, I believe there are no issues with the psock operations here.

Thanks~


* Re: [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending
  2025-03-20 12:32   ` Michal Luczaj
@ 2025-03-20 14:48     ` Jiayuan Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-03-20 14:48 UTC (permalink / raw)
  To: Michal Luczaj, xiyou.wangcong, john.fastabend, jakub
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, mykolal,
	ast, daniel, martin.lau, song, yonghong.song, kpsingh, sdf,
	haoluo, jolsa, shuah, sgarzare, netdev, bpf, linux-kernel,
	linux-kselftest

March 20, 2025 at 20:32, "Michal Luczaj" <mhal@rbox.co> wrote:

> On 3/17/25 10:22, Jiayuan Chen wrote:
> > The sk->sk_socket is not locked or referenced, and during the call to
> > skb_send_sock(), there is a race condition with the release of sk_socket.
> > All types of sockets(tcp/udp/unix/vsock) will be affected.
> > ...
> > Some approach I tried
> > ...
> > 2. Increased the reference of sk_socket->file:
> >    - If the user calls close(fd), we will do nothing because the reference
> >      count is not set to 0. It's unexpected.
>
> Have you considered bumping file's refcnt only for the time of
> send/callback? Along the lines of:
>
> static struct file *sock_get_file(struct sock *sk)
> {
> 	struct file *file = NULL;
> 	struct socket *sock;
>
> 	rcu_read_lock();
> 	sock = sk->sk_socket;
> 	if (sock)
> 		file = get_file_active(&sock->file);
> 	rcu_read_unlock();
>
> 	return file;
> }
>
> static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
> 			       u32 off, u32 len, bool ingress)
> {
> 	int err;
>
> 	if (!ingress) {
> 		struct sock *sk = psock->sk;
> 		struct file *file;
> 		...
>
> 		file = sock_get_file(sk);
> 		if (!file)
> 			return -EIO;
>
> 		err = skb_send_sock(sk, skb, off, len);
> 		fput(file);
> 		return err;
> 	}
> 	...
> }
>
> static void sk_psock_verdict_data_ready(struct sock *sk)
> {
> 	struct file *file;
> 	...
>
> 	file = sock_get_file(sk);
> 	if (!file)
> 		return;
>
> 	copied = sk->sk_socket->ops->read_skb(sk, sk_psock_verdict_recv);
> 	fput(file);
> 	...
> }

Thank you for your suggestions.

I previously attempted a similar approach in another version, but felt
that manipulating the file layer from within sockmap was quite odd.

Moreover, the actual flow is more complex than it first appears.

The current overall close process looks roughly like this:
'''
close(fd):
    file_ref_put(file):
        __sock_release(sock)
            sock_map_close()
            saved_close()
               sk->sk_socket = NULL
            sock->ops = NULL
            sock->file = NULL
'''

We can observe that even when sk->sk_socket is not NULL, there is no
guarantee that sock->file is not NULL. This means the following code
is problematic:
'''
sock = sk->sk_socket;
if (sock)
    sock->file->xx <== sock->file may be set to NULL
'''

Of course, we might try something like:
'''
try_hold_sock() {
    rcu_read_lock();
    
    sock = sk->sk_socket;
    if (!sock)
        unlock_return;
    
    file = sock->file;
    if (!file)
        unlock_return;
    
    file = get_file_active(&file);
    rcu_read_unlock();
    return file;
}
'''

Acquiring a reference on the socket requires significant effort, and we
would need to pay close attention to the socket's own release process
to ensure correctness. Ultimately, this led me to operate on the psock
instead of manipulating the file layer directly, avoiding the issue.


end of thread, other threads:[~2025-03-20 14:48 UTC | newest]

Thread overview: 12+ messages
2025-03-17  9:22 [PATCH bpf-next v3 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
2025-03-17  9:22 ` [PATCH bpf-next v3 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
2025-03-19 23:02   ` Cong Wang
2025-03-19 23:36     ` Jiayuan Chen
2025-03-20  0:06       ` Cong Wang
2025-03-20  0:27         ` Jiayuan Chen
2025-03-20 12:32   ` Michal Luczaj
2025-03-20 14:48     ` Jiayuan Chen
2025-03-17  9:22 ` [PATCH bpf-next v3 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
2025-03-20  0:34   ` Cong Wang
2025-03-20 12:36     ` Jiayuan Chen
2025-03-17  9:22 ` [PATCH bpf-next v3 3/3] selftests/bpf: Add edge case tests for sockmap Jiayuan Chen
