* [PATCH bpf-next v4 1/3] bpf, sockmap: avoid using sk_socket after free when sending
2025-04-08 7:29 [PATCH bpf-next v4 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
@ 2025-04-08 7:29 ` Jiayuan Chen
2025-04-08 7:29 ` [PATCH bpf-next v4 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
2025-04-08 7:29 ` [PATCH bpf-next v4 3/3] selftests/bpf: Add edge case tests for sockmap Jiayuan Chen
2 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2025-04-08 7:29 UTC (permalink / raw)
To: bpf
Cc: mrpre, Jiayuan Chen, Michal Luczaj, John Fastabend,
Jakub Sitnicki, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Mykola Lysenko, Shuah Khan, Cong Wang, Stefano Garzarella, netdev,
linux-kernel, linux-kselftest
The sk->sk_socket is not locked or referenced, and during the call to
skb_send_sock(), there is a race condition with the release of sk_socket.
All types of sockets(tcp/udp/unix/vsock) will be affected.
Race conditions:
'''
CPU0 CPU1
skb_send_sock
sendmsg_unlocked
sock_sendmsg
sock_sendmsg_nosec
close(fd):
...
ops->release()
sock_map_close()
sk_socket->ops = NULL
free(socket)
sock->ops->sendmsg
^
panic here
'''
The ref of psock become 0 after sock_map_close() executed.
'''
void sock_map_close()
{
...
if (likely(psock)) {
...
// !!here we remove psock and the ref of psock become 0
sock_map_remove_links(sk, psock)
psock = sk_psock_get(sk);
if (unlikely(!psock))
goto no_psock; <=== Control jumps here via goto
...
cancel_delayed_work_sync(&psock->work); <=== not executed
sk_psock_put(sk, psock);
...
}
'''
Based on the fact that we already wait for the workqueue to finish in
sock_map_close() if psock is held, we simply increase the psock
reference count to avoid race conditions.
With this patch, if the backlog thread is running, sock_map_close() will
wait for the backlog thread to complete and cancel all pending work.
Any pending work that hasn't started by then will fail when invoked by
sk_psock_get(), as the psock reference count have been zeroed, and
sk_psock_drop() will cancel all jobs via cancel_delayed_work_sync.
In summary, we require synchronization to coordinate the backlog thread
and close() thread.
The panic I catched:
'''
Workqueue: events sk_psock_backlog
RIP: 0010:sock_sendmsg+0x21d/0x440
RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
...
Call Trace:
<TASK>
? die_addr+0x40/0xa0
? exc_general_protection+0x14c/0x230
? asm_exc_general_protection+0x26/0x30
? sock_sendmsg+0x21d/0x440
? sock_sendmsg+0x3e0/0x440
? __pfx_sock_sendmsg+0x10/0x10
__skb_send_sock+0x543/0xb70
sk_psock_backlog+0x247/0xb80
...
'''
Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
Some approach I tried
1. add rcu:
- RCU conflicts with mutex_lock in Unix socket send path.
- Race conditions still exist when reading sk->sk_socket->ops for in
current sock_sendmsg implementation.
2. Increased the reference of sk_socket->file:
- If the user calls close(fd), we will do nothing because the reference
count is not set to 0. It's unexpected.
- prev discussion:
https://lore.kernel.org/all/f2bd7e45b327d7b190edef4916d5b9e6dc83e87a@linux.dev/
3. Use sock_lock when calling skb_send_sock:
- skb_send_sock itself already do the locking.
- If we call skb_send_sock_locked instead, we have to implement
sendmsg_locked for each protocol, which is not easy for UDP or Unix,
as the sending process involves frequent locking and unlocking, which
makes it challenging to isolate the locking logic.
---
net/core/skmsg.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 0ddc4c718833..6101c1bb279a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -655,6 +655,14 @@ static void sk_psock_backlog(struct work_struct *work)
bool ingress;
int ret;
+ /* Increment the psock refcnt to synchronize with close(fd) path in
+ * sock_map_close(), ensuring we wait for backlog thread completion
+ * before sk_socket freed. If refcnt increment fails, it indicates
+ * sock_map_close() completed with sk_socket potentially already freed.
+ */
+ if (!sk_psock_get(psock->sk))
+ return;
+
mutex_lock(&psock->work_mutex);
if (unlikely(state->len)) {
len = state->len;
@@ -702,6 +710,7 @@ static void sk_psock_backlog(struct work_struct *work)
}
end:
mutex_unlock(&psock->work_mutex);
+ sk_psock_put(psock->sk, psock);
}
struct sk_psock *sk_psock_init(struct sock *sk, int node)
--
2.47.1
^ permalink raw reply related [flat|nested] 5+ messages in thread* [PATCH bpf-next v4 2/3] bpf, sockmap: avoid using sk_socket after free when reading
2025-04-08 7:29 [PATCH bpf-next v4 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
2025-04-08 7:29 ` [PATCH bpf-next v4 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
@ 2025-04-08 7:29 ` Jiayuan Chen
2025-04-10 3:02 ` Alexei Starovoitov
2025-04-08 7:29 ` [PATCH bpf-next v4 3/3] selftests/bpf: Add edge case tests for sockmap Jiayuan Chen
2 siblings, 1 reply; 5+ messages in thread
From: Jiayuan Chen @ 2025-04-08 7:29 UTC (permalink / raw)
To: bpf
Cc: mrpre, Jiayuan Chen, syzbot+dd90a702f518e0eac072, John Fastabend,
Jakub Sitnicki, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Mykola Lysenko, Shuah Khan, Michal Luczaj, Cong Wang, netdev,
linux-kernel, linux-kselftest
There are potential concurrency issues, as shown below.
'''
CPU0 CPU1
sk_psock_verdict_data_ready:
socket *sock = sk->sk_socket
if (!sock) return
close(fd):
...
ops->release()
if (!sock->ops) return
sock->ops = NULL
rcu_call(sock)
free(sock)
READ_ONCE(sock->ops)
^
use 'sock' after free
'''
RCU is not applicable to Unix sockets read path, because the Unix socket
implementation itself assumes it's always in process context and heavily
uses mutex_lock, so, we can't call read_skb within rcu lock.
Incrementing the psock reference count would not help either, since
sock_map_close() does not wait for data_ready() to complete its execution.
While we don't utilize sk_socket here, implementing read_skb at the sock
layer instead of socket layer might be architecturally preferable ?
However, deferring this optimization as current fix adequately addresses
the immediate issue.
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Reported-by: syzbot+dd90a702f518e0eac072@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6734c033.050a0220.2a2fcc.0015.GAE@google.com/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
net/core/skmsg.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6101c1bb279a..5e913b62929e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1231,17 +1231,24 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
static void sk_psock_verdict_data_ready(struct sock *sk)
{
- struct socket *sock = sk->sk_socket;
+ struct socket *sock;
const struct proto_ops *ops;
int copied;
trace_sk_data_ready(sk);
- if (unlikely(!sock))
+ rcu_read_lock();
+ sock = sk->sk_socket;
+ if (unlikely(!sock)) {
+ rcu_read_unlock();
return;
+ }
ops = READ_ONCE(sock->ops);
- if (!ops || !ops->read_skb)
+ if (!ops || !ops->read_skb) {
+ rcu_read_unlock();
return;
+ }
+ rcu_read_unlock();
copied = ops->read_skb(sk, sk_psock_verdict_recv);
if (copied >= 0) {
struct sk_psock *psock;
--
2.47.1
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH bpf-next v4 2/3] bpf, sockmap: avoid using sk_socket after free when reading
2025-04-08 7:29 ` [PATCH bpf-next v4 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
@ 2025-04-10 3:02 ` Alexei Starovoitov
0 siblings, 0 replies; 5+ messages in thread
From: Alexei Starovoitov @ 2025-04-10 3:02 UTC (permalink / raw)
To: Jiayuan Chen
Cc: bpf, Jiayuan Chen, syzbot+dd90a702f518e0eac072, John Fastabend,
Jakub Sitnicki, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Mykola Lysenko, Shuah Khan, Michal Luczaj, Cong Wang,
Network Development, LKML, open list:KERNEL SELFTEST FRAMEWORK
On Tue, Apr 8, 2025 at 12:31 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> There are potential concurrency issues, as shown below.
> '''
> CPU0 CPU1
> sk_psock_verdict_data_ready:
> socket *sock = sk->sk_socket
> if (!sock) return
> close(fd):
> ...
> ops->release()
> if (!sock->ops) return
> sock->ops = NULL
> rcu_call(sock)
> free(sock)
> READ_ONCE(sock->ops)
> ^
> use 'sock' after free
> '''
>
> RCU is not applicable to Unix sockets read path, because the Unix socket
> implementation itself assumes it's always in process context and heavily
> uses mutex_lock, so, we can't call read_skb within rcu lock.
>
> Incrementing the psock reference count would not help either, since
> sock_map_close() does not wait for data_ready() to complete its execution.
>
> While we don't utilize sk_socket here, implementing read_skb at the sock
> layer instead of socket layer might be architecturally preferable ?
> However, deferring this optimization as current fix adequately addresses
> the immediate issue.
>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Reported-by: syzbot+dd90a702f518e0eac072@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/bpf/6734c033.050a0220.2a2fcc.0015.GAE@google.com/
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
> net/core/skmsg.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 6101c1bb279a..5e913b62929e 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1231,17 +1231,24 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
>
> static void sk_psock_verdict_data_ready(struct sock *sk)
> {
> - struct socket *sock = sk->sk_socket;
> + struct socket *sock;
> const struct proto_ops *ops;
> int copied;
>
> trace_sk_data_ready(sk);
>
> - if (unlikely(!sock))
> + rcu_read_lock();
> + sock = sk->sk_socket;
> + if (unlikely(!sock)) {
> + rcu_read_unlock();
> return;
> + }
> ops = READ_ONCE(sock->ops);
> - if (!ops || !ops->read_skb)
> + if (!ops || !ops->read_skb) {
> + rcu_read_unlock();
> return;
> + }
> + rcu_read_unlock();
This makes no sense to me. RCU doesn't work this way.
pw-bot: cr
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH bpf-next v4 3/3] selftests/bpf: Add edge case tests for sockmap
2025-04-08 7:29 [PATCH bpf-next v4 0/3] bpf: Fix use-after-free of sockmap Jiayuan Chen
2025-04-08 7:29 ` [PATCH bpf-next v4 1/3] bpf, sockmap: avoid using sk_socket after free when sending Jiayuan Chen
2025-04-08 7:29 ` [PATCH bpf-next v4 2/3] bpf, sockmap: avoid using sk_socket after free when reading Jiayuan Chen
@ 2025-04-08 7:29 ` Jiayuan Chen
2 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2025-04-08 7:29 UTC (permalink / raw)
To: bpf
Cc: mrpre, Jiayuan Chen, John Fastabend, Jakub Sitnicki,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Michal Luczaj, Cong Wang, netdev, linux-kernel,
linux-kselftest
Add edge case tests for sockmap.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
.../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 60 +++++++++++++++++++
2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/socket_helpers.h b/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
index 1bdfb79ef009..a805143dd84f 100644
--- a/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/socket_helpers.h
@@ -313,11 +313,22 @@ static inline int recv_timeout(int fd, void *buf, size_t len, int flags,
static inline int create_pair(int family, int sotype, int *p0, int *p1)
{
- __close_fd int s, c = -1, p = -1;
+ __close_fd int s = -1, c = -1, p = -1;
struct sockaddr_storage addr;
socklen_t len = sizeof(addr);
int err;
+ if (family == AF_UNIX) {
+ int fds[2];
+
+ err = socketpair(family, sotype, 0, fds);
+ if (!err) {
+ *p0 = fds[0];
+ *p1 = fds[1];
+ }
+ return err;
+ }
+
s = socket_loopback(family, sotype);
if (s < 0)
return s;
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1e3e4392dcca..c72357f41035 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1042,6 +1042,59 @@ static void test_sockmap_vsock_unconnected(void)
xclose(map);
}
+void *close_thread(void *arg)
+{
+ int *fd = (int *)arg;
+
+ sleep(1);
+ close(*fd);
+ *fd = -1;
+ return NULL;
+}
+
+void test_sockmap_with_close_on_write(int family, int sotype)
+{
+ struct test_sockmap_pass_prog *skel;
+ int err, map, verdict;
+ pthread_t tid;
+ int zero = 0;
+ int c = -1, p = -1;
+
+ skel = test_sockmap_pass_prog__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ return;
+
+ verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+ map = bpf_map__fd(skel->maps.sock_map_rx);
+
+ err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (!ASSERT_OK(err, "bpf_prog_attach"))
+ goto out;
+
+ err = create_pair(family, sotype, &c, &p);
+ if (!ASSERT_OK(err, "create_pair"))
+ goto out;
+
+ err = bpf_map_update_elem(map, &zero, &p, BPF_ANY);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ err = pthread_create(&tid, 0, close_thread, &p);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto out;
+
+ while (p > 0)
+ send(c, "a", 1, MSG_NOSIGNAL);
+
+ pthread_join(tid, NULL);
+out:
+ if (c > 0)
+ close(c);
+ if (p > 0)
+ close(p);
+ test_sockmap_pass_prog__destroy(skel);
+}
+
void test_sockmap_basic(void)
{
if (test__start_subtest("sockmap create_update_free"))
@@ -1108,4 +1161,11 @@ void test_sockmap_basic(void)
test_sockmap_skb_verdict_vsock_poll();
if (test__start_subtest("sockmap vsock unconnected"))
test_sockmap_vsock_unconnected();
+ if (test__start_subtest("sockmap with write on close")) {
+ test_sockmap_with_close_on_write(AF_UNIX, SOCK_STREAM);
+ test_sockmap_with_close_on_write(AF_UNIX, SOCK_DGRAM);
+ test_sockmap_with_close_on_write(AF_INET, SOCK_STREAM);
+ test_sockmap_with_close_on_write(AF_INET, SOCK_DGRAM);
+ test_sockmap_with_close_on_write(AF_VSOCK, SOCK_STREAM);
+ }
}
--
2.47.1
^ permalink raw reply related [flat|nested] 5+ messages in thread