[PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg()
@ 2026-06-10 17:59 Sechang Lim
  2026-06-10 18:12 ` sashiko-bot
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Sechang Lim @ 2026-06-10 17:59 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Cong Wang, Alexei Starovoitov, netdev, bpf,
	linux-kernel

udp_bpf_recvmsg() drains the psock ingress_msg queue through
sk_msg_recvmsg() without holding the socket lock. __sk_msg_recvmsg()
walks the head sk_msg's scatterlist and updates each entry in place
(sge->length -= copy, sge->offset += copy), drops the page once an entry
is consumed, and frees the sk_msg when it is drained. ingress_lock only
serializes the queue list operations, not this per-entry consume;
copy_page_to_iter() can fault, so the loop cannot hold a spinlock.

Two threads calling recvmsg() on the same UDP socket then consume the
same sk_msg at once:

  CPU0                                CPU1

  sk_msg_recvmsg                      sk_msg_recvmsg
    copy = sge->length                 copy = sge->length
    copy_page_to_iter(copy)            copy_page_to_iter(copy)
    sge->length -= copy
      /* sge->length == 0 */
                                       sge->length -= copy
                                         /* underflows to ~0U */

The next pass loads the underflowed length into a signed int, the value
is negative so the copied + copy > len clamp is skipped, and
copy_page_to_iter() is handed ~2^64 as its size_t bytes, which trips
page_copy_sane(). The same race can put_page() one page twice and free
the sk_msg while the other reader still walks it.

  WARNING: CPU: 0 PID: 5484 at lib/iov_iter.c:356 copy_page_to_iter
  WARNING: CPU: 1 PID: 5485 at lib/iov_iter.c:356 copy_page_to_iter
  CPU: 1 UID: 0 PID: 5485 Comm: syz.6.1204 Not tainted 7.1.0-rc7 #1
  RIP: 0010:copy_page_to_iter+0xa9/0x270
  Call Trace:
   <TASK>
   __sk_msg_recvmsg+0x384/0x9a0
   udp_bpf_recvmsg+0x1a0/0x6f0
   sock_recvmsg+0x124/0x150
   __sys_recvfrom+0x1c2/0x2b0
   __x64_sys_recvfrom+0x7b/0x90
   do_syscall_64+0x115/0x6d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

tcp_bpf_recvmsg_parser() already serializes this with lock_sock() around
__sk_msg_recvmsg(); udp_bpf_recvmsg() does not. Take the socket lock
around the sk_msg_recvmsg() consume so a given ingress sk_msg is drained
by one reader at a time. The wait path is left unlocked, as
udp_msg_wait_data() sleeps without dropping the socket lock.

Fixes: 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/ipv4/udp_bpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 9f33b07b1481..8dd73e9ee1ea 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -80,7 +80,9 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}

 msg_bytes_ready:
+	lock_sock(sk);
 	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	release_sock(sk);
 	if (!copied) {
 		long timeo;
 		int data;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg()
  2026-06-10 17:59 [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg() Sechang Lim
@ 2026-06-10 18:12 ` sashiko-bot
  2026-06-10 18:35 ` bot+bpf-ci
  2026-06-11 21:19 ` Cong Wang
  2 siblings, 0 replies; 4+ messages in thread
From: sashiko-bot @ 2026-06-10 18:12 UTC (permalink / raw)
  To: Sechang Lim; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

Pre-existing issues:
- [High] `udp_bpf_recvmsg` returns `-EAGAIN` instead of `-EINTR` or `-ERESTARTSYS` on signal interruption for blocking sockets.
- [High] `udp_bpf_recvmsg` enters an infinite busy-wait loop if data is queued in `psock->ingress_skb` but not yet processed into `psock->ingress_msg`.
- [High] Lost wakeups and deadlocks for BPF-redirected packets when the UDP socket is initially empty.
- [High] `vsock_bpf_recvmsg` sleeps while holding the socket lock, causing a deadlock with incoming packet delivery.
--

commit 093cedb7b7f9c8e7af25b932fdeb58d606fcc2ea
Author: Sechang Lim <rhkrqnwk98@gmail.com>

bpf, sockmap: take the socket lock in udp_bpf_recvmsg()

udp_bpf_recvmsg() drains the psock ingress_msg queue through
sk_msg_recvmsg() without holding the socket lock. This commit
adds lock_sock() and release_sock() around the consume path
so a given ingress sk_msg is drained by one reader at a time.

> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> index 9f33b07b14813..8dd73e9ee1ea6 100644
> --- a/net/ipv4/udp_bpf.c
> +++ b/net/ipv4/udp_bpf.c

[Severity: High]
This isn't a bug introduced by this patch, but can the unconditional
delegation to sk_udp_recvmsg() in udp_bpf_recvmsg() cause lost wakeups
and deadlocks for BPF-redirected packets when the UDP socket is
initially empty?

If !psock_has_data(psock) is initially true, udp_bpf_recvmsg() does:

    if (!psock_has_data(psock)) {
        ret = sk_udp_recvmsg(sk, msg, len, flags);
        goto out;
    }

sk_udp_recvmsg() will sleep on the native socket wait queue checking
only sk->sk_receive_queue.

If a packet arrives and is redirected by a BPF program to
psock->ingress_msg, the wait queue is woken up, but because
sk_receive_queue is still empty, the native UDP wait loop
(__skb_wait_for_more_packets) will immediately go back to sleep.

Does this leave the packet stuck in psock and the user thread hanging
permanently?

> @@ -80,7 +80,9 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  	}
>  
>  msg_bytes_ready:
> +	lock_sock(sk);
>  	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
> +	release_sock(sk);

[Severity: High]
This is a pre-existing issue, but can this code enter an infinite
busy-wait loop if data is queued in psock->ingress_skb but not yet
processed into psock->ingress_msg?

If psock->ingress_skb is not empty, psock_has_data() evaluates to true.

Then udp_bpf_recvmsg() calls sk_msg_recvmsg(), which only reads from
ingress_msg, returning 0.

Because psock_has_data() is true, udp_msg_wait_data() returns 1
without sleeping. The loop sees data == 1 and psock_has_data() == true
and jumps back to msg_bytes_ready.

Does this create a tight spin loop that rapidly acquires and releases
lock_sock, causing CPU starvation until the backlog workqueue thread
finishes moving the skb?

>  	if (!copied) {
>  		long timeo;
>  		int data;

[Severity: High]
This is a pre-existing issue, but does udp_bpf_recvmsg() violate POSIX
semantics by returning -EAGAIN instead of -EINTR or -ERESTARTSYS on
signal interruption for blocking sockets?

If a blocking recvmsg is interrupted by a signal, wait_woken() inside
udp_msg_wait_data() returns prematurely, causing udp_msg_wait_data()
to return 0.

The code then unconditionally sets copied = -EAGAIN when data is 0:

    copied = -EAGAIN;

Does this completely ignore signal_pending(current)?

Additionally, there is a pre-existing issue in net/vmw_vsock/vsock_bpf.c:

[Severity: High]
This isn't a bug introduced by this patch, but does vsock_bpf_recvmsg()
sleep while holding the socket lock, causing a deadlock with incoming
packet delivery?

In vsock_bpf_recvmsg(), the code acquires lock_sock(sk) and calls
vsock_msg_wait_data() when no data is available:

vsock_bpf_recvmsg()
    lock_sock(sk);
    ...
    while (copied == 0) {
        long timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
        if (!vsock_msg_wait_data(sk, psock, timeo)) {

This function calls wait_woken() without releasing the socket lock:

vsock_msg_wait_data()
    wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);

Meanwhile, the Rx thread (virtio_transport_recv_pkt()) needs to acquire
lock_sock(sk) to deliver incoming packets and trigger sk_data_ready().

Because the sleeping user task holds the lock, does the Rx thread block
indefinitely, resulting in a deadlock where the user task is never
woken up?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610175922.1103871-1-rhkrqnwk98@gmail.com?part=1

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg()
  2026-06-10 17:59 [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg() Sechang Lim
  2026-06-10 18:12 ` sashiko-bot
@ 2026-06-10 18:35 ` bot+bpf-ci
  2026-06-11 21:19 ` Cong Wang
  2 siblings, 0 replies; 4+ messages in thread
From: bot+bpf-ci @ 2026-06-10 18:35 UTC (permalink / raw)
  To: rhkrqnwk98, john.fastabend, jakub, davem, edumazet, kuba, pabeni
  Cc: horms, cong.wang, ast, netdev, bpf, linux-kernel, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2619 bytes --]

> udp_bpf_recvmsg() drains the psock ingress_msg queue through
> sk_msg_recvmsg() without holding the socket lock. __sk_msg_recvmsg()
> walks the head sk_msg's scatterlist and updates each entry in place
> (sge->length -= copy, sge->offset += copy), drops the page once an entry
> is consumed, and frees the sk_msg when it is drained. ingress_lock only
> serializes the queue list operations, not this per-entry consume;
> copy_page_to_iter() can fault, so the loop cannot hold a spinlock.
>
> Two threads calling recvmsg() on the same UDP socket then consume the
> same sk_msg at once:
>
>   CPU0                                CPU1
>
>   sk_msg_recvmsg                      sk_msg_recvmsg
>     copy = sge->length                 copy = sge->length
>     copy_page_to_iter(copy)            copy_page_to_iter(copy)
>     sge->length -= copy
>       /* sge->length == 0 */
>                                        sge->length -= copy
>                                          /* underflows to ~0U */
>
> The next pass loads the underflowed length into a signed int, the value
> is negative so the copied + copy > len clamp is skipped, and
> copy_page_to_iter() is handed ~2^64 as its size_t bytes, which trips
> page_copy_sane(). The same race can put_page() one page twice and free
> the sk_msg while the other reader still walks it.
>
> tcp_bpf_recvmsg_parser() already serializes this with lock_sock() around
> __sk_msg_recvmsg(); udp_bpf_recvmsg() does not. Take the socket lock
> around the sk_msg_recvmsg() consume so a given ingress sk_msg is drained
> by one reader at a time. The wait path is left unlocked, as
> udp_msg_wait_data() sleeps without dropping the socket lock.
>
> Fixes: 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")

Is the Fixes: tag pointing at the right commit?

Commit 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")
introduced udp_bpf_recvmsg() with a lock_sock(sk)/release_sock(sk) pair
wrapping the whole function, including the sk_msg_recvmsg() consume, so the
race described here was not present at that point.

The unlocked consume appears to have been introduced later by
9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy"), whose diff
removes both lock_sock(sk) and release_sock(sk) from udp_bpf_recvmsg().

Should the tag instead be:

  Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27296268347

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg()
  2026-06-10 17:59 [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg() Sechang Lim
  2026-06-10 18:12 ` sashiko-bot
  2026-06-10 18:35 ` bot+bpf-ci
@ 2026-06-11 21:19 ` Cong Wang
  2 siblings, 0 replies; 4+ messages in thread
From: Cong Wang @ 2026-06-11 21:19 UTC (permalink / raw)
  To: Sechang Lim
  Cc: John Fastabend, Jakub Sitnicki, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Cong Wang,
	Alexei Starovoitov, netdev, bpf, linux-kernel

On Wed, Jun 10, 2026 at 05:59:17PM +0000, Sechang Lim wrote:
> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> index 9f33b07b1481..8dd73e9ee1ea 100644
> --- a/net/ipv4/udp_bpf.c
> +++ b/net/ipv4/udp_bpf.c
> @@ -80,7 +80,9 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  	}
>  
>  msg_bytes_ready:
> +	lock_sock(sk);
>  	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
> +	release_sock(sk);
>  	if (!copied) {
>  		long timeo;
>  		int data;

It seems Kuniyuki's fix covers more than yours:
https://lore.kernel.org/bpf/20260221233234.3814768-4-kuniyu@google.com/

Regards,
Cong

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-11 21:19 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10 17:59 [PATCH bpf] bpf, sockmap: take the socket lock in udp_bpf_recvmsg() Sechang Lim
2026-06-10 18:12 ` sashiko-bot
2026-06-10 18:35 ` bot+bpf-ci
2026-06-11 21:19 ` Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.