* [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP.
@ 2026-02-21 23:30 Kuniyuki Iwashima
2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima
` (5 more replies)
0 siblings, 6 replies; 31+ messages in thread
From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki
Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf,
netdev
syzbot reported 3 issues in SOCKMAP for UDP.
Patches 1 and 2 fix lockless accesses to sk->sk_data_ready()
and sk->sk_write_space().
Patch 3 fixes UAF in sk_msg_recvmsg().
Patches 4 and 5 consolidate sk_psock_skb_ingress_self() into
sk_psock_skb_ingress() as prep.
Patch 6 fixes broken memory accounting.
Changes:
v4:
Patch 4: Don't pass gfp_flags and instead change it based on
take_ref in the next patch
Patch 5: Fix memory accounting condition from (skb->sk != sk) to
(skb->sk != sk && take_ref), the cause of selftest failure
Patch 6: Fix deadlock when requeued
v3: https://lore.kernel.org/netdev/20260219173756.315077-1-kuniyu@google.com/
Patch 2: Use WRITE_ONCE() in udp_bpf_update_proto()
v2: https://lore.kernel.org/netdev/20260217000701.791189-1-kuniyu@google.com/
Patch 2: Cache sk->sk_write_space in sock_wfree()
Patch 5: Keep msg->sk assignment
Patch 6: Fix build failure when CONFIG_INET=n
v1: https://lore.kernel.org/netdev/20260215204353.3645744-1-kuniyu@google.com/
Kuniyuki Iwashima (6):
sockmap: Annotate sk->sk_data_ready() for UDP.
sockmap: Annotate sk->sk_write_space() for UDP.
sockmap: Fix use-after-free in udp_bpf_recvmsg().
sockmap: Inline sk_psock_create_ingress_msg().
sockmap: Consolidate sk_psock_skb_ingress_self().
sockmap: Fix broken memory accounting for UDP.
 include/net/udp.h  |   9 ++++
 net/core/skmsg.c   | 100 +++++++++++++++++++++------------------------
 net/core/sock.c    |   8 +++-
 net/ipv4/udp.c     |  11 ++++-
 net/ipv4/udp_bpf.c |  11 ++++-
 5 files changed, 81 insertions(+), 58 deletions(-)
--
2.53.0.371.g1d285c8824-goog
^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() for UDP.
  2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima
@ 2026-02-21 23:30 ` Kuniyuki Iwashima
  2026-03-05 11:05   ` Jakub Sitnicki
  2026-03-05 11:27   ` Jiayuan Chen
  2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima
  ` (4 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki
  Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf,
	netdev, syzbot+113cea56c13a8a1e95ab

syzbot reported data race of sk->sk_data_ready(). [0]

UDP fast path does not hold bh_lock_sock(), instead
spin_lock_bh(&sk->sk_receive_queue.lock) is used.

Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_data_ready().

Another option is to hold sk->sk_receive_queue.lock in
sock_map_sk_acquire() if sk_is_udp() is true, but this is
overkill and also does not work for sk->sk_write_space().

[0]:
BUG: KCSAN: data-race in __udp_enqueue_schedule_skb / sk_psock_drop

write to 0xffff88811d063048 of 8 bytes by task 23114 on cpu 0:
 sk_psock_stop_verdict net/core/skmsg.c:1287 [inline]
 sk_psock_drop+0x12f/0x270 net/core/skmsg.c:873
 sk_psock_put include/linux/skmsg.h:473 [inline]
 sock_map_unref+0x2a5/0x300 net/core/sock_map.c:185
 __sock_map_delete net/core/sock_map.c:426 [inline]
 sock_map_delete_from_link net/core/sock_map.c:439 [inline]
 sock_map_unlink net/core/sock_map.c:1608 [inline]
 sock_map_remove_links+0x228/0x340 net/core/sock_map.c:1623
 sock_map_close+0xa1/0x340 net/core/sock_map.c:1684
 inet_release+0xcd/0xf0 net/ipv4/af_inet.c:437
 __sock_release net/socket.c:662 [inline]
 sock_close+0x6b/0x150 net/socket.c:1455
 __fput+0x29b/0x650 fs/file_table.c:468
 ____fput+0x1c/0x30 fs/file_table.c:496
 task_work_run+0x130/0x1a0 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:44 [inline]
 exit_to_user_mode_loop+0x1f7/0x6f0 kernel/entry/common.c:75
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
 do_syscall_64+0x1d3/0x2a0 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811d063048 of 8 bytes by task 23117 on cpu 1:
 __udp_enqueue_schedule_skb+0x6c1/0x840 net/ipv4/udp.c:1789
 __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline]
 udp_queue_rcv_one_skb+0x709/0xc20 net/ipv4/udp.c:2475
 udp_queue_rcv_skb+0x20e/0x2b0 net/ipv4/udp.c:2493
 __udp4_lib_mcast_deliver+0x6e8/0x790 net/ipv4/udp.c:2585
 __udp4_lib_rcv+0x96f/0x1260 net/ipv4/udp.c:2724
 udp_rcv+0x4f/0x60 net/ipv4/udp.c:2911
 ip_protocol_deliver_rcu+0x3f9/0x780 net/ipv4/ip_input.c:207
 ip_local_deliver_finish+0x1fc/0x2f0 net/ipv4/ip_input.c:241
 NF_HOOK include/linux/netfilter.h:318 [inline]
 ip_local_deliver+0xe8/0x1e0 net/ipv4/ip_input.c:262
 dst_input include/net/dst.h:474 [inline]
 ip_sublist_rcv_finish net/ipv4/ip_input.c:584 [inline]
 ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline]
 ip_sublist_rcv+0x42b/0x6d0 net/ipv4/ip_input.c:644
 ip_list_rcv+0x261/0x290 net/ipv4/ip_input.c:678
 __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline]
 __netif_receive_skb_list_core+0x4dc/0x500 net/core/dev.c:6242
 __netif_receive_skb_list net/core/dev.c:6294 [inline]
 netif_receive_skb_list_internal+0x47d/0x5f0 net/core/dev.c:6385
 netif_receive_skb_list+0x31/0x1f0 net/core/dev.c:6437
 xdp_recv_frames net/bpf/test_run.c:269 [inline]
 xdp_test_run_batch net/bpf/test_run.c:350 [inline]
 bpf_test_run_xdp_live+0x104c/0x1360 net/bpf/test_run.c:379
 bpf_prog_test_run_xdp+0x57b/0xa10 net/bpf/test_run.c:1396
 bpf_prog_test_run+0x204/0x340 kernel/bpf/syscall.c:4703
 __sys_bpf+0x4c0/0x7b0 kernel/bpf/syscall.c:6182
 __do_sys_bpf kernel/bpf/syscall.c:6274 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:6272 [inline]
 __x64_sys_bpf+0x41/0x50 kernel/bpf/syscall.c:6272
 x64_sys_call+0x28e1/0x3000 arch/x86/include/generated/asm/syscalls_64.h:322
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0xffffffff847b24d0 -> 0xffffffff84673410

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 23117 Comm: syz.8.5085 Tainted: G W syzkaller #0 PREEMPT(voluntary)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025

Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
Reported-by: syzbot+113cea56c13a8a1e95ab@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e1.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/skmsg.c | 4 ++--
 net/ipv4/udp.c   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ddde93dd8bc6..75fa94217e1e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1296,7 +1296,7 @@ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
 		return;
 
 	psock->saved_data_ready = sk->sk_data_ready;
-	sk->sk_data_ready = sk_psock_verdict_data_ready;
+	WRITE_ONCE(sk->sk_data_ready, sk_psock_verdict_data_ready);
 	sk->sk_write_space = sk_psock_write_space;
 }
 
@@ -1308,6 +1308,6 @@ void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
 	if (!psock->saved_data_ready)
 		return;
 
-	sk->sk_data_ready = psock->saved_data_ready;
+	WRITE_ONCE(sk->sk_data_ready, psock->saved_data_ready);
 	psock->saved_data_ready = NULL;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b96e47f1c8a2..422c96fea249 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1787,7 +1787,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * using prepare_to_wait_exclusive().
 	 */
 	while (nb) {
-		INDIRECT_CALL_1(sk->sk_data_ready,
+		INDIRECT_CALL_1(READ_ONCE(sk->sk_data_ready),
 				sock_def_readable, sk);
 		nb--;
 	}
-- 
2.53.0.371.g1d285c8824-goog

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() for UDP.
  2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima
@ 2026-03-05 11:05   ` Jakub Sitnicki
  2026-03-05 11:27   ` Jiayuan Chen
  1 sibling, 0 replies; 31+ messages in thread
From: Jakub Sitnicki @ 2026-03-05 11:05 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev,
	syzbot+113cea56c13a8a1e95ab

On Sat, Feb 21, 2026 at 11:30 PM GMT, Kuniyuki Iwashima wrote:
> syzbot reported data race of sk->sk_data_ready(). [0]
>
> UDP fast path does not hold bh_lock_sock(), instead
> spin_lock_bh(&sk->sk_receive_queue.lock) is used.
>
> Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_data_ready().
>
> Another option is to hold sk->sk_receive_queue.lock in
> sock_map_sk_acquire() if sk_is_udp() is true, but this is
> overkill and also does not work for sk->sk_write_space().
>
> [0]:
> BUG: KCSAN: data-race in __udp_enqueue_schedule_skb / sk_psock_drop
>
[...]
>
> Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
> Reported-by: syzbot+113cea56c13a8a1e95ab@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e1.GAE@google.com/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---

Sorry for the delay. Got caught up in skb metadata stuff...

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() for UDP.
  2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima
  2026-03-05 11:05   ` Jakub Sitnicki
@ 2026-03-05 11:27   ` Jiayuan Chen
  1 sibling, 0 replies; 31+ messages in thread
From: Jiayuan Chen @ 2026-03-05 11:27 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn,
	Kuniyuki Iwashima, bpf, netdev, syzbot+113cea56c13a8a1e95ab

On Sat, Feb 21, 2026 at 11:30:48PM +0800, Kuniyuki Iwashima wrote:
> syzbot reported data race of sk->sk_data_ready(). [0]
>
> UDP fast path does not hold bh_lock_sock(), instead
> spin_lock_bh(&sk->sk_receive_queue.lock) is used.
>
> Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_data_ready().
>
> Another option is to hold sk->sk_receive_queue.lock in
> sock_map_sk_acquire() if sk_is_udp() is true, but this is
> overkill and also does not work for sk->sk_write_space().
>
[...]
>
> Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
> Reported-by: syzbot+113cea56c13a8a1e95ab@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e1.GAE@google.com/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP.
  2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima
  2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima
@ 2026-02-21 23:30 ` Kuniyuki Iwashima
  2026-03-05  1:48   ` Jiayuan Chen
  ` (2 more replies)
  2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima
  ` (3 subsequent siblings)
  5 siblings, 3 replies; 31+ messages in thread
From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki
  Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf,
	netdev

UDP TX skb->destructor() is sock_wfree(), and UDP only
holds lock_sock() for UDP_CORK / MSG_MORE sendmsg().

Otherwise, sk->sk_write_space() is read locklessly.

Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space().

Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v3: Use WRITE_ONCE() in udp_bpf_update_proto()
v2: Cache sk->sk_write_space in sock_wfree()
---
 net/core/skmsg.c   | 2 +-
 net/core/sock.c    | 8 ++++++--
 net/ipv4/udp_bpf.c | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 75fa94217e1e..3d7eb2f4ac98 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1297,7 +1297,7 @@ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
 
 	psock->saved_data_ready = sk->sk_data_ready;
 	WRITE_ONCE(sk->sk_data_ready, sk_psock_verdict_data_ready);
-	sk->sk_write_space = sk_psock_write_space;
+	WRITE_ONCE(sk->sk_write_space, sk_psock_write_space);
 }
 
 void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
diff --git a/net/core/sock.c b/net/core/sock.c
index 693e6d80f501..710f57ff3768 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2673,8 +2673,12 @@ void sock_wfree(struct sk_buff *skb)
 	int old;
 
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
+		void (*sk_write_space)(struct sock *sk);
+
+		sk_write_space = READ_ONCE(sk->sk_write_space);
+
 		if (sock_flag(sk, SOCK_RCU_FREE) &&
-		    sk->sk_write_space == sock_def_write_space) {
+		    sk_write_space == sock_def_write_space) {
 			rcu_read_lock();
 			free = __refcount_sub_and_test(len, &sk->sk_wmem_alloc,
 						       &old);
@@ -2690,7 +2694,7 @@ void sock_wfree(struct sk_buff *skb)
 		 * after sk_write_space() call
 		 */
 		WARN_ON(refcount_sub_and_test(len - 1, &sk->sk_wmem_alloc));
-		sk->sk_write_space(sk);
+		sk_write_space(sk);
 		len = 1;
 	}
 	/*
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 91233e37cd97..779a3a03762f 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -158,7 +158,7 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 	int family = sk->sk_family == AF_INET ? UDP_BPF_IPV4 : UDP_BPF_IPV6;
 
 	if (restore) {
-		sk->sk_write_space = psock->saved_write_space;
+		WRITE_ONCE(sk->sk_write_space, psock->saved_write_space);
 		sock_replace_proto(sk, psock->sk_proto);
 		return 0;
 	}
-- 
2.53.0.371.g1d285c8824-goog

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP.
  2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima
@ 2026-03-05  1:48   ` Jiayuan Chen
  2026-03-05  3:43     ` Kuniyuki Iwashima
  2026-03-05 11:35   ` Jiayuan Chen
  2026-03-05 11:51   ` Jakub Sitnicki
  2 siblings, 1 reply; 31+ messages in thread
From: Jiayuan Chen @ 2026-03-05 1:48 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn,
	Kuniyuki Iwashima, bpf, netdev

On Sat, Feb 21, 2026 at 11:30:49PM +0800, Kuniyuki Iwashima wrote:
> UDP TX skb->destructor() is sock_wfree(), and UDP only
> holds lock_sock() for UDP_CORK / MSG_MORE sendmsg().
>
> Otherwise, sk->sk_write_space() is read locklessly.
>
> Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space().
>
> Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> ---
> v3: Use WRITE_ONCE() in udp_bpf_update_proto()
> v2: Cache sk->sk_write_space in sock_wfree()
> ---
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 75fa94217e1e..3d7eb2f4ac98 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -1297,7 +1297,7 @@ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
>
> 	psock->saved_data_ready = sk->sk_data_ready;
> 	WRITE_ONCE(sk->sk_data_ready, sk_psock_verdict_data_ready);
> -	sk->sk_write_space = sk_psock_write_space;
> +	WRITE_ONCE(sk->sk_write_space, sk_psock_write_space);
> }

I noticed that Patch 1 and 2 were submitted earlier and are on bpf tree.
Given that Eric's commit (which you reviewed) is now in net.git, these two patches
should probably be dropped in the next version, right? Just confirming.

https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=2ef2b20cf4e04ac8a6ba68493f8780776ff84300

[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP.
  2026-03-05  1:48   ` Jiayuan Chen
@ 2026-03-05  3:43     ` Kuniyuki Iwashima
  2026-03-07  0:03       ` Martin KaFai Lau
  0 siblings, 1 reply; 31+ messages in thread
From: Kuniyuki Iwashima @ 2026-03-05 3:43 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn,
	Kuniyuki Iwashima, bpf, netdev

On Wed, Mar 4, 2026 at 5:48 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> On Sat, Feb 21, 2026 at 11:30:49PM +0800, Kuniyuki Iwashima wrote:
[...]
>
> I noticed that Patch 1 and 2 were submitted earlier and are on bpf tree.
> Given that Eric's commit (which you reviewed) is now in net.git, these two patches
> should probably be dropped in the next version, right? Just confirming.

I was thinking git can just handle them well, and even the
patch 2 has a delta in sock_wfree().

> https://web.git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=2ef2b20cf4e04ac8a6ba68493f8780776ff84300
[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP. 2026-03-05 3:43 ` Kuniyuki Iwashima @ 2026-03-07 0:03 ` Martin KaFai Lau 2026-03-07 2:51 ` Kuniyuki Iwashima 0 siblings, 1 reply; 31+ messages in thread From: Martin KaFai Lau @ 2026-03-07 0:03 UTC (permalink / raw) To: Kuniyuki Iwashima, Jiayuan Chen Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev On 3/4/26 7:43 PM, Kuniyuki Iwashima wrote: > On Wed, Mar 4, 2026 at 5:48 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: >> >> On Sat, Feb 21, 2026 at 11:30:49PM +0800, Kuniyuki Iwashima wrote: >>> UDP TX skb->destructor() is sock_wfree(), and UDP only >>> holds lock_sock() for UDP_CORK / MSG_MORE sendmsg(). >>> >>> Otherwise, sk->sk_write_space() is read locklessly. >>> >>> Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space(). >>> >>> Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") >>> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> >>> --- >>> v3: Use WRITE_ONCE() in udp_bpf_update_proto() >>> v2: Cache sk->sk_write_space in sock_wfree() >>> --- >>> net/core/skmsg.c | 2 +- >>> net/core/sock.c | 8 ++++++-- >>> net/ipv4/udp_bpf.c | 2 +- >>> 3 files changed, 8 insertions(+), 4 deletions(-) >>> >>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c >>> index 75fa94217e1e..3d7eb2f4ac98 100644 >>> --- a/net/core/skmsg.c >>> +++ b/net/core/skmsg.c >>> @@ -1297,7 +1297,7 @@ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock) >>> >>> psock->saved_data_ready = sk->sk_data_ready; >>> WRITE_ONCE(sk->sk_data_ready, sk_psock_verdict_data_ready); >>> - sk->sk_write_space = sk_psock_write_space; >>> + WRITE_ONCE(sk->sk_write_space, sk_psock_write_space); >>> } >> >> I noticed that Patch 1 and 2 were submitted earlier and are on bpf tree. >> Given that Eric's commit (which you reviewed) is now in net.git, these two patches >> should probably be dropped in the next version, right? Just confirming. 
> > I was thinking git can just handle them well, and even the
> > patch 2 has a delta in sock_wfree().

At least for patch 1, I am not sure git can handle it well without
human intervention, because the sk_write_space change is the reverse
of Eric's.

For the missing change in sock_wfree() (?), it probably makes sense to
create a separate patch on top of Eric's change and target net.

My understanding is that patches 3-6 can stand on their own without
patches 1+2.
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP. 2026-03-07 0:03 ` Martin KaFai Lau @ 2026-03-07 2:51 ` Kuniyuki Iwashima 0 siblings, 0 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-07 2:51 UTC (permalink / raw) To: Martin KaFai Lau Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev On Fri, Mar 6, 2026 at 4:03 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > > > On 3/4/26 7:43 PM, Kuniyuki Iwashima wrote: > > On Wed, Mar 4, 2026 at 5:48 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > >> > >> On Sat, Feb 21, 2026 at 11:30:49PM +0800, Kuniyuki Iwashima wrote: > >>> UDP TX skb->destructor() is sock_wfree(), and UDP only > >>> holds lock_sock() for UDP_CORK / MSG_MORE sendmsg(). > >>> > >>> Otherwise, sk->sk_write_space() is read locklessly. > >>> > >>> Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space(). > >>> > >>> Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") > >>> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > >>> --- > >>> v3: Use WRITE_ONCE() in udp_bpf_update_proto() > >>> v2: Cache sk->sk_write_space in sock_wfree() > >>> --- > >>> net/core/skmsg.c | 2 +- > >>> net/core/sock.c | 8 ++++++-- > >>> net/ipv4/udp_bpf.c | 2 +- > >>> 3 files changed, 8 insertions(+), 4 deletions(-) > >>> > >>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c > >>> index 75fa94217e1e..3d7eb2f4ac98 100644 > >>> --- a/net/core/skmsg.c > >>> +++ b/net/core/skmsg.c > >>> @@ -1297,7 +1297,7 @@ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock) > >>> > >>> psock->saved_data_ready = sk->sk_data_ready; > >>> WRITE_ONCE(sk->sk_data_ready, sk_psock_verdict_data_ready); > >>> - sk->sk_write_space = sk_psock_write_space; > >>> + WRITE_ONCE(sk->sk_write_space, sk_psock_write_space); > >>> } > >> > >> I noticed that Patch 1 and 2 were submitted earlier and are on bpf tree. 
> >> Given that Eric's commit (which you reviewed) is now in net.git,
> >> these two patches should probably be dropped in the next version,
> >> right? Just confirming.
> >
> > I was thinking git can just handle them well, and even the
> > patch 2 has a delta in sock_wfree().
>
> At least for patch 1, I am not sure git can handle it well without
> human intervention, because the sk_write_space change is the reverse
> of Eric's.
>
> For the missing change in sock_wfree() (?), it probably makes sense
> to create a separate patch on top of Eric's change and target net.
> My understanding is that patches 3-6 can stand on their own without
> patches 1+2.

Sure, I posted 1 & 2 in the same series as I thought they were sockmap
stuff, but I will post 3-6 separately.
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP. 2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima 2026-03-05 1:48 ` Jiayuan Chen @ 2026-03-05 11:35 ` Jiayuan Chen 2026-03-05 11:51 ` Jakub Sitnicki 2 siblings, 0 replies; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 11:35 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev On Sat, Feb 21, 2026 at 11:30:49PM +0800, Kuniyuki Iwashima wrote: > UDP TX skb->destructor() is sock_wfree(), and UDP only > holds lock_sock() for UDP_CORK / MSG_MORE sendmsg(). > > Otherwise, sk->sk_write_space() is read locklessly. > > Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space(). > > Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> > --- > v3: Use WRITE_ONCE() in udp_bpf_update_proto() > v2: Cache sk->sk_write_space in sock_wfree() > --- > net/core/skmsg.c | 2 +- > net/core/sock.c | 8 ++++++-- > net/ipv4/udp_bpf.c | 2 +- > 3 files changed, 8 insertions(+), 4 deletions(-) > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 75fa94217e1e..3d7eb2f4ac98 100644 [...] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() for UDP. 2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima 2026-03-05 1:48 ` Jiayuan Chen 2026-03-05 11:35 ` Jiayuan Chen @ 2026-03-05 11:51 ` Jakub Sitnicki 2 siblings, 0 replies; 31+ messages in thread From: Jakub Sitnicki @ 2026-03-05 11:51 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev On Sat, Feb 21, 2026 at 11:30 PM GMT, Kuniyuki Iwashima wrote: > UDP TX skb->destructor() is sock_wfree(), and UDP only > holds lock_sock() for UDP_CORK / MSG_MORE sendmsg(). > > Otherwise, sk->sk_write_space() is read locklessly. > > Let's use WRITE_ONCE() and READ_ONCE() for sk->sk_write_space(). > > Fixes: 7b98cd42b049 ("bpf: sockmap: Add UDP support") > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- > v3: Use WRITE_ONCE() in udp_bpf_update_proto() > v2: Cache sk->sk_write_space in sock_wfree() > --- I agree with Jiayuan that it would be less confusing to drop the overlapping parts but Eric's patch went through net tree and this is targeted for bpf tree (?). I'm actually not sure what exactly the bpf/net target signifies. So if it makes everyone's life easier: Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima 2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima 2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima @ 2026-02-21 23:30 ` Kuniyuki Iwashima 2026-03-05 2:30 ` Jiayuan Chen ` (2 more replies) 2026-02-21 23:30 ` [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg() Kuniyuki Iwashima ` (2 subsequent siblings) 5 siblings, 3 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw) To: John Fastabend, Jakub Sitnicki Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, but its processing is lockless. Thus, sk_msg_recvmsg() must be serialised by callers, otherwise multiple threads could touch the same sk_msg. For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited commit accidentally removed it. Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not an option due to copy_page_to_iter() in sk_msg_recvmsg(). 
[0]: BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xba/0x230 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 sock_recvmsg_nosec net/socket.c:1078 [inline] sock_recvmsg+0x1a8/0x270 net/socket.c:1100 ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 ___sys_recvmsg+0x215/0x590 net/socket.c:2854 do_recvmmsg+0x334/0x800 net/socket.c:2949 __sys_recvmmsg net/socket.c:3023 [inline] __do_sys_recvmmsg net/socket.c:3046 [inline] __se_sys_recvmmsg net/socket.c:3039 [inline] __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb319f9aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9 RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004 RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 00007ffe21dd0a98 </TASK> Allocated by task 6019: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 
poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780 kmalloc_noprof include/linux/slab.h:957 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] alloc_sk_msg net/core/skmsg.c:510 [inline] sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612 sk_psock_verdict_apply net/core/skmsg.c:1038 [inline] sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236 udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045 sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257 __udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789 __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline] udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475 __udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585 __udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724 ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318 dst_input include/net/dst.h:474 [inline] ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584 ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline] ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644 ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678 __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline] __netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242 __netif_receive_skb_list net/core/dev.c:6294 [inline] netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385 netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437 xdp_recv_frames net/bpf/test_run.c:269 [inline] xdp_test_run_batch net/bpf/test_run.c:350 [inline] bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379 bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396 bpf_prog_test_run+0x2c7/0x340 kernel/bpf/syscall.c:4703 __sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182 __do_sys_bpf kernel/bpf/syscall.c:6274 [inline] __se_sys_bpf 
kernel/bpf/syscall.c:6272 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 6021: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free mm/slub.c:6674 [inline] kfree+0x1be/0x650 mm/slub.c:6882 kfree_sk_msg include/linux/skmsg.h:385 [inline] sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483 udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 sock_recvmsg_nosec net/socket.c:1078 [inline] sock_recvmsg+0x1a8/0x270 net/socket.c:1100 ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 ___sys_recvmsg+0x215/0x590 net/socket.c:2854 do_recvmmsg+0x334/0x800 net/socket.c:2949 __sys_recvmmsg net/socket.c:3023 [inline] __do_sys_recvmmsg net/socket.c:3046 [inline] __se_sys_recvmmsg net/socket.c:3039 [inline] __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- net/ipv4/udp_bpf.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c index 779a3a03762f..b07774034a91 100644 --- a/net/ipv4/udp_bpf.c +++ b/net/ipv4/udp_bpf.c @@ -52,7 +52,9 @@ static int udp_msg_wait_data(struct sock *sk, struct sk_psock *psock, 
 	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 	ret = udp_msg_has_data(sk, psock);
 	if (!ret) {
+		release_sock(sk);
 		wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);
+		lock_sock(sk);
 		ret = udp_msg_has_data(sk, psock);
 	}
 	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
@@ -81,6 +83,7 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		goto out;
 	}
 
+	lock_sock(sk);
 msg_bytes_ready:
 	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
 	if (!copied) {
@@ -92,11 +95,17 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		if (data) {
 			if (psock_has_data(psock))
 				goto msg_bytes_ready;
+
+			release_sock(sk);
+
 			ret = sk_udp_recvmsg(sk, msg, len, flags, addr_len);
 			goto out;
 		}
 		copied = -EAGAIN;
 	}
+
+	release_sock(sk);
+
 	ret = copied;
 out:
 	sk_psock_put(sk, psock);
-- 
2.53.0.371.g1d285c8824-goog
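The udp_msg_wait_data() hunk above drops lock_sock() across the sleep and re-takes it before re-checking the queue, so that sk_msg_recvmsg() always runs with the lock held while producers can still make progress during the wait. A userspace sketch of that shape, with a pthread mutex/condvar standing in for lock_sock()/wait_woken(); all names are illustrative, not kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace model of the udp_msg_wait_data() pattern: the lock must be
 * released across the sleep (otherwise the producer could never enqueue
 * and wake us) and re-acquired before the queue is re-checked. */

struct queue {
	pthread_mutex_t lock;	/* stands in for lock_sock() */
	pthread_cond_t woken;	/* stands in for wait_woken() */
	int len;		/* stands in for psock_has_data() */
};

/* Caller holds q->lock, mirroring udp_bpf_recvmsg() taking lock_sock()
 * before calling in. Returns true once data is available. */
static bool wait_data(struct queue *q)
{
	bool ret = q->len > 0;

	if (!ret) {
		/* release_sock(sk); wait_woken(...); lock_sock(sk);
		 * pthread_cond_wait() atomically drops and re-takes
		 * the mutex around the sleep, which is the same shape. */
		while (q->len == 0)
			pthread_cond_wait(&q->woken, &q->lock);
		ret = q->len > 0;	/* re-check under the lock */
	}
	return ret;
}

/* Model of the BH Rx path enqueueing into psock->ingress_msg. */
static void enqueue(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->len++;
	pthread_cond_signal(&q->woken);
	pthread_mutex_unlock(&q->lock);
}
```

The condvar makes the drop/re-take atomic with the sleep; the kernel code gets the same guarantee from re-checking udp_msg_has_data() after lock_sock().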
* Re: [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima @ 2026-03-05 2:30 ` Jiayuan Chen 2026-03-05 3:41 ` Kuniyuki Iwashima 2026-03-05 11:36 ` Jiayuan Chen 2026-03-05 11:39 ` Jakub Sitnicki 2 siblings, 1 reply; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 2:30 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 On Sat, Feb 21, 2026 at 11:30:50PM +0800, Kuniyuki Iwashima wrote: > syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] > > sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, > but its processing is lockless. > > Thus, sk_msg_recvmsg() must be serialised by callers, otherwise > multiple threads could touch the same sk_msg. > > For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. > > Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited > commit accidentally removed it. > > Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). > > Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not > an option due to copy_page_to_iter() in sk_msg_recvmsg(). > > [0]: > BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 > > CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > Call Trace: > <TASK> > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:378 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > [...] 
> Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") > Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- > net/ipv4/udp_bpf.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c > index 779a3a03762f..b07774034a91 100644 > --- a/net/ipv4/udp_bpf.c > +++ b/net/ipv4/udp_bpf.c > @@ -52,7 +52,9 @@ static int udp_msg_wait_data(struct sock *sk, struct sk_psock *psock, > sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); > ret = udp_msg_has_data(sk, psock); > if (!ret) { > + release_sock(sk); > wait_woken(&wait, TASK_INTERRUPTIBLE, timeo); > + lock_sock(sk); > ret = udp_msg_has_data(sk, psock); > } > sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); > @@ -81,6 +83,7 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > goto out; > } > > + lock_sock(sk); > msg_bytes_ready: > copied = sk_msg_recvmsg(sk, psock, msg, len, flags); > if (!copied) { > @@ -92,11 +95,17 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > if (data) { > if (psock_has_data(psock)) > goto msg_bytes_ready; > + > + release_sock(sk); > + > ret = sk_udp_recvmsg(sk, msg, len, flags, addr_len); > goto out; > } > copied = -EAGAIN; > } > + > + release_sock(sk); > + > ret = copied; > out: > sk_psock_put(sk, psock); > -- > 2.53.0.371.g1d285c8824-goog > LGTM. One minor observation for a possible follow-up: psock_has_data() is checked before lock_sock(), which leaves a TOCTOU window. If the BPF verdict path consumes the skb from the UDP receive queue and redirects it to psock->ingress_msg between the check and sk_udp_recvmsg(), the thread may block in __skb_recv_udp() while data is already available in the psock. 
Moving lock_sock() before the psock_has_data() check (as the original
code did before 9f2470fbc4cb) would narrow this window, similar to how
__tcp_bpf_recvmsg() acquires lock_sock() upfront. But that can be a
separate patch.
* Re: [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-03-05 2:30 ` Jiayuan Chen @ 2026-03-05 3:41 ` Kuniyuki Iwashima 0 siblings, 0 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-05 3:41 UTC (permalink / raw) To: Jiayuan Chen Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 On Wed, Mar 4, 2026 at 6:30 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > On Sat, Feb 21, 2026 at 11:30:50PM +0800, Kuniyuki Iwashima wrote: > > syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] > > > > sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, > > but its processing is lockless. > > > > Thus, sk_msg_recvmsg() must be serialised by callers, otherwise > > multiple threads could touch the same sk_msg. > > > > For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. > > > > Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited > > commit accidentally removed it. > > > > Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). > > > > Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not > > an option due to copy_page_to_iter() in sk_msg_recvmsg(). > > > > [0]: > > BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > > Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 > > > > CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > > Call Trace: > > <TASK> > > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > > print_address_description mm/kasan/report.c:378 [inline] > > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > [...] 
> > Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") > > Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com > > Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > --- > > net/ipv4/udp_bpf.c | 9 +++++++++ > > 1 file changed, 9 insertions(+) > > > > diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c > > index 779a3a03762f..b07774034a91 100644 > > --- a/net/ipv4/udp_bpf.c > > +++ b/net/ipv4/udp_bpf.c > > @@ -52,7 +52,9 @@ static int udp_msg_wait_data(struct sock *sk, struct sk_psock *psock, > > sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); > > ret = udp_msg_has_data(sk, psock); > > if (!ret) { > > + release_sock(sk); > > wait_woken(&wait, TASK_INTERRUPTIBLE, timeo); > > + lock_sock(sk); > > ret = udp_msg_has_data(sk, psock); > > } > > sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); > > @@ -81,6 +83,7 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > > goto out; > > } > > > > + lock_sock(sk); > > msg_bytes_ready: > > copied = sk_msg_recvmsg(sk, psock, msg, len, flags); > > if (!copied) { > > @@ -92,11 +95,17 @@ static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > > if (data) { > > if (psock_has_data(psock)) > > goto msg_bytes_ready; > > + > > + release_sock(sk); > > + > > ret = sk_udp_recvmsg(sk, msg, len, flags, addr_len); > > goto out; > > } > > copied = -EAGAIN; > > } > > + > > + release_sock(sk); > > + > > ret = copied; > > out: > > sk_psock_put(sk, psock); > > -- > > 2.53.0.371.g1d285c8824-goog > > > > LGTM. > > One minor observation for a possible follow-up: > psock_has_data() is checked before lock_sock(), which leaves a TOCTOU window. This is intentional. We don't want to hold lock_sock() just to check if the queue is empty, which can be done locklessly. Otherwise, you will introduce an unnecessary lock dance for every sk_udp_recvmsg() in case the psock queue is empty, which is too bad. 
> > If the BPF verdict path consumes the skb from the UDP receive queue
> > and redirects it to psock->ingress_msg between the check and
> > sk_udp_recvmsg(), the thread may block in __skb_recv_udp() while
> > data is already available in the psock.

And this is just a matter of timing. If the queue is already empty
before the first check, the thread is blocked in sk_udp_recvmsg()
anyway.

> > Moving lock_sock() before the psock_has_data() check (as the original
> > code did before 9f2470fbc4cb) would narrow this window, similar to
> > how __tcp_bpf_recvmsg() acquires lock_sock() upfront. But that can
> > be a separate patch.
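The fast-path argument above (peek at the queue locklessly, take the lock only when there is something to dequeue) can be sketched as follows. atomic_load() stands in for the lockless read that psock_has_data() performs, and the counter and names are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of udp_bpf_recvmsg()'s fast path: check the psock
 * queue length without the lock and fall through to the plain UDP
 * receive path when it is empty, instead of doing a lock/unlock dance
 * on every call. */

static _Atomic int psock_len;	/* stands in for the ingress_msg queue */
static int lock_acquisitions;	/* counts how often the slow path locks */

static bool recvmsg_model(void)
{
	/* Lockless peek. A message that lands right after this check is
	 * only a timing difference: it would equally have been missed
	 * had the caller arrived one instruction earlier. */
	if (atomic_load(&psock_len) == 0)
		return false;	/* sk_udp_recvmsg() path, no lock taken */

	/* Slow path: lock_sock(); sk_msg_recvmsg(); release_sock(); */
	lock_acquisitions++;
	atomic_fetch_sub(&psock_len, 1);
	return true;
}
```

The design choice being defended in the thread is visible in the counter: an empty psock queue never pays for the lock.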
* Re: [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima 2026-03-05 2:30 ` Jiayuan Chen @ 2026-03-05 11:36 ` Jiayuan Chen 2026-03-05 11:39 ` Jakub Sitnicki 2 siblings, 0 replies; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 11:36 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 On Sat, Feb 21, 2026 at 11:30:50PM +0800, Kuniyuki Iwashima wrote: > syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] > > sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, > but its processing is lockless. > > Thus, sk_msg_recvmsg() must be serialised by callers, otherwise > multiple threads could touch the same sk_msg. > > For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. > > Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited > commit accidentally removed it. > > Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). > > Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not > an option due to copy_page_to_iter() in sk_msg_recvmsg(). 
> > [0]: > BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 > > CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > Call Trace: > <TASK> > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:378 [inline] > print_report+0xba/0x230 mm/kasan/report.c:482 > kasan_report+0x117/0x150 mm/kasan/report.c:595 > sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > sock_recvmsg_nosec net/socket.c:1078 [inline] > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > do_recvmmsg+0x334/0x800 net/socket.c:2949 > __sys_recvmmsg net/socket.c:3023 [inline] > __do_sys_recvmmsg net/socket.c:3046 [inline] > __se_sys_recvmmsg net/socket.c:3039 [inline] > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7fb319f9aeb9 > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b > RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9 > RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004 > RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000 > R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 00007ffe21dd0a98 > </TASK> > > Allocated by task 6019: > kasan_save_stack 
mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > poison_kmalloc_redzone mm/kasan/common.c:398 [inline] > __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 > kasan_kmalloc include/linux/kasan.h:263 [inline] > __kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780 > kmalloc_noprof include/linux/slab.h:957 [inline] > kzalloc_noprof include/linux/slab.h:1094 [inline] > alloc_sk_msg net/core/skmsg.c:510 [inline] > sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612 > sk_psock_verdict_apply net/core/skmsg.c:1038 [inline] > sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236 > udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045 > sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257 > __udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789 > __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline] > udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475 > __udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585 > __udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724 > ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 > ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 > NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318 > dst_input include/net/dst.h:474 [inline] > ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584 > ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline] > ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644 > ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678 > __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline] > __netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242 > __netif_receive_skb_list net/core/dev.c:6294 [inline] > netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385 > netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437 > xdp_recv_frames net/bpf/test_run.c:269 [inline] > xdp_test_run_batch net/bpf/test_run.c:350 [inline] > bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379 > bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396 > bpf_prog_test_run+0x2c7/0x340 
kernel/bpf/syscall.c:4703 > __sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182 > __do_sys_bpf kernel/bpf/syscall.c:6274 [inline] > __se_sys_bpf kernel/bpf/syscall.c:6272 [inline] > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Freed by task 6021: > kasan_save_stack mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 > poison_slab_object mm/kasan/common.c:253 [inline] > __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 > kasan_slab_free include/linux/kasan.h:235 [inline] > slab_free_hook mm/slub.c:2540 [inline] > slab_free mm/slub.c:6674 [inline] > kfree+0x1be/0x650 mm/slub.c:6882 > kfree_sk_msg include/linux/skmsg.h:385 [inline] > sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483 > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > sock_recvmsg_nosec net/socket.c:1078 [inline] > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > do_recvmmsg+0x334/0x800 net/socket.c:2949 > __sys_recvmmsg net/socket.c:3023 [inline] > __do_sys_recvmmsg net/socket.c:3046 [inline] > __se_sys_recvmmsg net/socket.c:3039 [inline] > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") > Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> > --- > net/ipv4/udp_bpf.c | 9 +++++++++ > 1 file 
changed, 9 insertions(+) > > diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c > index 779a3a03762f..b07774034a91 100644 > --- a/net/ipv4/udp_bpf.c > +++ b/net/ipv4/udp_bpf.c > @@ -52,7 +52,9 @@ static int udp_msg_wait_data(struct sock *sk, struct sk_psock *psock, [...] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima 2026-03-05 2:30 ` Jiayuan Chen 2026-03-05 11:36 ` Jiayuan Chen @ 2026-03-05 11:39 ` Jakub Sitnicki 2026-03-05 17:46 ` Kuniyuki Iwashima 2 siblings, 1 reply; 31+ messages in thread From: Jakub Sitnicki @ 2026-03-05 11:39 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 On Sat, Feb 21, 2026 at 11:30 PM GMT, Kuniyuki Iwashima wrote: > syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] > > sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, > but its processing is lockless. > > Thus, sk_msg_recvmsg() must be serialised by callers, otherwise > multiple threads could touch the same sk_msg. > > For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. > > Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited > commit accidentally removed it. FWIW, it doesn't sound like commit 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") removed it by accident. The commit message calls it out explicitly: Also, UDP does not lock the sock during BH Rx path, it makes no sense for its ->recvmsg() to lock the sock. It is always possible for ->recvmsg() to be called before packets actually arrive in the receive queue, we just use best effort to make it accurate here. Looks like we just didn't understand the consequences at that time. > > Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). > > Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not > an option due to copy_page_to_iter() in sk_msg_recvmsg(). 
> > [0]: > BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 > > CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > Call Trace: > <TASK> > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:378 [inline] > print_report+0xba/0x230 mm/kasan/report.c:482 > kasan_report+0x117/0x150 mm/kasan/report.c:595 > sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > sock_recvmsg_nosec net/socket.c:1078 [inline] > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > do_recvmmsg+0x334/0x800 net/socket.c:2949 > __sys_recvmmsg net/socket.c:3023 [inline] > __do_sys_recvmmsg net/socket.c:3046 [inline] > __se_sys_recvmmsg net/socket.c:3039 [inline] > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7fb319f9aeb9 > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b > RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9 > RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004 > RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000 > R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 00007ffe21dd0a98 > </TASK> > > Allocated by task 6019: > kasan_save_stack 
mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > poison_kmalloc_redzone mm/kasan/common.c:398 [inline] > __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 > kasan_kmalloc include/linux/kasan.h:263 [inline] > __kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780 > kmalloc_noprof include/linux/slab.h:957 [inline] > kzalloc_noprof include/linux/slab.h:1094 [inline] > alloc_sk_msg net/core/skmsg.c:510 [inline] > sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612 > sk_psock_verdict_apply net/core/skmsg.c:1038 [inline] > sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236 > udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045 > sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257 > __udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789 > __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline] > udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475 > __udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585 > __udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724 > ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 > ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 > NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318 > dst_input include/net/dst.h:474 [inline] > ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584 > ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline] > ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644 > ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678 > __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline] > __netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242 > __netif_receive_skb_list net/core/dev.c:6294 [inline] > netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385 > netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437 > xdp_recv_frames net/bpf/test_run.c:269 [inline] > xdp_test_run_batch net/bpf/test_run.c:350 [inline] > bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379 > bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396 > bpf_prog_test_run+0x2c7/0x340 
kernel/bpf/syscall.c:4703 > __sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182 > __do_sys_bpf kernel/bpf/syscall.c:6274 [inline] > __se_sys_bpf kernel/bpf/syscall.c:6272 [inline] > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Freed by task 6021: > kasan_save_stack mm/kasan/common.c:57 [inline] > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 > poison_slab_object mm/kasan/common.c:253 [inline] > __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 > kasan_slab_free include/linux/kasan.h:235 [inline] > slab_free_hook mm/slub.c:2540 [inline] > slab_free mm/slub.c:6674 [inline] > kfree+0x1be/0x650 mm/slub.c:6882 > kfree_sk_msg include/linux/skmsg.h:385 [inline] > sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483 > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > sock_recvmsg_nosec net/socket.c:1078 [inline] > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > do_recvmmsg+0x334/0x800 net/socket.c:2949 > __sys_recvmmsg net/socket.c:3023 [inline] > __do_sys_recvmmsg net/socket.c:3046 [inline] > __se_sys_recvmmsg net/socket.c:3039 [inline] > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") > Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> ^ permalink raw reply [flat|nested] 31+ 
messages in thread
* Re: [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg(). 2026-03-05 11:39 ` Jakub Sitnicki @ 2026-03-05 17:46 ` Kuniyuki Iwashima 0 siblings, 0 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-05 17:46 UTC (permalink / raw) To: Jakub Sitnicki Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+9307c991a6d07ce6e6d8 On Thu, Mar 5, 2026 at 3:39 AM Jakub Sitnicki <jakub@cloudflare.com> wrote: > > On Sat, Feb 21, 2026 at 11:30 PM GMT, Kuniyuki Iwashima wrote: > > syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] > > > > sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, > > but its processing is lockless. > > > > Thus, sk_msg_recvmsg() must be serialised by callers, otherwise > > multiple threads could touch the same sk_msg. > > > > For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. > > > > Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited > > commit accidentally removed it. > > FWIW, it doesn't sound like commit 9f2470fbc4cb ("skmsg: Improve > udp_bpf_recvmsg() accuracy") removed it by accident. The commit message > calls it out explicitly: > > Also, UDP does not lock the sock during BH Rx path, it makes > no sense for its ->recvmsg() to lock the sock. It is always > possible for ->recvmsg() to be called before packets actually > arrive in the receive queue, we just use best effort to make > it accurate here. > > Looks like we just didn't understand the consequences at that time. I meant that's the accident :) > by mistake https://dictionary.cambridge.org/us/dictionary/english/accidentally > > > > > Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). > > > > Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not > > an option due to copy_page_to_iter() in sk_msg_recvmsg(). 
> > > > [0]: > > BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > > Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 > > > > CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > > Call Trace: > > <TASK> > > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 > > print_address_description mm/kasan/report.c:378 [inline] > > print_report+0xba/0x230 mm/kasan/report.c:482 > > kasan_report+0x117/0x150 mm/kasan/report.c:595 > > sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 > > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > > sock_recvmsg_nosec net/socket.c:1078 [inline] > > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > > do_recvmmsg+0x334/0x800 net/socket.c:2949 > > __sys_recvmmsg net/socket.c:3023 [inline] > > __do_sys_recvmmsg net/socket.c:3046 [inline] > > __se_sys_recvmmsg net/socket.c:3039 [inline] > > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > RIP: 0033:0x7fb319f9aeb9 > > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 > > RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b > > RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9 > > RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004 > > RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000 > > R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000 > > R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 
00007ffe21dd0a98 > > </TASK> > > > > Allocated by task 6019: > > kasan_save_stack mm/kasan/common.c:57 [inline] > > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > > poison_kmalloc_redzone mm/kasan/common.c:398 [inline] > > __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 > > kasan_kmalloc include/linux/kasan.h:263 [inline] > > __kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780 > > kmalloc_noprof include/linux/slab.h:957 [inline] > > kzalloc_noprof include/linux/slab.h:1094 [inline] > > alloc_sk_msg net/core/skmsg.c:510 [inline] > > sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612 > > sk_psock_verdict_apply net/core/skmsg.c:1038 [inline] > > sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236 > > udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045 > > sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257 > > __udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789 > > __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline] > > udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475 > > __udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585 > > __udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724 > > ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 > > ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 > > NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318 > > dst_input include/net/dst.h:474 [inline] > > ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584 > > ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline] > > ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644 > > ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678 > > __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline] > > __netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242 > > __netif_receive_skb_list net/core/dev.c:6294 [inline] > > netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385 > > netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437 > > xdp_recv_frames net/bpf/test_run.c:269 [inline] > > xdp_test_run_batch net/bpf/test_run.c:350 [inline] > > 
bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379 > > bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396 > > bpf_prog_test_run+0x2c7/0x340 kernel/bpf/syscall.c:4703 > > __sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182 > > __do_sys_bpf kernel/bpf/syscall.c:6274 [inline] > > __se_sys_bpf kernel/bpf/syscall.c:6272 [inline] > > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272 > > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > Freed by task 6021: > > kasan_save_stack mm/kasan/common.c:57 [inline] > > kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 > > kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 > > poison_slab_object mm/kasan/common.c:253 [inline] > > __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 > > kasan_slab_free include/linux/kasan.h:235 [inline] > > slab_free_hook mm/slub.c:2540 [inline] > > slab_free mm/slub.c:6674 [inline] > > kfree+0x1be/0x650 mm/slub.c:6882 > > kfree_sk_msg include/linux/skmsg.h:385 [inline] > > sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483 > > udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 > > inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 > > sock_recvmsg_nosec net/socket.c:1078 [inline] > > sock_recvmsg+0x1a8/0x270 net/socket.c:1100 > > ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 > > ___sys_recvmsg+0x215/0x590 net/socket.c:2854 > > do_recvmmsg+0x334/0x800 net/socket.c:2949 > > __sys_recvmmsg net/socket.c:3023 [inline] > > __do_sys_recvmmsg net/socket.c:3046 [inline] > > __se_sys_recvmmsg net/socket.c:3039 [inline] > > __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 > > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > > do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") > > Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com > > Closes: 
https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > --- > > Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Thanks ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg(). 2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima ` (2 preceding siblings ...) 2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima @ 2026-02-21 23:30 ` Kuniyuki Iwashima 2026-03-05 11:44 ` Jakub Sitnicki 2026-02-21 23:30 ` [PATCH v4 bpf/net 5/6] sockmap: Consolidate sk_psock_skb_ingress_self() Kuniyuki Iwashima 2026-02-21 23:30 ` [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP Kuniyuki Iwashima 5 siblings, 1 reply; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw) To: John Fastabend, Jakub Sitnicki Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev SOCKMAP memory accounting for UDP is broken for two reasons: 1. sk->sk_forward_alloc must be changed under spin_lock_bh(&sk->sk_receive_queue.lock) 2. sk_psock_skb_ingress_self() should not be used for UDP since UDP may reclaim sk->sk_forward_alloc partially before passing skb to sockmap, resulting in a negative sk->sk_forward_alloc This is a prep commit to consolidate sk_psock_skb_ingress_self() and centralise the fix to sk_psock_skb_ingress(). Let's inline sk_psock_create_ingress_msg(). Note that now alloc_sk_msg() is called first. 
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- v4: Don't pass gfp_flags and instead change it based on take_ref in the next patch --- net/core/skmsg.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 3d7eb2f4ac98..6b1fef6ef85b 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -529,18 +529,6 @@ static struct sk_msg *alloc_sk_msg(gfp_t gfp) return msg; } -static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, - struct sk_buff *skb) -{ - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - return NULL; - - if (!sk_rmem_schedule(sk, skb, skb->truesize)) - return NULL; - - return alloc_sk_msg(GFP_KERNEL); -} - static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, u32 off, u32 len, struct sk_psock *psock, @@ -592,7 +580,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, { struct sock *sk = psock->sk; struct sk_msg *msg; - int err; + int err = -EAGAIN; /* If we are receiving on the same sock skb->sk is already assigned, * skip memory accounting and owner transition seeing it already set @@ -600,9 +588,16 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, */ if (unlikely(skb->sk == sk)) return sk_psock_skb_ingress_self(psock, skb, off, len, true); - msg = sk_psock_create_ingress_msg(sk, skb); + + msg = alloc_sk_msg(GFP_KERNEL); if (!msg) - return -EAGAIN; + goto out; + + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + goto free; + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) + goto free; /* This will transition ownership of the data from the socket where * the BPF program was run initiating the redirect to the socket @@ -613,8 +608,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, skb_set_owner_r(skb, sk); err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true); if (err < 0) - kfree(msg); + goto free; +out: return err; +free: + 
kfree(msg); + goto out; } /* Puts an skb on the ingress queue of the socket already assigned to the -- 2.53.0.371.g1d285c8824-goog ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg(). 2026-02-21 23:30 ` [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg() Kuniyuki Iwashima @ 2026-03-05 11:44 ` Jakub Sitnicki 0 siblings, 0 replies; 31+ messages in thread From: Jakub Sitnicki @ 2026-03-05 11:44 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev On Sat, Feb 21, 2026 at 11:30 PM GMT, Kuniyuki Iwashima wrote: > SOCKMAP memory accounting for UDP is broken for two reasons: > > 1. sk->sk_forward_alloc must be changed under > spin_lock_bh(&sk->sk_receive_queue.lock) > > 2. sk_psock_skb_ingress_self() should not be used for UDP > since UDP may reclaim sk->sk_forward_alloc partially > before passing skb to sockmap, resulting in a negative > sk->sk_forward_alloc > > This is a prep commit to consolidate sk_psock_skb_ingress_self() > and centralise the fix to sk_psock_skb_ingress(). > > Let's inline sk_psock_create_ingress_msg(). > > Note that now alloc_sk_msg() is called first. > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- > v4: Don't pass gfp_flags and instead change it based on > take_ref in the next patch > --- Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 bpf/net 5/6] sockmap: Consolidate sk_psock_skb_ingress_self(). 2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima ` (3 preceding siblings ...) 2026-02-21 23:30 ` [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg() Kuniyuki Iwashima @ 2026-02-21 23:30 ` Kuniyuki Iwashima 2026-02-21 23:30 ` [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP Kuniyuki Iwashima 5 siblings, 0 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw) To: John Fastabend, Jakub Sitnicki Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev SOCKMAP memory accounting for UDP is broken, and sk_psock_skb_ingress_self() should not be used for UDP. Let's consolidate sk_psock_skb_ingress_self() to sk_psock_skb_ingress() so we can centralise the fix. Note that now sk_psock_skb_ingress() always use GFP_KERNEL if called from sk_psock_backlog(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- v4: Fix memory accounting condition from (skb->sk != sk) to (skb->sk != sk && take_ref), the cause of selftest failure v2: Keep msg->sk assignment --- net/core/skmsg.c | 62 +++++++++++++----------------------------------- 1 file changed, 17 insertions(+), 45 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 6b1fef6ef85b..96f43e0dbb17 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -572,32 +572,30 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, return copied; } -static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len, bool take_ref); - static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len) + u32 off, u32 len, bool take_ref) { struct sock *sk = psock->sk; struct sk_msg *msg; int err = -EAGAIN; - /* If we are receiving on the same sock skb->sk is already assigned, - * skip memory accounting and owner transition seeing 
it already set - * correctly. - */ - if (unlikely(skb->sk == sk)) - return sk_psock_skb_ingress_self(psock, skb, off, len, true); - - msg = alloc_sk_msg(GFP_KERNEL); + msg = alloc_sk_msg(take_ref ? GFP_KERNEL : GFP_ATOMIC); if (!msg) goto out; - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - goto free; + if (skb->sk != sk && take_ref) { + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + goto free; - if (!sk_rmem_schedule(sk, skb, skb->truesize)) - goto free; + if (!sk_rmem_schedule(sk, skb, skb->truesize)) + goto free; + } else { + /* This is used in tcp_bpf_recvmsg_parser() to determine whether the + * data originates from the socket's own protocol stack. No need to + * refcount sk because msg's lifetime is bound to sk via the ingress_msg. + */ + msg->sk = sk; + } /* This will transition ownership of the data from the socket where * the BPF program was run initiating the redirect to the socket @@ -606,7 +604,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, * into user buffers. */ skb_set_owner_r(skb, sk); - err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true); + err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); if (err < 0) goto free; out: @@ -616,32 +614,6 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, goto out; } -/* Puts an skb on the ingress queue of the socket already assigned to the - * skb. In this case we do not need to check memory limits or skb_set_owner_r - * because the skb is already accounted for here. - */ -static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len, bool take_ref) -{ - struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC); - struct sock *sk = psock->sk; - int err; - - if (unlikely(!msg)) - return -EAGAIN; - skb_set_owner_r(skb, sk); - - /* This is used in tcp_bpf_recvmsg_parser() to determine whether the - * data originates from the socket's own protocol stack. 
No need to - * refcount sk because msg's lifetime is bound to sk via the ingress_msg. - */ - msg->sk = sk; - err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); - if (err < 0) - kfree(msg); - return err; -} - static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, u32 off, u32 len, bool ingress) { @@ -651,7 +623,7 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, return skb_send_sock(psock->sk, skb, off, len); } - return sk_psock_skb_ingress(psock, skb, off, len); + return sk_psock_skb_ingress(psock, skb, off, len, true); } static void sk_psock_skb_state(struct sk_psock *psock, @@ -1058,7 +1030,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, off = stm->offset; len = stm->full_len; } - err = sk_psock_skb_ingress_self(psock, skb, off, len, false); + err = sk_psock_skb_ingress(psock, skb, off, len, false); } if (err < 0) { spin_lock_bh(&psock->ingress_lock); -- 2.53.0.371.g1d285c8824-goog ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima ` (4 preceding siblings ...) 2026-02-21 23:30 ` [PATCH v4 bpf/net 5/6] sockmap: Consolidate sk_psock_skb_ingress_self() Kuniyuki Iwashima @ 2026-02-21 23:30 ` Kuniyuki Iwashima 2026-03-04 20:04 ` Martin KaFai Lau 2026-03-05 6:37 ` Jiayuan Chen 5 siblings, 2 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-02-21 23:30 UTC (permalink / raw) To: John Fastabend, Jakub Sitnicki Cc: Willem de Bruijn, Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a syzbot reported imbalanced sk->sk_forward_alloc [0] and demonstrated that UDP memory accounting by SOCKMAP is broken. The repro put a UDP sk into SOCKMAP and redirected skb to itself, where skb->truesize was 4240. First, udp_rmem_schedule() set sk->sk_forward_alloc to 8192 (2 * PAGE_SIZE), and skb->truesize was charged: sk->sk_forward_alloc = 0 + 8192 - 4240; // => 3952 Then, udp_read_skb() dequeued the skb by skb_recv_udp(), which finally calls udp_rmem_release() and _partially_ reclaims sk->sk_forward_alloc because skb->truesize was larger than PAGE_SIZE: sk->sk_forward_alloc += 4240; // => 8192 (PAGE_SIZE is reclaimable) sk->sk_forward_alloc -= 4096; // => 4096 Later, sk_psock_skb_ingress_self() called skb_set_owner_r() to charge the skb again, triggering an sk->sk_forward_alloc underflow: sk->sk_forward_alloc -= 4240 // => -144 Another problem is that UDP memory accounting is not performed under spin_lock_bh(&sk->sk_receive_queue.lock). skb_set_owner_r() and sock_rfree() are called locklessly and corrupt sk->sk_forward_alloc, leading to the splat. Let's not skip memory accounting for UDP and ensure the proper lock is held. Note that UDP does not need msg->sk assignment, which is for TCP. 
[0]: WARNING: net/ipv4/af_inet.c:157 at inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157, CPU#0: ksoftirqd/0/15 Modules linked in: CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 RIP: 0010:inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157 Code: 0f 0b 90 e9 58 fe ff ff e8 40 55 b3 f7 90 0f 0b 90 e9 8b fe ff ff e8 32 55 b3 f7 90 0f 0b 90 e9 b1 fe ff ff e8 24 55 b3 f7 90 <0f> 0b 90 e9 d7 fe ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 95 fc RSP: 0018:ffffc90000147a48 EFLAGS: 00010246 RAX: ffffffff8a1121dc RBX: dffffc0000000000 RCX: ffff88801d2c3d00 RDX: 0000000000000100 RSI: 0000000000000f70 RDI: 0000000000000000 RBP: 0000000000000f70 R08: ffff888030ce1327 R09: 1ffff1100619c264 R10: dffffc0000000000 R11: ffffed100619c265 R12: ffff888030ce1080 R13: dffffc0000000000 R14: ffff888030ce130c R15: ffffffff8fa87e00 FS: 0000000000000000(0000) GS:ffff8881256f8000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000200000000700 CR3: 000000007200c000 CR4: 00000000003526f0 Call Trace: <TASK> __sk_destruct+0x85/0x880 net/core/sock.c:2350 rcu_do_batch kernel/rcu/tree.c:2605 [inline] rcu_core+0xc9e/0x1750 kernel/rcu/tree.c:2857 handle_softirqs+0x22a/0x7c0 kernel/softirq.c:622 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1063 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160 kthread+0x726/0x8b0 kernel/kthread.c:463 ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> --- v4: Fix deadlock when requeued v2: Fix build failure when CONFIG_INET=n --- include/net/udp.h | 9 +++++++++ net/core/skmsg.c | 
29 +++++++++++++++++++++++++---- net/ipv4/udp.c | 9 +++++++++ 3 files changed, 43 insertions(+), 4 deletions(-) diff --git a/include/net/udp.h b/include/net/udp.h index 700dbedcb15f..ae38a4da9388 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, struct sk_buff *skb); struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport); + +#ifdef CONFIG_INET +void udp_sock_rfree(struct sk_buff *skb); +#else +static inline void udp_sock_rfree(struct sk_buff *skb) +{ +} +#endif + int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); /* UDP uses skb->dev_scratch to cache as much information as possible and avoid diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 96f43e0dbb17..dd9134a45663 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -7,6 +7,7 @@ #include <net/sock.h> #include <net/tcp.h> +#include <net/udp.h> #include <net/tls.h> #include <trace/events/sock.h> @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, u32 off, u32 len, bool take_ref) { struct sock *sk = psock->sk; + bool is_udp = sk_is_udp(sk); struct sk_msg *msg; int err = -EAGAIN; @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, if (!msg) goto out; - if (skb->sk != sk && take_ref) { + if (is_udp) { + if (unlikely(skb->destructor == udp_sock_rfree)) + goto enqueue; + + spin_lock_bh(&sk->sk_receive_queue.lock); + } + + if (is_udp || (skb->sk != sk && take_ref)) { if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - goto free; + goto unlock; if (!sk_rmem_schedule(sk, skb, skb->truesize)) - goto free; - } else { + goto unlock; + } else if (skb->sk == sk || !take_ref) { /* This is used in tcp_bpf_recvmsg_parser() to determine whether the * data originates from the socket's own protocol stack. No need to * refcount sk because msg's lifetime is bound to sk via the ingress_msg. 
@@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, * into user buffers. */ skb_set_owner_r(skb, sk); + + if (is_udp) { + spin_unlock_bh(&sk->sk_receive_queue.lock); + + skb->destructor = udp_sock_rfree; + } + +enqueue: err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); if (err < 0) goto free; out: return err; + +unlock: + if (is_udp) + spin_unlock_bh(&sk->sk_receive_queue.lock); free: kfree(msg); goto out; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 422c96fea249..831d26748a90 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, } EXPORT_SYMBOL(__skb_recv_udp); +void udp_sock_rfree(struct sk_buff *skb) +{ + struct sock *sk = skb->sk; + + spin_lock_bh(&sk->sk_receive_queue.lock); + sock_rfree(skb); + spin_unlock_bh(&sk->sk_receive_queue.lock); +} + int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { struct sk_buff *skb; -- 2.53.0.371.g1d285c8824-goog ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-02-21 23:30 ` [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP Kuniyuki Iwashima @ 2026-03-04 20:04 ` Martin KaFai Lau 2026-03-04 20:14 ` Kuniyuki Iwashima 2026-03-05 6:37 ` Jiayuan Chen 1 sibling, 1 reply; 31+ messages in thread From: Martin KaFai Lau @ 2026-03-04 20:04 UTC (permalink / raw) To: Kuniyuki Iwashima, John Fastabend, Jakub Sitnicki, Jiayuan Chen, Cong Wang Cc: Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On 2/21/26 3:30 PM, Kuniyuki Iwashima wrote: > syzbot reported imbalanced sk->sk_forward_alloc [0] and > demonstrated that UDP memory accounting by SOCKMAP is broken. > > The repro put a UDP sk into SOCKMAP and redirected skb to itself, > where skb->truesize was 4240. > > First, udp_rmem_schedule() set sk->sk_forward_alloc to 8192 > (2 * PAGE_SIZE), and skb->truesize was charged: > > sk->sk_forward_alloc = 0 + 8192 - 4240; // => 3952 > > Then, udp_read_skb() dequeued the skb by skb_recv_udp(), which finally > calls udp_rmem_release() and _partially_ reclaims sk->sk_forward_alloc > because skb->truesize was larger than PAGE_SIZE: > > sk->sk_forward_alloc += 4240; // => 8192 (PAGE_SIZE is reclaimable) > sk->sk_forward_alloc -= 4096; // => 4096 > > Later, sk_psock_skb_ingress_self() called skb_set_owner_r() to > charge the skb again, triggering an sk->sk_forward_alloc underflow: > > sk->sk_forward_alloc -= 4240 // => -144 > > Another problem is that UDP memory accounting is not performed > under spin_lock_bh(&sk->sk_receive_queue.lock). > > skb_set_owner_r() and sock_rfree() are called locklessly and > corrupt sk->sk_forward_alloc, leading to the splat. > > Let's not skip memory accounting for UDP and ensure the proper > lock is held. > > Note that UDP does not need msg->sk assignment, which is for TCP. 
> > [0]: > WARNING: net/ipv4/af_inet.c:157 at inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157, CPU#0: ksoftirqd/0/15 > Modules linked in: > CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > RIP: 0010:inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157 > Code: 0f 0b 90 e9 58 fe ff ff e8 40 55 b3 f7 90 0f 0b 90 e9 8b fe ff ff e8 32 55 b3 f7 90 0f 0b 90 e9 b1 fe ff ff e8 24 55 b3 f7 90 <0f> 0b 90 e9 d7 fe ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 95 fc > RSP: 0018:ffffc90000147a48 EFLAGS: 00010246 > RAX: ffffffff8a1121dc RBX: dffffc0000000000 RCX: ffff88801d2c3d00 > RDX: 0000000000000100 RSI: 0000000000000f70 RDI: 0000000000000000 > RBP: 0000000000000f70 R08: ffff888030ce1327 R09: 1ffff1100619c264 > R10: dffffc0000000000 R11: ffffed100619c265 R12: ffff888030ce1080 > R13: dffffc0000000000 R14: ffff888030ce130c R15: ffffffff8fa87e00 > FS: 0000000000000000(0000) GS:ffff8881256f8000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000200000000700 CR3: 000000007200c000 CR4: 00000000003526f0 > Call Trace: > <TASK> > __sk_destruct+0x85/0x880 net/core/sock.c:2350 > rcu_do_batch kernel/rcu/tree.c:2605 [inline] > rcu_core+0xc9e/0x1750 kernel/rcu/tree.c:2857 > handle_softirqs+0x22a/0x7c0 kernel/softirq.c:622 > run_ksoftirqd+0x36/0x60 kernel/softirq.c:1063 > smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160 > kthread+0x726/0x8b0 kernel/kthread.c:463 > ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 > </TASK> > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") Cc: Cong Wang, who is the author of the quoted commit and also the author of the full UDP support in sockmap. JakubS and JohnF who are the maintainers of sockmap and skmsg. Please take a look also. Jiayuan Chen, you had submitted patches(/fixes) for sockmap/skmsg. 
It is a good chance to gain review credit. > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 96f43e0dbb17..dd9134a45663 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -7,6 +7,7 @@ > > #include <net/sock.h> > #include <net/tcp.h> > +#include <net/udp.h> > #include <net/tls.h> > #include <trace/events/sock.h> > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > u32 off, u32 len, bool take_ref) > { > struct sock *sk = psock->sk; > + bool is_udp = sk_is_udp(sk); > struct sk_msg *msg; > int err = -EAGAIN; > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > if (!msg) > goto out; > > - if (skb->sk != sk && take_ref) { > + if (is_udp) { > + if (unlikely(skb->destructor == udp_sock_rfree)) A quick question. This case happens when the earlier sk_psock_skb_ingress_enqueue() failed? > + goto enqueue; > + > + spin_lock_bh(&sk->sk_receive_queue.lock); > + } > + > + if (is_udp || (skb->sk != sk && take_ref)) { > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > - goto free; > + goto unlock; > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > - goto free; > - } else { > + goto unlock; > + } else if (skb->sk == sk || !take_ref) { > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > * data originates from the socket's own protocol stack. No need to > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > * into user buffers. 
> */ > skb_set_owner_r(skb, sk); > + > + if (is_udp) { > + spin_unlock_bh(&sk->sk_receive_queue.lock); > + > + skb->destructor = udp_sock_rfree; > + } > + > +enqueue: > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > if (err < 0) > goto free; > out: > return err; > + > +unlock: > + if (is_udp) > + spin_unlock_bh(&sk->sk_receive_queue.lock); > free: > kfree(msg); > goto out; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-04 20:04 ` Martin KaFai Lau @ 2026-03-04 20:14 ` Kuniyuki Iwashima 0 siblings, 0 replies; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-04 20:14 UTC (permalink / raw) To: Martin KaFai Lau Cc: John Fastabend, Jakub Sitnicki, Jiayuan Chen, Cong Wang, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On Wed, Mar 4, 2026 at 12:04 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 2/21/26 3:30 PM, Kuniyuki Iwashima wrote: > > syzbot reported imbalanced sk->sk_forward_alloc [0] and > > demonstrated that UDP memory accounting by SOCKMAP is broken. > > > > The repro put a UDP sk into SOCKMAP and redirected skb to itself, > > where skb->truesize was 4240. > > > > First, udp_rmem_schedule() set sk->sk_forward_alloc to 8192 > > (2 * PAGE_SIZE), and skb->truesize was charged: > > > > sk->sk_forward_alloc = 0 + 8192 - 4240; // => 3952 > > > > Then, udp_read_skb() dequeued the skb by skb_recv_udp(), which finally > > calls udp_rmem_release() and _partially_ reclaims sk->sk_forward_alloc > > because skb->truesize was larger than PAGE_SIZE: > > > > sk->sk_forward_alloc += 4240; // => 8192 (PAGE_SIZE is reclaimable) > > sk->sk_forward_alloc -= 4096; // => 4096 > > > > Later, sk_psock_skb_ingress_self() called skb_set_owner_r() to > > charge the skb again, triggering an sk->sk_forward_alloc underflow: > > > > sk->sk_forward_alloc -= 4240 // => -144 > > > > Another problem is that UDP memory accounting is not performed > > under spin_lock_bh(&sk->sk_receive_queue.lock). > > > > skb_set_owner_r() and sock_rfree() are called locklessly and > > corrupt sk->sk_forward_alloc, leading to the splat. > > > > Let's not skip memory accounting for UDP and ensure the proper > > lock is held. > > > > Note that UDP does not need msg->sk assignment, which is for TCP. 
> > > > [0]: > > WARNING: net/ipv4/af_inet.c:157 at inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157, CPU#0: ksoftirqd/0/15 > > Modules linked in: > > CPU: 0 UID: 0 PID: 15 Comm: ksoftirqd/0 Not tainted syzkaller #0 PREEMPT(full) > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 > > RIP: 0010:inet_sock_destruct+0x62d/0x740 net/ipv4/af_inet.c:157 > > Code: 0f 0b 90 e9 58 fe ff ff e8 40 55 b3 f7 90 0f 0b 90 e9 8b fe ff ff e8 32 55 b3 f7 90 0f 0b 90 e9 b1 fe ff ff e8 24 55 b3 f7 90 <0f> 0b 90 e9 d7 fe ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 95 fc > > RSP: 0018:ffffc90000147a48 EFLAGS: 00010246 > > RAX: ffffffff8a1121dc RBX: dffffc0000000000 RCX: ffff88801d2c3d00 > > RDX: 0000000000000100 RSI: 0000000000000f70 RDI: 0000000000000000 > > RBP: 0000000000000f70 R08: ffff888030ce1327 R09: 1ffff1100619c264 > > R10: dffffc0000000000 R11: ffffed100619c265 R12: ffff888030ce1080 > > R13: dffffc0000000000 R14: ffff888030ce130c R15: ffffffff8fa87e00 > > FS: 0000000000000000(0000) GS:ffff8881256f8000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000200000000700 CR3: 000000007200c000 CR4: 00000000003526f0 > > Call Trace: > > <TASK> > > __sk_destruct+0x85/0x880 net/core/sock.c:2350 > > rcu_do_batch kernel/rcu/tree.c:2605 [inline] > > rcu_core+0xc9e/0x1750 kernel/rcu/tree.c:2857 > > handle_softirqs+0x22a/0x7c0 kernel/softirq.c:622 > > run_ksoftirqd+0x36/0x60 kernel/softirq.c:1063 > > smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160 > > kthread+0x726/0x8b0 kernel/kthread.c:463 > > ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158 > > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 > > </TASK> > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > Cc: Cong Wang, who is the author of the quoted commit and also the > author of the full UDP support in sockmap. > > JakubS and JohnF who are the maintainers of sockmap and skmsg. Please > take a look also. 
> > Jiayuan Chen, you had submitted patches(/fixes) for sockmap/skmsg. It is > a good chance to gain review credit. > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > index 96f43e0dbb17..dd9134a45663 100644 > > --- a/net/core/skmsg.c > > +++ b/net/core/skmsg.c > > @@ -7,6 +7,7 @@ > > > > #include <net/sock.h> > > #include <net/tcp.h> > > +#include <net/udp.h> > > #include <net/tls.h> > > #include <trace/events/sock.h> > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > u32 off, u32 len, bool take_ref) > > { > > struct sock *sk = psock->sk; > > + bool is_udp = sk_is_udp(sk); > > struct sk_msg *msg; > > int err = -EAGAIN; > > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > if (!msg) > > goto out; > > > > - if (skb->sk != sk && take_ref) { > > + if (is_udp) { > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > A quick question. This case happens when the earlier > sk_psock_skb_ingress_enqueue() failed? Yes, the skb is charged by skb_set_owner_r() and the lock is released, so I left skb->destructor as is and avoid extra locking when sk_psock_skb_ingress_enqueue() fails. > > > + goto enqueue; > > + > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > + } > > + > > + if (is_udp || (skb->sk != sk && take_ref)) { > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > - goto free; > > + goto unlock; > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > - goto free; > > - } else { > > + goto unlock; > > + } else if (skb->sk == sk || !take_ref) { > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > * data originates from the socket's own protocol stack. No need to > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > * into user buffers. 
> > */ > > skb_set_owner_r(skb, sk); > > + > > + if (is_udp) { > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > + > > + skb->destructor = udp_sock_rfree; > > + } > > + > > +enqueue: > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > if (err < 0) > > goto free; > > out: > > return err; > > + > > +unlock: > > + if (is_udp) > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > free: > > kfree(msg); > > goto out; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-02-21 23:30 ` [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP Kuniyuki Iwashima 2026-03-04 20:04 ` Martin KaFai Lau @ 2026-03-05 6:37 ` Jiayuan Chen 2026-03-05 7:48 ` Kuniyuki Iwashima 1 sibling, 1 reply; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 6:37 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: [...] > </TASK> > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > --- > v4: Fix deadlock when requeued > v2: Fix build failure when CONFIG_INET=n > --- > include/net/udp.h | 9 +++++++++ > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > net/ipv4/udp.c | 9 +++++++++ > 3 files changed, 43 insertions(+), 4 deletions(-) > > diff --git a/include/net/udp.h b/include/net/udp.h > index 700dbedcb15f..ae38a4da9388 100644 > --- a/include/net/udp.h > +++ b/include/net/udp.h > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > struct sk_buff *skb); > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > __be16 sport, __be16 dport); > + > +#ifdef CONFIG_INET > +void udp_sock_rfree(struct sk_buff *skb); > +#else > +static inline void udp_sock_rfree(struct sk_buff *skb) > +{ > +} > +#endif > + > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > /* UDP uses skb->dev_scratch to cache as much information as possible and avoid > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 96f43e0dbb17..dd9134a45663 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -7,6 +7,7 @@ > > #include <net/sock.h> > #include 
<net/tcp.h> > +#include <net/udp.h> > #include <net/tls.h> > #include <trace/events/sock.h> > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > u32 off, u32 len, bool take_ref) > { > struct sock *sk = psock->sk; > + bool is_udp = sk_is_udp(sk); > struct sk_msg *msg; > int err = -EAGAIN; > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > if (!msg) > goto out; > > - if (skb->sk != sk && take_ref) { > + if (is_udp) { > + if (unlikely(skb->destructor == udp_sock_rfree)) > + goto enqueue; > + > + spin_lock_bh(&sk->sk_receive_queue.lock); > + } > + > + if (is_udp || (skb->sk != sk && take_ref)) { > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > - goto free; > + goto unlock; > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > - goto free; > - } else { > + goto unlock; > + } else if (skb->sk == sk || !take_ref) { > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > * data originates from the socket's own protocol stack. No need to > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > * into user buffers. 
> */ > skb_set_owner_r(skb, sk); > + > + if (is_udp) { > + spin_unlock_bh(&sk->sk_receive_queue.lock); > + > + skb->destructor = udp_sock_rfree; > + } > + > +enqueue: > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > if (err < 0) > goto free; > out: > return err; > + > +unlock: > + if (is_udp) > + spin_unlock_bh(&sk->sk_receive_queue.lock); > free: > kfree(msg); > goto out; > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 422c96fea249..831d26748a90 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > } > EXPORT_SYMBOL(__skb_recv_udp); > > +void udp_sock_rfree(struct sk_buff *skb) > +{ > + struct sock *sk = skb->sk; > + > + spin_lock_bh(&sk->sk_receive_queue.lock); > + sock_rfree(skb); > + spin_unlock_bh(&sk->sk_receive_queue.lock); > +} > + > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > { > struct sk_buff *skb; > -- > 2.53.0.371.g1d285c8824-goog > There are too many protocol-specific checks in the generic skmsg code here. This should be abstracted out. Also TCP also has a similar issue with sk_forward_alloc concurrency between the backlog charge and recvmsg uncharge paths so the abstraction is necessary. I've put together a simple diff based on your patch for reference. I haven't tested it thoroughly, but it at least handles TCP and UDP separately through a callback, keeping the generic code clean. diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 829b281d6c9c..d9a607e11cd1 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -80,6 +80,20 @@ struct sk_psock_work_state { u32 off; }; +struct sk_psock; + +/* + * Callback for protocol-specific memory accounting on the ingress path. + * Each protocol uses different locks to protect sk->sk_forward_alloc: + * TCP: psock->ingress_lock + * UDP: sk->sk_receive_queue.lock + * This callback encapsulates the protocol-specific locking and charging. 
+ * + * Returns 0 on success, negative errno on failure. + */ +typedef int (*psock_ingress_charge_t)(struct sk_psock *psock, struct sock *sk, + struct sk_buff *skb, struct sk_msg *msg); + struct sk_psock { struct sock *sk; struct sock *sk_redir; @@ -97,6 +111,7 @@ struct sk_psock { struct sk_buff_head ingress_skb; struct list_head ingress_msg; spinlock_t ingress_lock; + psock_ingress_charge_t ingress_charge; /** @msg_tot_len: Total bytes queued in ingress_msg list. */ u32 msg_tot_len; unsigned long state; @@ -411,10 +426,13 @@ static inline bool sk_psock_queue_empty(const struct sk_psock *psock) return psock ? list_empty(&psock->ingress_msg) : true; } -static inline void kfree_sk_msg(struct sk_msg *msg) +static inline void kfree_sk_msg(struct sk_psock *psock, struct sk_msg *msg) { - if (msg->skb) + if (msg->skb) { + spin_lock_bh(&psock->ingress_lock); consume_skb(msg->skb); + spin_unlock_bh(&psock->ingress_lock); + } kfree(msg); } diff --git a/include/net/udp.h b/include/net/udp.h index 700dbedcb15f..78229bc7424e 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -401,6 +401,7 @@ void udp_destruct_common(struct sock *sk); void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len); int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb); void udp_skb_destructor(struct sock *sk, struct sk_buff *skb); +void udp_sock_rfree(struct sk_buff *skb); struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, int *off, int *err); static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags, diff --git a/net/core/skmsg.c b/net/core/skmsg.c index ddde93dd8bc6..df0f846c63bb 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -488,7 +488,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg msg_rx->sg.start = i; if (!sge->length && (i == msg_rx->sg.end || sg_is_last(sge))) { msg_rx = sk_psock_dequeue_msg(psock); - kfree_sk_msg(msg_rx); + kfree_sk_msg(psock, msg_rx); } msg_rx = 
sk_psock_peek_msg(psock); } @@ -529,18 +529,6 @@ static struct sk_msg *alloc_sk_msg(gfp_t gfp) return msg; } -static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, - struct sk_buff *skb) -{ - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - return NULL; - - if (!sk_rmem_schedule(sk, skb, skb->truesize)) - return NULL; - - return alloc_sk_msg(GFP_KERNEL); -} - static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, u32 off, u32 len, struct sk_psock *psock, @@ -584,60 +572,51 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, return copied; } -static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len, bool take_ref); - -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len) +/* Default ingress charge for protocols that don't set a custom callback. + * Behaves like the original TCP logic: skip accounting for self-redirect. + */ +static int sk_psock_ingress_charge_default(struct sk_psock *psock, + struct sock *sk, + struct sk_buff *skb, + struct sk_msg *msg) { - struct sock *sk = psock->sk; - struct sk_msg *msg; - int err; + if (skb->sk == sk) { + skb_set_owner_r(skb, sk); + msg->sk = sk; + return 0; + } - /* If we are receiving on the same sock skb->sk is already assigned, - * skip memory accounting and owner transition seeing it already set - * correctly. - */ - if (unlikely(skb->sk == sk)) - return sk_psock_skb_ingress_self(psock, skb, off, len, true); - msg = sk_psock_create_ingress_msg(sk, skb); - if (!msg) + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + return -EAGAIN; + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) return -EAGAIN; - /* This will transition ownership of the data from the socket where - * the BPF program was run initiating the redirect to the socket - * we will eventually receive this data on. The data will be released - * from skb_consume found in __tcp_bpf_recvmsg() after its been copied - * into user buffers. 
- */ skb_set_owner_r(skb, sk); - err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true); - if (err < 0) - kfree(msg); - return err; + return 0; } -/* Puts an skb on the ingress queue of the socket already assigned to the - * skb. In this case we do not need to check memory limits or skb_set_owner_r - * because the skb is already accounted for here. - */ -static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb, - u32 off, u32 len, bool take_ref) +static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, + u32 off, u32 len, bool take_ref) { - struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC); struct sock *sk = psock->sk; + psock_ingress_charge_t charge_fn; + struct sk_msg *msg; int err; - if (unlikely(!msg)) + msg = alloc_sk_msg(take_ref ? GFP_KERNEL : GFP_ATOMIC); + if (!msg) return -EAGAIN; - skb_set_owner_r(skb, sk); - /* This is used in tcp_bpf_recvmsg_parser() to determine whether the - * data originates from the socket's own protocol stack. No need to - * refcount sk because msg's lifetime is bound to sk via the ingress_msg. 
- */ - msg->sk = sk; - err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); + charge_fn = psock->ingress_charge ?: sk_psock_ingress_charge_default; + err = charge_fn(psock, sk, skb, msg); + if (err < 0) { + kfree(msg); + return err; + } + + err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, + take_ref); if (err < 0) kfree(msg); return err; @@ -652,7 +631,7 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, return skb_send_sock(psock->sk, skb, off, len); } - return sk_psock_skb_ingress(psock, skb, off, len); + return sk_psock_skb_ingress(psock, skb, off, len, true); } static void sk_psock_skb_state(struct sk_psock *psock, @@ -1059,7 +1038,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, off = stm->offset; len = stm->full_len; } - err = sk_psock_skb_ingress_self(psock, skb, off, len, false); + err = sk_psock_skb_ingress(psock, skb, off, len, false); } if (err < 0) { spin_lock_bh(&psock->ingress_lock); diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index ca8a5cb8e569..d88b9516a470 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -706,6 +706,37 @@ int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc, } #endif /* CONFIG_BPF_STREAM_PARSER */ +static int tcp_psock_ingress_charge(struct sk_psock *psock, struct sock *sk, + struct sk_buff *skb, struct sk_msg *msg) +{ + /* TCP self-redirect: skb is already accounted by TCP stack, and + * lock_sock serializes the verdict path with recvmsg. Skip memory + * accounting but mark msg->sk for tcp_bpf_recvmsg_parser() to + * identify self-originated data. 
+ */ + if (skb->sk == sk) { + skb_set_owner_r(skb, sk); + msg->sk = sk; + return 0; + } + + spin_lock_bh(&psock->ingress_lock); + + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + goto err; + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) + goto err; + + skb_set_owner_r(skb, sk); + spin_unlock_bh(&psock->ingress_lock); + return 0; + +err: + spin_unlock_bh(&psock->ingress_lock); + return -EAGAIN; +} + int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) { int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4; @@ -739,6 +770,7 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) tcp_bpf_check_v6_needs_rebuild(psock->sk_proto); } + psock->ingress_charge = tcp_psock_ingress_charge; /* Pairs with lockless read in sk_clone_lock() */ sock_replace_proto(sk, &tcp_bpf_prots[family][config]); return 0; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index b96e47f1c8a2..8cefc3ed76c5 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2039,6 +2039,19 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, } EXPORT_SYMBOL(__skb_recv_udp); +/* skb destructor for sockmap-owned UDP skbs. + * sock_rfree modifies sk->sk_forward_alloc which must be protected + * by sk->sk_receive_queue.lock for UDP. 
+ */ +void udp_sock_rfree(struct sk_buff *skb) +{ + struct sock *sk = skb->sk; + + spin_lock_bh(&sk->sk_receive_queue.lock); + sock_rfree(skb); + spin_unlock_bh(&sk->sk_receive_queue.lock); +} + int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { struct sk_buff *skb; diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c index 91233e37cd97..6b117711193e 100644 --- a/net/ipv4/udp_bpf.c +++ b/net/ipv4/udp_bpf.c @@ -153,6 +153,42 @@ static int __init udp_bpf_v4_build_proto(void) } late_initcall(udp_bpf_v4_build_proto); +static int udp_psock_ingress_charge(struct sk_psock *psock, struct sock *sk, + struct sk_buff *skb, struct sk_msg *msg) +{ + /* When backlog retries a previously charged skb, the destructor + * is already set to udp_sock_rfree. Skip re-charging. + */ + if (unlikely(skb->destructor == udp_sock_rfree)) + return 0; + + /* UDP must always do full memory accounting, even for self-redirect. + * Unlike TCP, udp_rmem_release() partially reclaims sk_forward_alloc + * during skb_recv_udp(), so the skb is no longer properly accounted + * when it reaches the sockmap ingress path. + * + * sk->sk_receive_queue.lock is required because that's the lock UDP + * uses internally to protect sk_forward_alloc. + */ + spin_lock_bh(&sk->sk_receive_queue.lock); + + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + goto err; + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) + goto err; + + skb_set_owner_r(skb, sk); + spin_unlock_bh(&sk->sk_receive_queue.lock); + + skb->destructor = udp_sock_rfree; + return 0; + +err: + spin_unlock_bh(&sk->sk_receive_queue.lock); + return -EAGAIN; +} + int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) { int family = sk->sk_family == AF_INET ? 
UDP_BPF_IPV4 : UDP_BPF_IPV6; @@ -166,6 +202,7 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore) if (sk->sk_family == AF_INET6) udp_bpf_check_v6_needs_rebuild(psock->sk_proto); + psock->ingress_charge = udp_psock_ingress_charge; sock_replace_proto(sk, &udp_bpf_prots[family]); return 0; } ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 6:37 ` Jiayuan Chen @ 2026-03-05 7:48 ` Kuniyuki Iwashima 2026-03-05 8:30 ` Jiayuan Chen 0 siblings, 1 reply; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-05 7:48 UTC (permalink / raw) To: Jiayuan Chen Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > [...] > > </TASK> > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > --- > > v4: Fix deadlock when requeued > > v2: Fix build failure when CONFIG_INET=n > > --- > > include/net/udp.h | 9 +++++++++ > > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > > net/ipv4/udp.c | 9 +++++++++ > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/include/net/udp.h b/include/net/udp.h > > index 700dbedcb15f..ae38a4da9388 100644 > > --- a/include/net/udp.h > > +++ b/include/net/udp.h > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > > struct sk_buff *skb); > > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > > __be16 sport, __be16 dport); > > + > > +#ifdef CONFIG_INET > > +void udp_sock_rfree(struct sk_buff *skb); > > +#else > > +static inline void udp_sock_rfree(struct sk_buff *skb) > > +{ > > +} > > +#endif > > + > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > > > /* UDP uses skb->dev_scratch to cache as much information as possible and avoid > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > index 96f43e0dbb17..dd9134a45663 100644 > > --- a/net/core/skmsg.c > > +++ b/net/core/skmsg.c > > 
@@ -7,6 +7,7 @@ > > > > #include <net/sock.h> > > #include <net/tcp.h> > > +#include <net/udp.h> > > #include <net/tls.h> > > #include <trace/events/sock.h> > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > u32 off, u32 len, bool take_ref) > > { > > struct sock *sk = psock->sk; > > + bool is_udp = sk_is_udp(sk); > > struct sk_msg *msg; > > int err = -EAGAIN; > > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > if (!msg) > > goto out; > > > > - if (skb->sk != sk && take_ref) { > > + if (is_udp) { > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > + goto enqueue; > > + > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > + } > > + > > + if (is_udp || (skb->sk != sk && take_ref)) { > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > - goto free; > > + goto unlock; > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > - goto free; > > - } else { > > + goto unlock; > > + } else if (skb->sk == sk || !take_ref) { > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > * data originates from the socket's own protocol stack. No need to > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > * into user buffers. 
> > */ > > skb_set_owner_r(skb, sk); > > + > > + if (is_udp) { > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > + > > + skb->destructor = udp_sock_rfree; > > + } > > + > > +enqueue: > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > if (err < 0) > > goto free; > > out: > > return err; > > + > > +unlock: > > + if (is_udp) > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > free: > > kfree(msg); > > goto out; > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index 422c96fea249..831d26748a90 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > > } > > EXPORT_SYMBOL(__skb_recv_udp); > > > > +void udp_sock_rfree(struct sk_buff *skb) > > +{ > > + struct sock *sk = skb->sk; > > + > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > + sock_rfree(skb); > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > +} > > + > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > { > > struct sk_buff *skb; > > -- > > 2.53.0.371.g1d285c8824-goog > > > > There are too many protocol-specific checks in the generic skmsg code > here. This should be abstracted out. > > Also TCP also has a similar issue with sk_forward_alloc concurrency > between the backlog charge and recvmsg uncharge paths so the abstraction > is necessary. > > I've put together a simple diff based on your patch for reference. > I haven't tested it thoroughly, but it at least handles TCP and UDP > separately through a callback, keeping the generic code clean. I can follow up on TCP, but TCP needs another fix, which cannot be factored out. Some wrong assumptions in your patch: 1) tcp_psock_ingress_charge() uses psock->ingress_lock, but it does not protect any tcp_sock fields, especially sk->sk_forwad_alloc. 2) TCP memory accounting happens under bh_lock_sock() _or_ lock_sock(). 
sendmsg()/recvmsg() work under lock_sock(), and the TCP fast path puts the skb on the backlog if lock_sock() is held by userspace. This means even a simple bh_lock_sock() in tcp_psock_ingress_charge() races with memory accounting happening under lock_sock(). Since sk_psock_skb_ingress() is called from both GFP_KERNEL and GFP_ATOMIC contexts, the simplest fix would be to put lock_sock() in sk_psock_backlog() for TCP, which is ugly though. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 7:48 ` Kuniyuki Iwashima @ 2026-03-05 8:30 ` Jiayuan Chen 2026-03-05 9:27 ` Kuniyuki Iwashima 0 siblings, 1 reply; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 8:30 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com> wrote: > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > > [...] > > </TASK> > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > --- > > v4: Fix deadlock when requeued > > v2: Fix build failure when CONFIG_INET=n > > --- > > include/net/udp.h | 9 +++++++++ > > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > > net/ipv4/udp.c | 9 +++++++++ > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/include/net/udp.h b/include/net/udp.h > > index 700dbedcb15f..ae38a4da9388 100644 > > --- a/include/net/udp.h > > +++ b/include/net/udp.h > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > > struct sk_buff *skb); > > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > > __be16 sport, __be16 dport); > > + > > +#ifdef CONFIG_INET > > +void udp_sock_rfree(struct sk_buff *skb); > > +#else > > +static inline void udp_sock_rfree(struct sk_buff *skb) > > +{ > > +} > > +#endif > > + > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > > > /* UDP uses skb->dev_scratch to cache as much information as
possible and avoid > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > index 96f43e0dbb17..dd9134a45663 100644 > > --- a/net/core/skmsg.c > > +++ b/net/core/skmsg.c > > @@ -7,6 +7,7 @@ > > > > #include <net/sock.h> > > #include <net/tcp.h> > > +#include <net/udp.h> > > #include <net/tls.h> > > #include <trace/events/sock.h> > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > u32 off, u32 len, bool take_ref) > > { > > struct sock *sk = psock->sk; > > + bool is_udp = sk_is_udp(sk); > > struct sk_msg *msg; > > int err = -EAGAIN; > > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > if (!msg) > > goto out; > > > > - if (skb->sk != sk && take_ref) { > > + if (is_udp) { > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > + goto enqueue; > > + > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > + } > > + > > + if (is_udp || (skb->sk != sk && take_ref)) { > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > - goto free; > > + goto unlock; > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > - goto free; > > - } else { > > + goto unlock; > > + } else if (skb->sk == sk || !take_ref) { > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > * data originates from the socket's own protocol stack. No need to > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > * into user buffers. 
> > */ > > skb_set_owner_r(skb, sk); > > + > > + if (is_udp) { > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > + > > + skb->destructor = udp_sock_rfree; > > + } > > + > > +enqueue: > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > if (err < 0) > > goto free; > > out: > > return err; > > + > > +unlock: > > + if (is_udp) > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > free: > > kfree(msg); > > goto out; > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index 422c96fea249..831d26748a90 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > > } > > EXPORT_SYMBOL(__skb_recv_udp); > > > > +void udp_sock_rfree(struct sk_buff *skb) > > +{ > > + struct sock *sk = skb->sk; > > + > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > + sock_rfree(skb); > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > +} > > + > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > { > > struct sk_buff *skb; > > -- > > 2.53.0.371.g1d285c8824-goog > > > > There are too many protocol-specific checks in the generic skmsg code > > here. This should be abstracted out. > > > > Also TCP also has a similar issue with sk_forward_alloc concurrency > > between the backlog charge and recvmsg uncharge paths so the abstraction > > is necessary. > > > > I've put together a simple diff based on your patch for reference. > > I haven't tested it thoroughly, but it at least handles TCP and UDP > > separately through a callback, keeping the generic code clean. > > > I can follow up on TCP, but TCP needs another fix, which > cannot be factored out. > > Some wrong assumptions in your patch: > > 1) tcp_psock_ingress_charge() uses psock->ingress_lock, > but it does not protect any tcp_sock fields, especially > sk->sk_forwad_alloc. > > 2) TCP memory accounting happens under bh_lock_sock() > _or_ lock_sock(). 
sendmsg()/recvmsg() works under > lock_sock() and TCP fast path puts skb to backlog if > lock_sock() is held by userspace. This means even > a simple bh_lock_sock() in tcp_psock_ingress_charge() > race with memory accounting happening under lock_sock(). > > Since sk_psock_skb_ingress() is called from both GFP_KERNEL > and GFP_ATOMIC context, the simplest fix would be to put > lock_sock() in sk_psock_backlog() for TCP, which is ugly though. > OK, thanks. Now my only concern is keeping UDP-specific logic out of skmsg.c. We could use a function pointer so that UDP implements its own ingress charge in udp_bpf.c, while other protocols just use a default no-op or even NULL.
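A very rough, untested sketch of the function-pointer idea above; the hook name, its struct placement, and the helper are all made up for illustration:

```c
/* Untested sketch: each protocol could supply its own charge hook,
 * installed from udp_bpf.c for UDP and left NULL elsewhere.
 */
struct sk_psock_ingress_ops {
	int (*charge)(struct sock *sk, struct sk_buff *skb);
};

/* Hypothetical UDP implementation: charge under the receive queue lock. */
static int udp_psock_ingress_charge(struct sock *sk, struct sk_buff *skb)
{
	int err = -EAGAIN;

	spin_lock_bh(&sk->sk_receive_queue.lock);
	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
	    sk_rmem_schedule(sk, skb, skb->truesize))
		err = 0;
	spin_unlock_bh(&sk->sk_receive_queue.lock);

	return err;
}
```

skmsg.c would then call psock->ingress_ops->charge() when non-NULL and fall back to the current generic path otherwise; whether the lock scope can stay inside such a hook is the open question discussed below.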
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 8:30 ` Jiayuan Chen @ 2026-03-05 9:27 ` Kuniyuki Iwashima 2026-03-05 10:45 ` Jiayuan Chen 0 siblings, 1 reply; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-05 9:27 UTC (permalink / raw) To: Jiayuan Chen Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On Thu, Mar 5, 2026 at 12:30 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com mailto:kuniyu@google.com?to=%22Kuniyuki%20Iwashima%22%20%3Ckuniyu%40google.com%3E > wrote: > > > > > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > > > [...] > > > </TASK> > > > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > > --- > > > v4: Fix deadlock when requeued > > > v2: Fix build failure when CONFIG_INET=n > > > --- > > > include/net/udp.h | 9 +++++++++ > > > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > > > net/ipv4/udp.c | 9 +++++++++ > > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > > > diff --git a/include/net/udp.h b/include/net/udp.h > > > index 700dbedcb15f..ae38a4da9388 100644 > > > --- a/include/net/udp.h > > > +++ b/include/net/udp.h > > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > > > struct sk_buff *skb); > > > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > > > __be16 sport, __be16 dport); > > > + > > > +#ifdef CONFIG_INET > > > +void udp_sock_rfree(struct sk_buff *skb); > > > +#else > > > +static inline void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > +} > > 
> +#endif > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > > > > > /* UDP uses skb->dev_scratch to cache as much information as possible and avoid > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > > index 96f43e0dbb17..dd9134a45663 100644 > > > --- a/net/core/skmsg.c > > > +++ b/net/core/skmsg.c > > > @@ -7,6 +7,7 @@ > > > > > > #include <net/sock.h> > > > #include <net/tcp.h> > > > +#include <net/udp.h> > > > #include <net/tls.h> > > > #include <trace/events/sock.h> > > > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > u32 off, u32 len, bool take_ref) > > > { > > > struct sock *sk = psock->sk; > > > + bool is_udp = sk_is_udp(sk); > > > struct sk_msg *msg; > > > int err = -EAGAIN; > > > > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > if (!msg) > > > goto out; > > > > > > - if (skb->sk != sk && take_ref) { > > > + if (is_udp) { > > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > > + goto enqueue; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + } > > > + > > > + if (is_udp || (skb->sk != sk && take_ref)) { > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > - goto free; > > > + goto unlock; > > > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > > - goto free; > > > - } else { > > > + goto unlock; > > > + } else if (skb->sk == sk || !take_ref) { > > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > > * data originates from the socket's own protocol stack. No need to > > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > * into user buffers. 
> > > */ > > > skb_set_owner_r(skb, sk); > > > + > > > + if (is_udp) { > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > + > > > + skb->destructor = udp_sock_rfree; > > > + } > > > + > > > +enqueue: > > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > > if (err < 0) > > > goto free; > > > out: > > > return err; > > > + > > > +unlock: > > > + if (is_udp) > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > free: > > > kfree(msg); > > > goto out; > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index 422c96fea249..831d26748a90 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > > > } > > > EXPORT_SYMBOL(__skb_recv_udp); > > > > > > +void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > + struct sock *sk = skb->sk; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + sock_rfree(skb); > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > +} > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > > { > > > struct sk_buff *skb; > > > -- > > > 2.53.0.371.g1d285c8824-goog > > > > > > There are too many protocol-specific checks in the generic skmsg code > > > here. This should be abstracted out. > > > > > > Also TCP also has a similar issue with sk_forward_alloc concurrency > > > between the backlog charge and recvmsg uncharge paths so the abstraction > > > is necessary. > > > > > > I've put together a simple diff based on your patch for reference. > > > I haven't tested it thoroughly, but it at least handles TCP and UDP > > > separately through a callback, keeping the generic code clean. > > > > > I can follow up on TCP, but TCP needs another fix, which > > cannot be factored out. 
> > > > Some wrong assumptions in your patch: > > > > 1) tcp_psock_ingress_charge() uses psock->ingress_lock, > > but it does not protect any tcp_sock fields, especially > > sk->sk_forwad_alloc. > > > > 2) TCP memory accounting happens under bh_lock_sock() > > _or_ lock_sock(). sendmsg()/recvmsg() works under > > lock_sock() and TCP fast path puts skb to backlog if > > lock_sock() is held by userspace. This means even > > a simple bh_lock_sock() in tcp_psock_ingress_charge() > > race with memory accounting happening under lock_sock(). > > > > Since sk_psock_skb_ingress() is called from both GFP_KERNEL > > and GFP_ATOMIC context, the simplest fix would be to put > > lock_sock() in sk_psock_backlog() for TCP, which is ugly though. > > > > OK, thanks. > > > > Now my only concern is keeping UDP-specific logic > > out of skmsg.c. We already have TCP-only msg->sk logic there that you added. It just does not look like TCP code. If really needed, we can do such cleanup in bpf-next for TCP and UDP altogether (probably after another TCP fix?). > > We could use a function pointer so that UDP > implements its own ingress charge in udp_bpf.c, while other > protocols just use a default no-op or even NULL. If TCP requires locking outside of the scope, there's no point adding one more extra layer/complexity just for UDP. Even if we go that way, most likely we cannot get rid of protocol-specific code (UDP/TCP-only hook) and end up with the indirect call helpers, which will be more expensive than simple if logic.
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 9:27 ` Kuniyuki Iwashima @ 2026-03-05 10:45 ` Jiayuan Chen 2026-03-05 11:04 ` Jiayuan Chen 0 siblings, 1 reply; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 10:45 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a March 5, 2026 at 17:27, "Kuniyuki Iwashima" <kuniyu@google.com mailto:kuniyu@google.com?to=%22Kuniyuki%20Iwashima%22%20%3Ckuniyu%40google.com%3E > wrote: > > On Thu, Mar 5, 2026 at 12:30 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com mailto:kuniyu@google.com?to=%22Kuniyuki%20Iwashima%22%20%3Ckuniyu%40google.com%3E > wrote: > > > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > > > [...] > > > </TASK> > > > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > > --- > > > v4: Fix deadlock when requeued > > > v2: Fix build failure when CONFIG_INET=n > > > --- > > > include/net/udp.h | 9 +++++++++ > > > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > > > net/ipv4/udp.c | 9 +++++++++ > > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > > > diff --git a/include/net/udp.h b/include/net/udp.h > > > index 700dbedcb15f..ae38a4da9388 100644 > > > --- a/include/net/udp.h > > > +++ b/include/net/udp.h > > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > > > struct sk_buff *skb); > > > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > > > __be16 sport, __be16 dport); > > > + > > > 
+#ifdef CONFIG_INET > > > +void udp_sock_rfree(struct sk_buff *skb); > > > +#else > > > +static inline void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > +} > > > +#endif > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > > > > > /* UDP uses skb->dev_scratch to cache as much information as possible and avoid > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > > index 96f43e0dbb17..dd9134a45663 100644 > > > --- a/net/core/skmsg.c > > > +++ b/net/core/skmsg.c > > > @@ -7,6 +7,7 @@ > > > > > > #include <net/sock.h> > > > #include <net/tcp.h> > > > +#include <net/udp.h> > > > #include <net/tls.h> > > > #include <trace/events/sock.h> > > > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > u32 off, u32 len, bool take_ref) > > > { > > > struct sock *sk = psock->sk; > > > + bool is_udp = sk_is_udp(sk); > > > struct sk_msg *msg; > > > int err = -EAGAIN; > > > > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > if (!msg) > > > goto out; > > > > > > - if (skb->sk != sk && take_ref) { > > > + if (is_udp) { > > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > > + goto enqueue; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + } > > > + > > > + if (is_udp || (skb->sk != sk && take_ref)) { > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > - goto free; > > > + goto unlock; > > > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > > - goto free; > > > - } else { > > > + goto unlock; > > > + } else if (skb->sk == sk || !take_ref) { > > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > > * data originates from the socket's own protocol stack. No need to > > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. 
> > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > * into user buffers. > > > */ > > > skb_set_owner_r(skb, sk); > > > + > > > + if (is_udp) { > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > + > > > + skb->destructor = udp_sock_rfree; > > > + } > > > + > > > +enqueue: > > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > > if (err < 0) > > > goto free; > > > out: > > > return err; > > > + > > > +unlock: > > > + if (is_udp) > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > free: > > > kfree(msg); > > > goto out; > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index 422c96fea249..831d26748a90 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > > > } > > > EXPORT_SYMBOL(__skb_recv_udp); > > > > > > +void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > + struct sock *sk = skb->sk; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + sock_rfree(skb); > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > +} > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > > { > > > struct sk_buff *skb; > > > -- > > > 2.53.0.371.g1d285c8824-goog > > > > > > There are too many protocol-specific checks in the generic skmsg code > > > here. This should be abstracted out. > > > > > > Also TCP also has a similar issue with sk_forward_alloc concurrency > > > between the backlog charge and recvmsg uncharge paths so the abstraction > > > is necessary. > > > > > > I've put together a simple diff based on your patch for reference. > > > I haven't tested it thoroughly, but it at least handles TCP and UDP > > > separately through a callback, keeping the generic code clean. > > > > > I can follow up on TCP, but TCP needs another fix, which > > cannot be factored out. 
> > > > Some wrong assumptions in your patch: > > > > 1) tcp_psock_ingress_charge() uses psock->ingress_lock, > > but it does not protect any tcp_sock fields, especially > > sk->sk_forwad_alloc. > > > > 2) TCP memory accounting happens under bh_lock_sock() > > _or_ lock_sock(). sendmsg()/recvmsg() works under > > lock_sock() and TCP fast path puts skb to backlog if > > lock_sock() is held by userspace. This means even > > a simple bh_lock_sock() in tcp_psock_ingress_charge() > > race with memory accounting happening under lock_sock(). > > > > Since sk_psock_skb_ingress() is called from both GFP_KERNEL > > and GFP_ATOMIC context, the simplest fix would be to put > > lock_sock() in sk_psock_backlog() for TCP, which is ugly though. > > > > OK, thanks. > > > > Now my only concern is keeping UDP-specific logic > > out of skmsg.c. > > > We already have TCP-only msg->sk logic there that > you added. It just does not look like TCP code. The msg->sk field marks ingress-self vs ingress-other, which is protocol-agnostic metadata. TCP happens to consume it for seq tracking, but the assignment itself has no TCP-specific check (no is_tcp_sk() or similar). > If really needed, we can do such cleanup in bpf-next > for TCP and UDP altogether (probably after another > TCP fix ?). > > > > > We could use a function pointer so that UDP > > implements its own ingress charge in udp_bpf.c, while other > > protocols just use a default no-op or even NULL. > > > If TCP requires locking outside of the scope, there's > no point adding one more extra layer/complexity just > for UDP. > > Even if we go that way, most likely we cannot get rid > of protocol-specific code (UDP/TCP-only hook) and > end up with the indirect call helpers, which will be > expensive than simple if logic. > An alternative would be to add such sk_psock_udp_skb_ingress() in udp_bpf.c, without adding any indirection layer. This way we don't need to design a new abstraction before the approach is settled. 
Just a UDP-specific ingress helper, which keeps things clean enough for now and avoids scattering is_udp checks all over sk_psock_skb_ingress(); we would just need one if (is_udp) check. Thanks.
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 10:45 ` Jiayuan Chen @ 2026-03-05 11:04 ` Jiayuan Chen 2026-03-05 17:42 ` Kuniyuki Iwashima 0 siblings, 1 reply; 31+ messages in thread From: Jiayuan Chen @ 2026-03-05 11:04 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a March 5, 2026 at 18:45, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote: > > March 5, 2026 at 17:27, "Kuniyuki Iwashima" <kuniyu@google.com> wrote: > > > > > On Thu, Mar 5, 2026 at 12:30 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com> wrote: > > > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > > > [...]
> > > </TASK> > > > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com > > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/ > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> > > > --- > > > v4: Fix deadlock when requeued > > > v2: Fix build failure when CONFIG_INET=n > > > --- > > > include/net/udp.h | 9 +++++++++ > > > net/core/skmsg.c | 29 +++++++++++++++++++++++++---- > > > net/ipv4/udp.c | 9 +++++++++ > > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > > > diff --git a/include/net/udp.h b/include/net/udp.h > > > index 700dbedcb15f..ae38a4da9388 100644 > > > --- a/include/net/udp.h > > > +++ b/include/net/udp.h > > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net, > > > struct sk_buff *skb); > > > struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, > > > __be16 sport, __be16 dport); > > > + > > > +#ifdef CONFIG_INET > > > +void udp_sock_rfree(struct sk_buff *skb); > > > +#else > > > +static inline void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > +} > > > +#endif > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > > > > > > /* UDP uses skb->dev_scratch to cache as much information as possible and avoid > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > > index 96f43e0dbb17..dd9134a45663 100644 > > > --- a/net/core/skmsg.c > > > +++ b/net/core/skmsg.c > > > @@ -7,6 +7,7 @@ > > > > > > #include <net/sock.h> > > > #include <net/tcp.h> > > > +#include <net/udp.h> > > > #include <net/tls.h> > > > #include <trace/events/sock.h> > > > > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > u32 off, u32 len, bool take_ref) > > > { > > > struct sock *sk = psock->sk; > > > + bool is_udp = sk_is_udp(sk); > > > struct sk_msg *msg; > > > int err = -EAGAIN; > > > > > > @@ -583,13 +585,20 @@ static 
int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > if (!msg) > > > goto out; > > > > > > - if (skb->sk != sk && take_ref) { > > > + if (is_udp) { > > > + if (unlikely(skb->destructor == udp_sock_rfree)) > > > + goto enqueue; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + } > > > + > > > + if (is_udp || (skb->sk != sk && take_ref)) { > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > - goto free; > > > + goto unlock; > > > > > > if (!sk_rmem_schedule(sk, skb, skb->truesize)) > > > - goto free; > > > - } else { > > > + goto unlock; > > > + } else if (skb->sk == sk || !take_ref) { > > > /* This is used in tcp_bpf_recvmsg_parser() to determine whether the > > > * data originates from the socket's own protocol stack. No need to > > > * refcount sk because msg's lifetime is bound to sk via the ingress_msg. > > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, > > > * into user buffers. 
> > > */ > > > skb_set_owner_r(skb, sk); > > > + > > > + if (is_udp) { > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > + > > > + skb->destructor = udp_sock_rfree; > > > + } > > > + > > > +enqueue: > > > err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref); > > > if (err < 0) > > > goto free; > > > out: > > > return err; > > > + > > > +unlock: > > > + if (is_udp) > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > free: > > > kfree(msg); > > > goto out; > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index 422c96fea249..831d26748a90 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, > > > } > > > EXPORT_SYMBOL(__skb_recv_udp); > > > > > > +void udp_sock_rfree(struct sk_buff *skb) > > > +{ > > > + struct sock *sk = skb->sk; > > > + > > > + spin_lock_bh(&sk->sk_receive_queue.lock); > > > + sock_rfree(skb); > > > + spin_unlock_bh(&sk->sk_receive_queue.lock); > > > +} > > > + > > > int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > > { > > > struct sk_buff *skb; > > > -- > > > 2.53.0.371.g1d285c8824-goog > > > > > > There are too many protocol-specific checks in the generic skmsg code > > > here. This should be abstracted out. > > > > > > Also TCP also has a similar issue with sk_forward_alloc concurrency > > > between the backlog charge and recvmsg uncharge paths so the abstraction > > > is necessary. > > > > > > I've put together a simple diff based on your patch for reference. > > > I haven't tested it thoroughly, but it at least handles TCP and UDP > > > separately through a callback, keeping the generic code clean. > > > > > I can follow up on TCP, but TCP needs another fix, which > > cannot be factored out. 
> > > > Some wrong assumptions in your patch: > > > > 1) tcp_psock_ingress_charge() uses psock->ingress_lock, > > but it does not protect any tcp_sock fields, especially > > sk->sk_forwad_alloc. > > > > 2) TCP memory accounting happens under bh_lock_sock() > > _or_ lock_sock(). sendmsg()/recvmsg() works under > > lock_sock() and TCP fast path puts skb to backlog if > > lock_sock() is held by userspace. This means even > > a simple bh_lock_sock() in tcp_psock_ingress_charge() > > race with memory accounting happening under lock_sock(). > > > > Since sk_psock_skb_ingress() is called from both GFP_KERNEL > > and GFP_ATOMIC context, the simplest fix would be to put > > lock_sock() in sk_psock_backlog() for TCP, which is ugly though. > > > > OK, thanks. > > > > Now my only concern is keeping UDP-specific logic > > out of skmsg.c. > > > > We already have TCP-only msg->sk logic there that > > you added. It just does not look like TCP code. > > > The msg->sk field marks ingress-self vs ingress-other, which is > protocol-agnostic metadata. TCP happens to consume it for seq > tracking, but the assignment itself has no TCP-specific check > (no is_tcp_sk() or similar). > > > > > If really needed, we can do such cleanup in bpf-next > > for TCP and UDP altogether (probably after another > > TCP fix ?). > > > > > > We could use a function pointer so that UDP > > implements its own ingress charge in udp_bpf.c, while other > > protocols just use a default no-op or even NULL. > > > > If TCP requires locking outside of the scope, there's > > no point adding one more extra layer/complexity just > > for UDP. > > > > Even if we go that way, most likely we cannot get rid > > of protocol-specific code (UDP/TCP-only hook) and > > end up with the indirect call helpers, which will be > > expensive than simple if logic. > > > An alternative would be to add such sk_psock_udp_skb_ingress() in > udp_bpf.c, without adding any indirection layer. 
> > This way we don't need to design a new abstraction before the > approach is settled. Just a UDP-specific ingress helper, which > keeps things clean enough for now, and avoid scattering is_udp > checks all over sk_psock_skb_ingress, just need one if (is_udp). > > Thanks. > After thinking about it for a while, since the UDP-specific behaviour only appears in the redirect-to-self path, how about implementing sk_psock_udp_skb_ingress_self() in udp_bpf.c and calling it from sk_psock_skb_ingress_self()? That gives us a single, contained check point rather than scattering is_udp all over skmsg.c, which seems more aligned with the "simple" you described before.
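A minimal, untested sketch of what that single check point could look like; sk_psock_udp_skb_ingress_self() is the hypothetical udp_bpf.c helper named above, and the generic path body is elided:

```c
/* Untested sketch: keep skmsg.c down to one sk_is_udp() test and move
 * all UDP-specific charging/locking into a helper in udp_bpf.c.
 */
static int sk_psock_skb_ingress_self(struct sk_psock *psock,
				     struct sk_buff *skb, u32 off, u32 len)
{
	if (sk_is_udp(psock->sk))
		return sk_psock_udp_skb_ingress_self(psock, skb, off, len);

	/* ... existing generic/TCP ingress-self path ... */
	return 0;
}
```

Note this only helps if the UDP-specific behaviour really is confined to the redirect-to-self path; otherwise the check leaks back into the common sk_psock_skb_ingress() anyway.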
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP. 2026-03-05 11:04 ` Jiayuan Chen @ 2026-03-05 17:42 ` Kuniyuki Iwashima 2026-03-06 7:44 ` Jiayuan Chen 0 siblings, 1 reply; 31+ messages in thread From: Kuniyuki Iwashima @ 2026-03-05 17:42 UTC (permalink / raw) To: Jiayuan Chen Cc: John Fastabend, Jakub Sitnicki, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev, syzbot+5b3b7e51dda1be027b7a On Thu, Mar 5, 2026 at 3:04 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > March 5, 2026 at 18:45, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote: > > > > > > March 5, 2026 at 17:27, "Kuniyuki Iwashima" <kuniyu@google.com> wrote: > > > > > > > > On Thu, Mar 5, 2026 at 12:30 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > > > > March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com> wrote: > > > > > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote: > > > > > > > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote: > > > > [...]
> > > > </TASK>
> > > >
> > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")
> > > > Reported-by: syzbot+5b3b7e51dda1be027b7a@syzkaller.appspotmail.com
> > > > Closes: https://lore.kernel.org/netdev/698f84c6.a70a0220.2c38d7.00cb.GAE@google.com/
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
> > > > ---
> > > > v4: Fix deadlock when requeued
> > > > v2: Fix build failure when CONFIG_INET=n
> > > > ---
> > > >  include/net/udp.h |  9 +++++++++
> > > >  net/core/skmsg.c  | 29 +++++++++++++++++++++++++----
> > > >  net/ipv4/udp.c    |  9 +++++++++
> > > >  3 files changed, 43 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/net/udp.h b/include/net/udp.h
> > > > index 700dbedcb15f..ae38a4da9388 100644
> > > > --- a/include/net/udp.h
> > > > +++ b/include/net/udp.h
> > > > @@ -455,6 +455,15 @@ struct sock *__udp6_lib_lookup(const struct net *net,
> > > >  					struct sk_buff *skb);
> > > >  struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
> > > >  				 __be16 sport, __be16 dport);
> > > > +
> > > > +#ifdef CONFIG_INET
> > > > +void udp_sock_rfree(struct sk_buff *skb);
> > > > +#else
> > > > +static inline void udp_sock_rfree(struct sk_buff *skb)
> > > > +{
> > > > +}
> > > > +#endif
> > > > +
> > > >  int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
> > > >
> > > >  /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
> > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > > > index 96f43e0dbb17..dd9134a45663 100644
> > > > --- a/net/core/skmsg.c
> > > > +++ b/net/core/skmsg.c
> > > > @@ -7,6 +7,7 @@
> > > >
> > > >  #include <net/sock.h>
> > > >  #include <net/tcp.h>
> > > > +#include <net/udp.h>
> > > >  #include <net/tls.h>
> > > >  #include <trace/events/sock.h>
> > > >
> > > > @@ -576,6 +577,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
> > > >  				u32 off, u32 len, bool take_ref)
> > > >  {
> > > >  	struct sock *sk = psock->sk;
> > > > +	bool is_udp = sk_is_udp(sk);
> > > >  	struct sk_msg *msg;
> > > >  	int err = -EAGAIN;
> > > >
> > > > @@ -583,13 +585,20 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
> > > >  	if (!msg)
> > > >  		goto out;
> > > >
> > > > -	if (skb->sk != sk && take_ref) {
> > > > +	if (is_udp) {
> > > > +		if (unlikely(skb->destructor == udp_sock_rfree))
> > > > +			goto enqueue;
> > > > +
> > > > +		spin_lock_bh(&sk->sk_receive_queue.lock);
> > > > +	}
> > > > +
> > > > +	if (is_udp || (skb->sk != sk && take_ref)) {
> > > >  		if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > -			goto free;
> > > > +			goto unlock;
> > > >
> > > >  		if (!sk_rmem_schedule(sk, skb, skb->truesize))
> > > > -			goto free;
> > > > -	} else {
> > > > +			goto unlock;
> > > > +	} else if (skb->sk == sk || !take_ref) {
> > > >  		/* This is used in tcp_bpf_recvmsg_parser() to determine whether the
> > > >  		 * data originates from the socket's own protocol stack. No need to
> > > >  		 * refcount sk because msg's lifetime is bound to sk via the ingress_msg.
> > > > @@ -604,11 +613,23 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
> > > >  	 * into user buffers.
> > > >  	 */
> > > >  	skb_set_owner_r(skb, sk);
> > > > +
> > > > +	if (is_udp) {
> > > > +		spin_unlock_bh(&sk->sk_receive_queue.lock);
> > > > +
> > > > +		skb->destructor = udp_sock_rfree;
> > > > +	}
> > > > +
> > > > +enqueue:
> > > >  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
> > > >  	if (err < 0)
> > > >  		goto free;
> > > > out:
> > > >  	return err;
> > > > +
> > > > +unlock:
> > > > +	if (is_udp)
> > > > +		spin_unlock_bh(&sk->sk_receive_queue.lock);
> > > >  free:
> > > >  	kfree(msg);
> > > >  	goto out;
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 422c96fea249..831d26748a90 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -2039,6 +2039,15 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
> > > >  }
> > > >  EXPORT_SYMBOL(__skb_recv_udp);
> > > >
> > > > +void udp_sock_rfree(struct sk_buff *skb)
> > > > +{
> > > > +	struct sock *sk = skb->sk;
> > > > +
> > > > +	spin_lock_bh(&sk->sk_receive_queue.lock);
> > > > +	sock_rfree(skb);
> > > > +	spin_unlock_bh(&sk->sk_receive_queue.lock);
> > > > +}
> > > > +
> > > >  int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > > >  {
> > > >  	struct sk_buff *skb;
> > > > --
> > > > 2.53.0.371.g1d285c8824-goog
> > > >
> > > > There are too many protocol-specific checks in the generic skmsg code
> > > > here. This should be abstracted out.
> > > >
> > > > TCP also has a similar issue with sk_forward_alloc concurrency
> > > > between the backlog charge and recvmsg uncharge paths, so the
> > > > abstraction is necessary.
> > > >
> > > > I've put together a simple diff based on your patch for reference.
> > > > I haven't tested it thoroughly, but it at least handles TCP and UDP
> > > > separately through a callback, keeping the generic code clean.
> > > >
> > > I can follow up on TCP, but TCP needs another fix, which
> > > cannot be factored out.
> > >
> > > Some wrong assumptions in your patch:
> > >
> > > 1) tcp_psock_ingress_charge() uses psock->ingress_lock,
> > > but it does not protect any tcp_sock fields, especially
> > > sk->sk_forward_alloc.
> > >
> > > 2) TCP memory accounting happens under bh_lock_sock()
> > > _or_ lock_sock(). sendmsg()/recvmsg() works under
> > > lock_sock(), and the TCP fast path puts the skb on the backlog
> > > if lock_sock() is held by userspace. This means even
> > > a simple bh_lock_sock() in tcp_psock_ingress_charge()
> > > races with memory accounting happening under lock_sock().
> > >
> > > Since sk_psock_skb_ingress() is called from both GFP_KERNEL
> > > and GFP_ATOMIC context, the simplest fix would be to put
> > > lock_sock() in sk_psock_backlog() for TCP, which is ugly though.
> > >
> > > OK, thanks.
> > >
> > > Now my only concern is keeping UDP-specific logic
> > > out of skmsg.c.
> > >
> > > We already have TCP-only msg->sk logic there that
> > > you added. It just does not look like TCP code.
> > >
> > The msg->sk field marks ingress-self vs ingress-other, which is
> > protocol-agnostic metadata. TCP happens to consume it for seq
> > tracking, but the assignment itself has no TCP-specific check
> > (no is_tcp_sk() or similar).
> >
> > > If really needed, we can do such cleanup in bpf-next
> > > for TCP and UDP altogether (probably after another
> > > TCP fix ?).
> > >
> > > We could use a function pointer so that UDP
> > > implements its own ingress charge in udp_bpf.c, while other
> > > protocols just use a default no-op or even NULL.
> > >
> > > If TCP requires locking outside of the scope, there's
> > > no point adding one more extra layer/complexity just
> > > for UDP.
> > >
> > > Even if we go that way, most likely we cannot get rid
> > > of protocol-specific code (UDP/TCP-only hook) and
> > > end up with indirect call helpers, which will be more
> > > expensive than simple if logic.
> > >
> > An alternative would be to add such a sk_psock_udp_skb_ingress() in
> > udp_bpf.c, without adding any indirection layer.
> >
> > This way we don't need to design a new abstraction before the
> > approach is settled. Just a UDP-specific ingress helper, which
> > keeps things clean enough for now, and avoids scattering is_udp
> > checks all over sk_psock_skb_ingress(); we just need one if (is_udp).
> >
> > Thanks.
> >
> > After thinking a while, since the UDP-specific behaviour only appears
> > in the redirect-to-self path,

Are you saying psock->sk == skb->sk is always true and
skb_set_owner_r() in the else case is not called for UDP ?
* Re: [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP.
  2026-03-05 17:42             ` Kuniyuki Iwashima
@ 2026-03-06  7:44               ` Jiayuan Chen
  0 siblings, 0 replies; 31+ messages in thread
From: Jiayuan Chen @ 2026-03-06  7:44 UTC (permalink / raw)
  To: Kuniyuki Iwashima, jakub
  Cc: John Fastabend, Willem de Bruijn, Kuniyuki Iwashima, bpf, netdev,
	syzbot+5b3b7e51dda1be027b7a

March 6, 2026 at 01:42, "Kuniyuki Iwashima" <kuniyu@google.com> wrote:
> On Thu, Mar 5, 2026 at 3:04 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> > March 5, 2026 at 18:45, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote:
> > > March 5, 2026 at 17:27, "Kuniyuki Iwashima" <kuniyu@google.com> wrote:
> > > > On Thu, Mar 5, 2026 at 12:30 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> > > > > March 5, 2026 at 15:48, "Kuniyuki Iwashima" <kuniyu@google.com> wrote:
> > > > > > On Wed, Mar 4, 2026 at 10:37 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> > > > > > > On Sat, Feb 21, 2026 at 11:30:53PM +0800, Kuniyuki Iwashima wrote:
> > > > [...]
> > [...]
> >
> > After thinking a while, since the UDP-specific behaviour only appears
> > in the redirect-to-self path,
>
> Are you saying psock->sk == skb->sk is always true and
> skb_set_owner_r() in the else case is not called for UDP ?
>

Before we figure out how to properly handle TCP, I'm wondering if we
can take a simpler approach for UDP like this (diff below).

This keeps the original code cleaner: two sk_is_udp() dispatch points
in sk_psock_skb_ingress() and sk_psock_verdict_apply(), both routing
to a dedicated sk_psock_skb_ingress_udp() that handles all UDP-specific
locking and accounting in one place.

As for whether sk_psock_skb_ingress_udp() should live in udp_bpf.c
instead of skmsg.c, I'm open to either way.

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ddde93dd8bc6..70876ed51b91 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -8,6 +8,7 @@
 #include <net/sock.h>
 #include <net/tcp.h>
 #include <net/tls.h>
+#include <net/udp.h>
 #include <trace/events/sock.h>
 
 static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
@@ -587,6 +588,48 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
 				     u32 off, u32 len, bool take_ref);
 
+/* UDP needs sk_receive_queue.lock to protect sk_forward_alloc from concurrent
+ * modification by udp_rmem_release(). This handles both self-redirect and
+ * cross-socket redirect for UDP.
+ */
+static int sk_psock_skb_ingress_udp(struct sk_psock *psock, struct sk_buff *skb,
+				    u32 off, u32 len, bool take_ref)
+{
+	struct sock *sk = psock->sk;
+	struct sk_msg *msg;
+	int err;
+
+	if (skb->destructor == udp_sock_rfree) {
+		msg = alloc_sk_msg(GFP_ATOMIC);
+		if (unlikely(!msg))
+			return -EAGAIN;
+		goto enqueue;
+	}
+
+	msg = alloc_sk_msg(take_ref ? GFP_KERNEL : GFP_ATOMIC);
+	if (unlikely(!msg))
+		return -EAGAIN;
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	    !sk_rmem_schedule(sk, skb, skb->truesize)) {
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		kfree(msg);
+		return -EAGAIN;
+	}
+	skb_set_owner_r(skb, sk);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+	skb->destructor = udp_sock_rfree;
+
+enqueue:
+	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg,
+					   take_ref);
+	if (err < 0)
+		kfree(msg);
+	return err;
+}
+
 static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 				u32 off, u32 len)
 {
@@ -594,6 +637,9 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	struct sk_msg *msg;
 	int err;
 
+	if (sk_is_udp(sk))
+		return sk_psock_skb_ingress_udp(psock, skb, off, len, true);
+
 	/* If we are receiving on the same sock skb->sk is already assigned,
 	 * skip memory accounting and owner transition seeing it already set
 	 * correctly.
@@ -1059,7 +1105,12 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 			off = stm->offset;
 			len = stm->full_len;
 		}
-		err = sk_psock_skb_ingress_self(psock, skb, off, len, false);
+		if (sk_is_udp(sk_other))
+			err = sk_psock_skb_ingress_udp(psock, skb,
+						       off, len, false);
+		else
+			err = sk_psock_skb_ingress_self(psock, skb,
+							off, len, false);
 	}
 	if (err < 0) {
 		spin_lock_bh(&psock->ingress_lock);
end of thread, other threads:[~2026-03-07  2:52 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-21 23:30 [PATCH v4 bpf/net 0/6] sockmap: Fix UAF and broken memory accounting for UDP Kuniyuki Iwashima
2026-02-21 23:30 ` [PATCH v4 bpf/net 1/6] sockmap: Annotate sk->sk_data_ready() " Kuniyuki Iwashima
2026-03-05 11:05   ` Jakub Sitnicki
2026-03-05 11:27   ` Jiayuan Chen
2026-02-21 23:30 ` [PATCH v4 bpf/net 2/6] sockmap: Annotate sk->sk_write_space() " Kuniyuki Iwashima
2026-03-05  1:48   ` Jiayuan Chen
2026-03-05  3:43     ` Kuniyuki Iwashima
2026-03-07  0:03       ` Martin KaFai Lau
2026-03-07  2:51         ` Kuniyuki Iwashima
2026-03-05 11:35   ` Jiayuan Chen
2026-03-05 11:51     ` Jakub Sitnicki
2026-02-21 23:30 ` [PATCH v4 bpf/net 3/6] sockmap: Fix use-after-free in udp_bpf_recvmsg() Kuniyuki Iwashima
2026-03-05  2:30   ` Jiayuan Chen
2026-03-05  3:41     ` Kuniyuki Iwashima
2026-03-05 11:36       ` Jiayuan Chen
2026-03-05 11:39   ` Jakub Sitnicki
2026-03-05 17:46     ` Kuniyuki Iwashima
2026-02-21 23:30 ` [PATCH v4 bpf/net 4/6] sockmap: Inline sk_psock_create_ingress_msg() Kuniyuki Iwashima
2026-03-05 11:44   ` Jakub Sitnicki
2026-02-21 23:30 ` [PATCH v4 bpf/net 5/6] sockmap: Consolidate sk_psock_skb_ingress_self() Kuniyuki Iwashima
2026-02-21 23:30 ` [PATCH v4 bpf/net 6/6] sockmap: Fix broken memory accounting for UDP Kuniyuki Iwashima
2026-03-04 20:04   ` Martin KaFai Lau
2026-03-04 20:14     ` Kuniyuki Iwashima
2026-03-05  6:37   ` Jiayuan Chen
2026-03-05  7:48     ` Kuniyuki Iwashima
2026-03-05  8:30       ` Jiayuan Chen
2026-03-05  9:27         ` Kuniyuki Iwashima
2026-03-05 10:45           ` Jiayuan Chen
2026-03-05 11:04           ` Jiayuan Chen
2026-03-05 17:42             ` Kuniyuki Iwashima
2026-03-06  7:44               ` Jiayuan Chen