* [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: syzbot @ 2025-07-02 13:50 UTC (permalink / raw)
To: bp, dave.hansen, hpa, linux-kernel, luto, mingo, neeraj.upadhyay,
paulmck, peterz, riel, syzkaller-bugs, tglx, x86, yury.norov
Hello,
syzbot found the following issue on:
HEAD commit: 1343433ed389 Add linux-next specific files for 20250630
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15b3e3d4580000
kernel config: https://syzkaller.appspot.com/x/.config?x=c1ce97baf6bd6397
dashboard link: https://syzkaller.appspot.com/bug?extid=084b6e5bc1016723a9c4
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15716770580000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/c3387c64e9ec/disk-1343433e.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/abf15e85d8dd/vmlinux-1343433e.xz
kernel image: https://storage.googleapis.com/syzbot-assets/081c344403bc/bzImage-1343433e.xz
The issue was bisected to:
commit a12a498a9738db65152203467820bb15b6102bd2
Author: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Date: Mon Jun 23 00:00:08 2025 +0000
smp: Don't wait for remote work done if not needed in smp_call_function_many_cond()
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15ee348c580000
final oops: https://syzkaller.appspot.com/x/report.txt?x=17ee348c580000
console output: https://syzkaller.appspot.com/x/log.txt?x=13ee348c580000
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+084b6e5bc1016723a9c4@syzkaller.appspotmail.com
Fixes: a12a498a9738 ("smp: Don't wait for remote work done if not needed in smp_call_function_many_cond()")
==================================================================
BUG: KASAN: slab-use-after-free in instrument_atomic_write include/linux/instrumented.h:82 [inline]
BUG: KASAN: slab-use-after-free in clear_bit include/asm-generic/bitops/instrumented-atomic.h:41 [inline]
BUG: KASAN: slab-use-after-free in cpumask_clear_cpu include/linux/cpumask.h:628 [inline]
BUG: KASAN: slab-use-after-free in flush_tlb_func+0x23d/0x6c0 arch/x86/mm/tlb.c:1132
Write of size 8 at addr ffff88805f1dca80 by task kworker/1:1/43
CPU: 1 UID: 0 PID: 43 Comm: kworker/1:1 Not tainted 6.16.0-rc4-next-20250630-syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Workqueue: usb_hub_wq hub_event
Call Trace:
<IRQ>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xd2/0x2b0 mm/kasan/report.c:521
kasan_report+0x118/0x150 mm/kasan/report.c:634
check_region_inline mm/kasan/generic.c:-1 [inline]
kasan_check_range+0x2b0/0x2c0 mm/kasan/generic.c:189
instrument_atomic_write include/linux/instrumented.h:82 [inline]
clear_bit include/asm-generic/bitops/instrumented-atomic.h:41 [inline]
cpumask_clear_cpu include/linux/cpumask.h:628 [inline]
flush_tlb_func+0x23d/0x6c0 arch/x86/mm/tlb.c:1132
csd_do_func kernel/smp.c:134 [inline]
__flush_smp_call_function_queue+0x370/0xaa0 kernel/smp.c:540
__sysvec_call_function_single+0xa8/0x3d0 arch/x86/kernel/smp.c:271
instr_sysvec_call_function_single arch/x86/kernel/smp.c:266 [inline]
sysvec_call_function_single+0x9e/0xc0 arch/x86/kernel/smp.c:266
</IRQ>
<TASK>
asm_sysvec_call_function_single+0x1a/0x20 arch/x86/include/asm/idtentry.h:709
RIP: 0010:console_flush_all+0x7f7/0xc40 kernel/printk/printk.c:3227
Code: 48 21 c3 0f 85 e9 01 00 00 e8 65 2d 1f 00 48 8b 5c 24 20 4d 85 f6 75 07 e8 56 2d 1f 00 eb 06 e8 4f 2d 1f 00 fb 48 8b 44 24 28 <42> 80 3c 20 00 74 08 48 89 df e8 3a 41 83 00 48 8b 1b 48 8b 44 24
RSP: 0018:ffffc90000b36fc0 EFLAGS: 00000293
RAX: 1ffffffff1d36a63 RBX: ffffffff8e9b5318 RCX: ffff88801faa9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000b37110 R08: ffffffff8fa17437 R09: 1ffffffff1f42e86
R10: dffffc0000000000 R11: fffffbfff1f42e87 R12: dffffc0000000000
R13: 0000000000000001 R14: 0000000000000200 R15: ffffffff8e9b52c0
__console_flush_and_unlock kernel/printk/printk.c:3285 [inline]
console_unlock+0xc4/0x270 kernel/printk/printk.c:3325
vprintk_emit+0x5b7/0x7a0 kernel/printk/printk.c:2450
dev_vprintk_emit+0x337/0x3f0 drivers/base/core.c:4916
dev_printk_emit+0xe0/0x130 drivers/base/core.c:4927
_dev_info+0x10a/0x160 drivers/base/core.c:4985
announce_device+0x117/0x2c0 drivers/usb/core/hub.c:2417
usb_new_device+0x4ef/0x16f0 drivers/usb/core/hub.c:2680
hub_port_connect drivers/usb/core/hub.c:5571 [inline]
hub_port_connect_change drivers/usb/core/hub.c:5711 [inline]
port_event drivers/usb/core/hub.c:5871 [inline]
hub_event+0x2941/0x4a00 drivers/usb/core/hub.c:5953
process_one_work kernel/workqueue.c:3239 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3322
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3403
kthread+0x70e/0x8a0 kernel/kthread.c:463
ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 5962:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:319 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:345
kasan_slab_alloc include/linux/kasan.h:250 [inline]
slab_post_alloc_hook mm/slub.c:4180 [inline]
slab_alloc_node mm/slub.c:4229 [inline]
kmem_cache_alloc_noprof+0x1c1/0x3c0 mm/slub.c:4236
dup_mm kernel/fork.c:1466 [inline]
copy_mm+0xdb/0x4b0 kernel/fork.c:1528
copy_process+0x1706/0x3c00 kernel/fork.c:2168
kernel_clone+0x21e/0x870 kernel/fork.c:2598
__do_sys_clone kernel/fork.c:2741 [inline]
__se_sys_clone kernel/fork.c:2725 [inline]
__x64_sys_clone+0x18b/0x1e0 kernel/fork.c:2725
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 8084:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x62/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2417 [inline]
slab_free mm/slub.c:4680 [inline]
kmem_cache_free+0x18f/0x400 mm/slub.c:4782
exit_mm+0x1da/0x2c0 kernel/exit.c:581
do_exit+0x648/0x2300 kernel/exit.c:947
do_group_exit+0x21c/0x2d0 kernel/exit.c:1100
__do_sys_exit_group kernel/exit.c:1111 [inline]
__se_sys_exit_group kernel/exit.c:1109 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1109
x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88805f1dc080
which belongs to the cache mm_struct of size 2584
The buggy address is located 2560 bytes inside of
freed 2584-byte region [ffff88805f1dc080, ffff88805f1dca98)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88805f1dcb40 pfn:0x5f1d8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff88805f612801
anon flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801a44bb40 0000000000000000 dead000000000001
raw: ffff88805f1dcb40 00000000000b0009 00000000f5000000 ffff88805f612801
head: 00fff00000000040 ffff88801a44bb40 0000000000000000 dead000000000001
head: ffff88805f1dcb40 00000000000b0009 00000000f5000000 ffff88805f612801
head: 00fff00000000003 ffffea00017c7601 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5719, tgid 5719 (dhcpcd-run-hook), ts 61767719991, free_ts 61767008311
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1848
prep_new_page mm/page_alloc.c:1856 [inline]
get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3855
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5145
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2419
alloc_slab_page mm/slub.c:2487 [inline]
allocate_slab+0x8a/0x370 mm/slub.c:2655
new_slab mm/slub.c:2709 [inline]
___slab_alloc+0xbeb/0x1410 mm/slub.c:3891
__slab_alloc mm/slub.c:3981 [inline]
__slab_alloc_node mm/slub.c:4056 [inline]
slab_alloc_node mm/slub.c:4217 [inline]
kmem_cache_alloc_noprof+0x283/0x3c0 mm/slub.c:4236
dup_mm kernel/fork.c:1466 [inline]
copy_mm+0xdb/0x4b0 kernel/fork.c:1528
copy_process+0x1706/0x3c00 kernel/fork.c:2168
kernel_clone+0x21e/0x870 kernel/fork.c:2598
__do_sys_clone kernel/fork.c:2741 [inline]
__se_sys_clone kernel/fork.c:2725 [inline]
__x64_sys_clone+0x18b/0x1e0 kernel/fork.c:2725
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5719 tgid 5719 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1392 [inline]
__free_frozen_pages+0xb80/0xd80 mm/page_alloc.c:2892
__folio_put+0x21b/0x2c0 mm/swap.c:112
put_netmem net/core/skbuff.c:7372 [inline]
skb_page_unref include/linux/skbuff_ref.h:43 [inline]
__skb_frag_unref include/linux/skbuff_ref.h:56 [inline]
skb_release_data+0x49a/0x7c0 net/core/skbuff.c:1081
skb_release_all net/core/skbuff.c:1152 [inline]
napi_consume_skb+0x158/0x1e0 net/core/skbuff.c:1480
skb_defer_free_flush net/core/dev.c:6632 [inline]
net_rx_action+0x51b/0xe30 net/core/dev.c:7625
handle_softirqs+0x286/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
common_interrupt+0xbb/0xe0 arch/x86/kernel/irq.c:285
asm_common_interrupt+0x26/0x40 arch/x86/include/asm/idtentry.h:693
Memory state around the buggy address:
ffff88805f1dc980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88805f1dca00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88805f1dca80: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88805f1dcb00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff88805f1dcb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
----------------
Code disassembly (best guess):
0: 48 21 c3 and %rax,%rbx
3: 0f 85 e9 01 00 00 jne 0x1f2
9: e8 65 2d 1f 00 call 0x1f2d73
e: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
13: 4d 85 f6 test %r14,%r14
16: 75 07 jne 0x1f
18: e8 56 2d 1f 00 call 0x1f2d73
1d: eb 06 jmp 0x25
1f: e8 4f 2d 1f 00 call 0x1f2d73
24: fb sti
25: 48 8b 44 24 28 mov 0x28(%rsp),%rax
* 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 48 89 df mov %rbx,%rdi
34: e8 3a 41 83 00 call 0x834173
39: 48 8b 1b mov (%rbx),%rbx
3c: 48 rex.W
3d: 8b .byte 0x8b
3e: 44 rex.R
3f: 24 .byte 0x24
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Rik van Riel @ 2025-07-02 15:20 UTC (permalink / raw)
To: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, tglx, x86,
yury.norov
Cc: kernel-team, David Hildenbrand, Jann Horn
On Wed, 2025-07-02 at 06:50 -0700, syzbot wrote:
>
> The issue was bisected to:
>
> commit a12a498a9738db65152203467820bb15b6102bd2
> Author: Yury Norov [NVIDIA] <yury.norov@gmail.com>
> Date: Mon Jun 23 00:00:08 2025 +0000
>
> smp: Don't wait for remote work done if not needed in
> smp_call_function_many_cond()
While that change looks like it would increase the
likelihood of hitting this issue, it does not look
like the root cause.
Instead, the stack traces below show that the
TLB flush code is being asked to flush the TLB
for an mm that is exiting.
One CPU is running the TLB flush handler, while
another CPU is freeing the mm_struct.
The CPU that sent the simultaneous TLB flush
is not visible in the stack traces below,
but we seem to have various places around the
MM where we flush the TLB for another mm,
without taking any measures to protect against
that mm being freed while the flush is ongoing.
I wonder if an easy and low-overhead way to
fix that would be to make the mm_struct RCU-freed,
and to hold rcu_read_lock() in any place
where we zap the TLB for another mm.
Are there any places where that wouldn't work?
Do we need to take refcounts on the mm_struct,
instead?
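Very roughly, that RCU variant would look something like the sketch
below. This is only an illustration of the idea, not a worked-out
patch: the mm_delayed_free rcu_head field and the free_mm_rcu() /
flush_other_mm() helpers are made up for this sketch, and it assumes
the freeing side would live next to mm_cachep in kernel/fork.c.

	/* Hypothetical: mm_struct grows an rcu_head, mm->mm_delayed_free. */
	static void __free_mm_rcu(struct rcu_head *rcu)
	{
		struct mm_struct *mm = container_of(rcu, struct mm_struct,
						    mm_delayed_free);

		kmem_cache_free(mm_cachep, mm);
	}

	/* Used instead of freeing the mm_struct directly. */
	static void free_mm_rcu(struct mm_struct *mm)
	{
		call_rcu(&mm->mm_delayed_free, __free_mm_rcu);
	}

	/*
	 * A site that flushes the TLB for another process's mm would then
	 * pin the mm_struct for the duration of the flush:
	 */
	static void flush_other_mm(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
	{
		rcu_read_lock();
		flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, true);
		rcu_read_unlock();
	}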
> [ syzbot report snipped ]
--
All Rights Reversed.
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
2025-07-02 15:20 ` Rik van Riel
@ 2025-07-02 16:53 ` Jann Horn
2025-07-02 17:00 ` Jann Horn
2025-07-02 17:09 ` [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func Rik van Riel
0 siblings, 2 replies; 13+ messages in thread
From: Jann Horn @ 2025-07-02 16:53 UTC (permalink / raw)
To: Rik van Riel
Cc: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, tglx, x86,
yury.norov, kernel-team, David Hildenbrand
On Wed, Jul 2, 2025 at 5:24 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2025-07-02 at 06:50 -0700, syzbot wrote:
> >
> > The issue was bisected to:
> >
> > commit a12a498a9738db65152203467820bb15b6102bd2
> > Author: Yury Norov [NVIDIA] <yury.norov@gmail.com>
> > Date: Mon Jun 23 00:00:08 2025 +0000
> >
> > smp: Don't wait for remote work done if not needed in
> > smp_call_function_many_cond()
>
> While that change looks like it would increase the
> likelihood of hitting this issue, it does not look
> like the root cause.
>
> Instead, the stack traces below show that the
> TLB flush code is being asked to flush the TLB
> for an mm that is exiting.
>
> One CPU is running the TLB flush handler, while
> another CPU is freeing the mm_struct.
>
> The CPU that sent the simultaneous TLB flush
> is not visible in the stack traces below,
> but we seem to have various places around the
> MM where we flush the TLB for another mm,
> without taking any measures to protect against
> that mm being freed while the flush is ongoing.
TLB flushes via IPIs on x86 are always synchronous, right?
flush_tlb_func is only referenced from native_flush_tlb_multi() in
calls to on_each_cpu_mask() (with wait=true) or
on_each_cpu_cond_mask() (with wait=1).
So I think this is not an issue, unless you're claiming that we call
native_flush_tlb_multi() with an already-freed info->mm?
And I think the bisected commit really is the buggy one: It looks at
"nr_cpus", which tracks *how many CPUs we have to IPI*, but assumes
that "nr_cpus" tracks *how many CPUs we posted work to*. Those numbers
are not the same: If we post work to a CPU that already had IPI work
pending, we just add a list entry without sending another IPI.
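To make the distinction concrete, the enqueue loop looks roughly like
this (heavily condensed from kernel/smp.c, see the diffs later in this
thread for the real code; this is a sketch, not a literal excerpt):

	for_each_cpu(cpu, cfd->cpumask) {
		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

		if (cond_func && !cond_func(cpu, info)) {
			__cpumask_clear_cpu(cpu, cfd->cpumask);
			continue;
		}

		/* From here on, work *is* posted to this remote CPU ... */
		csd_lock(csd);
		csd->func = func;
		csd->info = info;

		/*
		 * ... but nr_cpus only counts CPUs whose queue was empty,
		 * i.e. CPUs that need an IPI.  If another CPU already has
		 * an IPI in flight to this one, llist_add() returns false
		 * and nr_cpus is not incremented.
		 */
		if (llist_add(&csd->node.llist,
			      &per_cpu(call_single_queue, cpu))) {
			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
			nr_cpus++;
		}
	}

With the blamed commit, run_remote is cleared when nr_cpus ends up
being zero, so a wait=true caller can return before the csd it just
enqueued has run -- which is exactly the use-after-free in the report.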
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Jann Horn @ 2025-07-02 17:00 UTC (permalink / raw)
To: Rik van Riel, yury.norov, tglx
Cc: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, x86,
kernel-team, David Hildenbrand
On Wed, Jul 2, 2025 at 6:53 PM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Jul 2, 2025 at 5:24 PM Rik van Riel <riel@surriel.com> wrote:
> >
> > On Wed, 2025-07-02 at 06:50 -0700, syzbot wrote:
> > >
> > > The issue was bisected to:
> > >
> > > commit a12a498a9738db65152203467820bb15b6102bd2
> > > Author: Yury Norov [NVIDIA] <yury.norov@gmail.com>
> > > Date: Mon Jun 23 00:00:08 2025 +0000
> > >
> > > smp: Don't wait for remote work done if not needed in
> > > smp_call_function_many_cond()
> >
> > While that change looks like it would increase the
> > likelihood of hitting this issue, it does not look
> > like the root cause.
> >
> > Instead, the stack traces below show that the
> > TLB flush code is being asked to flush the TLB
> > for an mm that is exiting.
> >
> > One CPU is running the TLB flush handler, while
> > another CPU is freeing the mm_struct.
> >
> > The CPU that sent the simultaneous TLB flush
> > is not visible in the stack traces below,
> > but we seem to have various places around the
> > MM where we flush the TLB for another mm,
> > without taking any measures to protect against
> > that mm being freed while the flush is ongoing.
>
> TLB flushes via IPIs on x86 are always synchronous, right?
> flush_tlb_func is only referenced from native_flush_tlb_multi() in
> calls to on_each_cpu_mask() (with wait=true) or
> on_each_cpu_cond_mask() (with wait=1).
> So I think this is not an issue, unless you're claiming that we call
> native_flush_tlb_multi() with an already-freed info->mm?
>
> And I think the bisected commit really is the buggy one: It looks at
> "nr_cpus", which tracks *how many CPUs we have to IPI*, but assumes
> that "nr_cpus" tracks *how many CPUs we posted work to*. Those numbers
> are not the same: If we post work to a CPU that already had IPI work
> pending, we just add a list entry without sending another IPI.
Or in other words: After that blamed commit, if CPU 1 posts a TLB
flush to CPU 3, and then CPU 2 also quickly posts a TLB flush to CPU
3, then CPU 2 will erroneously not wait for the TLB flush to complete
before reporting flush completion, which AFAICS means we can get both
stale TLB entries and (less often) UAF.
I think the correct version of that commit would be to revert that
commit and instead just move the "run_remote = true;" line down, below
the cond_func() check.
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Rik van Riel @ 2025-07-02 17:09 UTC (permalink / raw)
To: Jann Horn
Cc: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, tglx, x86,
yury.norov, kernel-team, David Hildenbrand
On Wed, 2025-07-02 at 18:53 +0200, Jann Horn wrote:
>
> TLB flushes via IPIs on x86 are always synchronous, right?
> flush_tlb_func is only referenced from native_flush_tlb_multi() in
> calls to on_each_cpu_mask() (with wait=true) or
> on_each_cpu_cond_mask() (with wait=1).
> So I think this is not an issue, unless you're claiming that we call
> native_flush_tlb_multi() with an already-freed info->mm?
>
It looks like there are at least some cases where
try_to_unmap() can call flush_tlb_range() with
an mm that belongs to some other process.
I don't know whether that is an issue.
> And I think the bisected commit really is the buggy one: It looks at
> "nr_cpus", which tracks *how many CPUs we have to IPI*, but assumes
> that "nr_cpus" tracks *how many CPUs we posted work to*. Those
> numbers
> are not the same: If we post work to a CPU that already had IPI work
> pending, we just add a list entry without sending another IPI.
>
Good point, we do need to wait when we enqueue
work, even if we do not send an IPI anywhere!
You are right that the bisected commit is buggy
and should be reverted.
--
All Rights Reversed.
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Thomas Gleixner @ 2025-07-02 17:12 UTC (permalink / raw)
To: Jann Horn, Rik van Riel, yury.norov
Cc: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, x86,
kernel-team, David Hildenbrand
On Wed, Jul 02 2025 at 19:00, Jann Horn wrote:
> On Wed, Jul 2, 2025 at 6:53 PM Jann Horn <jannh@google.com> wrote:
>> TLB flushes via IPIs on x86 are always synchronous, right?
>> flush_tlb_func is only referenced from native_flush_tlb_multi() in
>> calls to on_each_cpu_mask() (with wait=true) or
>> on_each_cpu_cond_mask() (with wait=1).
>> So I think this is not an issue, unless you're claiming that we call
>> native_flush_tlb_multi() with an already-freed info->mm?
>>
>> And I think the bisected commit really is the buggy one: It looks at
>> "nr_cpus", which tracks *how many CPUs we have to IPI*, but assumes
>> that "nr_cpus" tracks *how many CPUs we posted work to*. Those numbers
>> are not the same: If we post work to a CPU that already had IPI work
>> pending, we just add a list entry without sending another IPI.
>
> Or in other words: After that blamed commit, if CPU 1 posts a TLB
> flush to CPU 3, and then CPU 2 also quickly posts a TLB flush to CPU
> 3, then CPU 2 will erroneously not wait for the TLB flush to complete
> before reporting flush completion, which AFAICS means we can get both
> stale TLB entries and (less often) UAF.
Right you are. Well analyzed and I missed it when taking the lot.
> I think the correct version of that commit would be to revert that
> commit and instead just move the "run_remote = true;" line down, below
> the cond_func() check.
I'll remove it from the relevant tip branch.
Thanks,
tglx
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Jann Horn @ 2025-07-02 17:23 UTC (permalink / raw)
To: Rik van Riel
Cc: syzbot, bp, dave.hansen, hpa, linux-kernel, luto, mingo,
neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, tglx, x86,
yury.norov, kernel-team, David Hildenbrand, Lorenzo Stoakes
On Wed, Jul 2, 2025 at 7:17 PM Rik van Riel <riel@surriel.com> wrote:
> On Wed, 2025-07-02 at 18:53 +0200, Jann Horn wrote:
> > TLB flushes via IPIs on x86 are always synchronous, right?
> > flush_tlb_func is only referenced from native_flush_tlb_multi() in
> > calls to on_each_cpu_mask() (with wait=true) or
> > on_each_cpu_cond_mask() (with wait=1).
> > So I think this is not an issue, unless you're claiming that we call
> > native_flush_tlb_multi() with an already-freed info->mm?
> >
> It looks like there are at least some cases where
> try_to_unmap() can call flush_tlb_range() with
> an mm that belongs to some other process.
>
> I don't know whether that is an issue.
try_to_unmap() relies on read-locking either the anon_vma (for
anonymous pages) or the address_space (for file pages) throughout the
entire rmap walk to ensure that the list of VMAs attached to the
anon_vma/address_space stays stable during the operation, which
guarantees that those VMAs can't go away, which guarantees that the
associated MMs can't go away.
If the caller passes in TTU_RMAP_LOCKED, they promise that they've
already taken care of this rmap locking; otherwise, rmap_walk() will
do it internally.
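As a sketch of that contract (illustrative only, not a literal excerpt
of mm/rmap.c; unmap_one_folio() is a made-up helper, and it assumes
the folio is already locked, as try_to_unmap() callers must ensure):

	#include <linux/rmap.h>

	static void unmap_one_folio(struct folio *folio,
				    struct anon_vma *anon_vma,
				    bool caller_holds_rmap_lock)
	{
		if (caller_holds_rmap_lock) {
			/*
			 * The rmap lock is held across the whole walk, so
			 * every VMA the walk reaches -- and therefore every
			 * mm_struct behind those VMAs -- stays alive until
			 * the lock is dropped, including across the remote
			 * TLB flushes issued while unmapping.
			 */
			anon_vma_lock_read(anon_vma);
			try_to_unmap(folio, TTU_RMAP_LOCKED);
			anon_vma_unlock_read(anon_vma);
		} else {
			/*
			 * Without TTU_RMAP_LOCKED, rmap_walk() takes and
			 * drops the same lock internally, which gives the
			 * same guarantee.
			 */
			try_to_unmap(folio, 0);
		}
	}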
* Re: [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func
From: Yury Norov @ 2025-07-02 17:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jann Horn, Rik van Riel, syzbot, bp, dave.hansen, hpa,
linux-kernel, luto, mingo, neeraj.upadhyay, paulmck, peterz,
syzkaller-bugs, x86, kernel-team, David Hildenbrand
On Wed, Jul 02, 2025 at 07:12:31PM +0200, Thomas Gleixner wrote:
> On Wed, Jul 02 2025 at 19:00, Jann Horn wrote:
> > On Wed, Jul 2, 2025 at 6:53 PM Jann Horn <jannh@google.com> wrote:
> >> TLB flushes via IPIs on x86 are always synchronous, right?
> >> flush_tlb_func is only referenced from native_flush_tlb_multi() in
> >> calls to on_each_cpu_mask() (with wait=true) or
> >> on_each_cpu_cond_mask() (with wait=1).
> >> So I think this is not an issue, unless you're claiming that we call
> >> native_flush_tlb_multi() with an already-freed info->mm?
> >>
> >> And I think the bisected commit really is the buggy one: It looks at
> >> "nr_cpus", which tracks *how many CPUs we have to IPI*, but assumes
> >> that "nr_cpus" tracks *how many CPUs we posted work to*. Those numbers
> >> are not the same: If we post work to a CPU that already had IPI work
> >> pending, we just add a list entry without sending another IPI.
> >
> > Or in other words: After that blamed commit, if CPU 1 posts a TLB
> > flush to CPU 3, and then CPU 2 also quickly posts a TLB flush to CPU
> > 3, then CPU 2 will erroneously not wait for the TLB flush to complete
> > before reporting flush completion, which AFAICS means we can get both
> > stale TLB entries and (less often) UAF.
>
> Right you are. Well analyzed and I missed it when taking the lot.
>
> > I think the correct version of that commit would be to revert that
> > commit and instead just move the "run_remote = true;" line down, below
> > the cond_func() check.
>
> I'll remove it from the relevant tip branch.
Thank you guys for explaining that and sorry for the buggy patch.
I was actually under the impression that run_remote duplicates nr_cpus != 0,
and I even have a patch that removes run_remote.
Maybe it's worth adding a comment on what run_remote and nr_cpus track?
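Something like this, perhaps (just a sketch of what such a comment
could say):

	/*
	 * run_remote: at least one csd was enqueued on a remote CPU, so a
	 *             wait=true caller must wait for remote completion.
	 * nr_cpus:    number of remote CPUs whose call_single_queue was
	 *             empty and therefore need to be kicked with an IPI.
	 */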
Thanks,
Yury
* [PATCH] smp: Wait for enqueued work regardless of IPI sent
From: Rik van Riel @ 2025-07-02 17:59 UTC (permalink / raw)
To: Yury Norov
Cc: Thomas Gleixner, Jann Horn, syzbot, bp, dave.hansen, hpa,
linux-kernel, luto, mingo, neeraj.upadhyay, paulmck, peterz,
syzkaller-bugs, x86, kernel-team, David Hildenbrand
On Wed, 2 Jul 2025 13:44:34 -0400
Yury Norov <yury.norov@gmail.com> wrote:
> Thank you guys for explaining that and sorry for the buggy patch.
> I was actually under the impression that run_remote duplicates nr_cpus != 0,
> and I even have a patch that removes run_remote.
>
> Maybe it's worth adding a comment on what run_remote and nr_cpus track?
This thread did surface some useful content, and Jann also pointed out
a good optimization that can be made, by not setting run_remote if
"func" tells us to skip remote CPUs.
Thomas, please let me know if you already reverted Yury's patch,
and want me to re-send this without the last hunk.
---8<---
From 2ae6417fa7ce16f1bfa574cbabba572436adbed9 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 2 Jul 2025 13:52:54 -0400
Subject: [PATCH] smp: Wait for enqueued work regardless of IPI sent
Whenever work is enqueued with a remote CPU, smp_call_function_many_cond()
may need to wait for that work to be completed, regardless of whether or
not the remote CPU needed to be woken up with an IPI, or the work was
being added to the queue of an already woken up CPU.
However, if no work is enqueued with a remote CPU, because "func"
told us to skip all CPUs, do not wait.
Document the difference between "work enqueued", and "CPU needs to be
woken up"
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: syzbot+084b6e5bc1016723a9c4@syzkaller.appspotmail.com
Fixes: a12a498a9738 ("smp: Don't wait for remote work done if not needed in smp_call_function_many_cond()")
---
kernel/smp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 84561258cd22..c5e1da7a88da 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -802,7 +802,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
/* Check if we need remote execution, i.e., any CPU excluding this one. */
if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
- run_remote = true;
cfd = this_cpu_ptr(&cfd_data);
cpumask_and(cfd->cpumask, mask, cpu_online_mask);
__cpumask_clear_cpu(this_cpu, cfd->cpumask);
@@ -816,6 +815,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
continue;
}
+ /* Work is enqueued on a remote CPU. */
+ run_remote = true;
+
csd_lock(csd);
if (wait)
csd->node.u_flags |= CSD_TYPE_SYNC;
@@ -827,6 +829,10 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
#endif
trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+ /*
+ * Kick the remote CPU if this is the first work
+ * item enqueued.
+ */
if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
nr_cpus++;
@@ -843,8 +849,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
send_call_function_single_ipi(last_cpu);
else if (likely(nr_cpus > 1))
send_call_function_ipi_mask(cfd->cpumask_ipi);
- else
- run_remote = false;
}
/* Check if we need local execution. */
--
2.49.0
* Re: [PATCH] smp: Wait for enqueued work regardless of IPI sent
From: Yury Norov @ 2025-07-03 13:52 UTC (permalink / raw)
To: Rik van Riel
Cc: Thomas Gleixner, Jann Horn, syzbot, bp, dave.hansen, hpa,
linux-kernel, luto, mingo, neeraj.upadhyay, paulmck, peterz,
syzkaller-bugs, x86, kernel-team, David Hildenbrand
On Wed, Jul 02, 2025 at 01:59:54PM -0400, Rik van Riel wrote:
> On Wed, 2 Jul 2025 13:44:34 -0400
> Yury Norov <yury.norov@gmail.com> wrote:
>
> > Thank you guys for explaining that and sorry for the buggy patch.
> > I was actually under the impression that run_remote duplicates nr_cpus != 0,
> > and I even have a patch that removes run_remote.
> >
> > Maybe it's worth adding a comment on what run_remote and nr_cpus track?
>
> This thread did surface some useful content, and Jann also pointed out
> a good optimization that can be made, by not setting run_remote if
> "func" tells us to skip remote CPUs.
>
> Thomas, please let me know if you already reverted Yury's patch,
> and want me to re-send this without the last hunk.
>
> ---8<---
> From 2ae6417fa7ce16f1bfa574cbabba572436adbed9 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@surriel.com>
> Date: Wed, 2 Jul 2025 13:52:54 -0400
> Subject: [PATCH] smp: Wait for enqueued work regardless of IPI sent
>
> Whenever work is enqueued with a remote CPU, smp_call_function_many_cond()
> may need to wait for that work to be completed, regardless of whether or
> not the remote CPU needed to be woken up with an IPI, or the work was
> being added to the queue of an already woken up CPU.
>
> However, if no work is enqueued with a remote CPU, because "func"
> told us to skip all CPUs, do not wait.
>
> Document the difference between "work enqueued", and "CPU needs to be
> woken up"
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Suggested-by: Jann Horn <jannh@google.com>
> Reported-by: syzbot+084b6e5bc1016723a9c4@syzkaller.appspotmail.com
> Fixes: a12a498a9738 ("smp: Don't wait for remote work done if not needed in smp_call_function_many_cond()")
> ---
> kernel/smp.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 84561258cd22..c5e1da7a88da 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -802,7 +802,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>
> /* Check if we need remote execution, i.e., any CPU excluding this one. */
> if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
> - run_remote = true;
> cfd = this_cpu_ptr(&cfd_data);
> cpumask_and(cfd->cpumask, mask, cpu_online_mask);
> __cpumask_clear_cpu(this_cpu, cfd->cpumask);
> @@ -816,6 +815,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> continue;
> }
>
> + /* Work is enqueued on a remote CPU. */
> + run_remote = true;
> +
I actually ended up with the same on my scratch branch:
https://github.com/norov/linux/commit/8a32ca4b60dc68ac54f3b70b4be7a5863dc3934e
So,
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
> csd_lock(csd);
> if (wait)
> csd->node.u_flags |= CSD_TYPE_SYNC;
> @@ -827,6 +829,10 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> #endif
> trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
>
> + /*
> + * Kick the remote CPU if this is the first work
> + * item enqueued.
> + */
> if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
> __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> nr_cpus++;
> @@ -843,8 +849,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> send_call_function_single_ipi(last_cpu);
> else if (likely(nr_cpus > 1))
> send_call_function_ipi_mask(cfd->cpumask_ipi);
> - else
> - run_remote = false;
> }
>
> /* Check if we need local execution. */
> --
> 2.49.0
>
* Re: [PATCH] smp: Wait for enqueued work regardless of IPI sent
From: Thomas Gleixner @ 2025-07-03 16:56 UTC (permalink / raw)
To: Rik van Riel, Yury Norov
Cc: Jann Horn, syzbot, bp, dave.hansen, hpa, linux-kernel, luto,
mingo, neeraj.upadhyay, paulmck, peterz, syzkaller-bugs, x86,
kernel-team, David Hildenbrand
On Wed, Jul 02 2025 at 13:59, Rik van Riel wrote:
> Thomas, please let me know if you already reverted Yury's patch,
> and want me to re-send this without the last hunk.
I did so immediately after saying so in my previous reply. It's gone in
tip and next.
Thanks,
tglx
* [PATCH v2] smp: Wait for enqueued work regardless of IPI sent
From: Rik van Riel @ 2025-07-04 0:30 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Yury Norov, Jann Horn, syzbot, bp, dave.hansen, hpa, linux-kernel,
luto, mingo, neeraj.upadhyay, paulmck, peterz, syzkaller-bugs,
x86, kernel-team, David Hildenbrand
On Thu, 03 Jul 2025 18:56:11 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, Jul 02 2025 at 13:59, Rik van Riel wrote:
> > Thomas, please let me know if you already reverted Yury's patch,
> > and want me to re-send this without the last hunk.
>
> I did so immediately after saying so in my previous reply. It's gone in
> tip and next.
Here is v2 of the patch, with the last hunk removed, and
the changelog adjusted to match the new context.
---8<---
From 2ae6417fa7ce16f1bfa574cbabba572436adbed9 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 2 Jul 2025 13:52:54 -0400
Subject: [PATCH] smp: Wait only if work was enqueued
Whenever work is enqueued with a remote CPU, smp_call_function_many_cond()
may need to wait for that work to be completed. However, if no work is
enqueued with a remote CPU, because "func" told us to skip all CPUs,
there is no need to wait.
Set run_remote only if work was enqueued on remote CPUs.
Document the difference between "work enqueued", and "CPU needs to be
woken up"
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
---
kernel/smp.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 84561258cd22..c5e1da7a88da 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -802,7 +802,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
/* Check if we need remote execution, i.e., any CPU excluding this one. */
if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
- run_remote = true;
cfd = this_cpu_ptr(&cfd_data);
cpumask_and(cfd->cpumask, mask, cpu_online_mask);
__cpumask_clear_cpu(this_cpu, cfd->cpumask);
@@ -816,6 +815,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
continue;
}
+ /* Work is enqueued on a remote CPU. */
+ run_remote = true;
+
csd_lock(csd);
if (wait)
csd->node.u_flags |= CSD_TYPE_SYNC;
@@ -827,6 +829,10 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
#endif
trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+ /*
+ * Kick the remote CPU if this is the first work
+ * item enqueued.
+ */
if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
nr_cpus++;
--
2.49.0
* [tip: smp/core] smp: Wait only if work was enqueued
From: tip-bot2 for Rik van Riel @ 2025-07-06 10:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: Jann Horn, Rik van Riel, Thomas Gleixner, Yury Norov (NVIDIA),
x86, linux-kernel
The following commit has been merged into the smp/core branch of tip:
Commit-ID: 946a7281982530d333eaee62bd1726f25908b3a9
Gitweb: https://git.kernel.org/tip/946a7281982530d333eaee62bd1726f25908b3a9
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Wed, 02 Jul 2025 13:52:54 -04:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sun, 06 Jul 2025 11:57:39 +02:00
smp: Wait only if work was enqueued
Whenever work is enqueued for a remote CPU, smp_call_function_many_cond()
may need to wait for that work to be completed. However, if no work is
enqueued for a remote CPU, because the condition func() evaluated to false
for all CPUs, there is no need to wait.
Set run_remote only if work was enqueued on remote CPUs.
Document the difference between "work enqueued", and "CPU needs to be
woken up"
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://lore.kernel.org/all/20250703203019.11331ac3@fangorn
---
kernel/smp.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 99d1fd0..c5e1da7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -802,7 +802,6 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
/* Check if we need remote execution, i.e., any CPU excluding this one. */
if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
- run_remote = true;
cfd = this_cpu_ptr(&cfd_data);
cpumask_and(cfd->cpumask, mask, cpu_online_mask);
__cpumask_clear_cpu(this_cpu, cfd->cpumask);
@@ -816,6 +815,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
continue;
}
+ /* Work is enqueued on a remote CPU. */
+ run_remote = true;
+
csd_lock(csd);
if (wait)
csd->node.u_flags |= CSD_TYPE_SYNC;
@@ -827,6 +829,10 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
#endif
trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+ /*
+ * Kick the remote CPU if this is the first work
+ * item enqueued.
+ */
if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
nr_cpus++;
Thread overview: 13+ messages:
2025-07-02 13:50 [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func syzbot
2025-07-02 15:20 ` Rik van Riel
2025-07-02 16:53 ` Jann Horn
2025-07-02 17:00 ` Jann Horn
2025-07-02 17:12 ` Thomas Gleixner
2025-07-02 17:44 ` Yury Norov
2025-07-02 17:59 ` [PATCH] smp: Wait for enqueued work regardless of IPI sent Rik van Riel
2025-07-03 13:52 ` Yury Norov
2025-07-03 16:56 ` Thomas Gleixner
2025-07-04 0:30 ` [PATCH v2] " Rik van Riel
2025-07-06 10:01 ` [tip: smp/core] smp: Wait only if work was enqueued tip-bot2 for Rik van Riel
2025-07-02 17:09 ` [syzbot] [kernel?] KASAN: slab-use-after-free Write in flush_tlb_func Rik van Riel
2025-07-02 17:23 ` Jann Horn