* [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. @ 2025-07-20 23:56 James Flowers 2025-07-21 7:52 ` Philipp Stanner 0 siblings, 1 reply; 17+ messages in thread From: James Flowers @ 2025-07-20 23:56 UTC (permalink / raw) To: matthew.brost, dakr, phasta, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: James Flowers, dri-devel, linux-kernel, linux-kernel-mentees Fixes an issue where entities are added to the run queue in drm_sched_rq_update_fifo_locked after being killed, causing a slab-use-after-free error. Signed-off-by: James Flowers <bold.zone2373@fastmail.com> --- This issue was detected by syzkaller running on a Steam Deck OLED. Unfortunately I don't have a reproducer for it. I've included the KASAN reports below: ================================================================== BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0-flowejam-+ #1 Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 print_report+0xfc/0x1ff mm/kasan/report.c:521 kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 rb_next+0xda/0x160 lib/rbtree.c:505 drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/sched_main.c:332 [inline] [gpu_sched] drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/sched_main.c:1081 [gpu_sched] drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/sched_main.c:1206 [gpu_sched] process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x734/0x1060 kernel/workqueue.c:3400 kthread+0x3fd/0x810 kernel/kthread.c:464 ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Allocated by task 73472: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 do_dentry_open+0x743/0x1bf0 fs/open.c:956 vfs_open+0x87/0x3f0 fs/open.c:1086 do_open+0x72f/0xf80 fs/namei.c:3830 path_openat+0x2ec/0x770 fs/namei.c:3989 do_filp_open+0x1ff/0x420 fs/namei.c:4016 do_sys_openat2+0x181/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x149/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 73472: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 poison_slab_object 
mm/kasan/common.c:247 [inline] __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4609 [inline] kfree+0x14f/0x4d0 mm/slub.c:4757 amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 get_signal+0x1be/0x19d0 kernel/signal.c:2809 arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e The buggy address belongs to the object at ffff888180508000 The buggy address is located 1504 bytes inside of The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180508 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 head: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 head: 0017ffffc0000003 ffffea0006014201 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888180508480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888180508500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888180508580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888180508600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888180508680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== ================================================================== BUG: KASAN: slab-use-after-free in rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] BUG: KASAN: slab-use-after-free in rb_erase+0x157c/0x1b10 lib/rbtree.c:443 Write of size 8 at addr ffff88816414c5d0 by task syz.2.3004/12376 CPU: 7 UID: 65534 PID: 12376 Comm: syz.2.3004 Not tainted 6.14.0-flowejam-+ #1 Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 print_report+0xfc/0x1ff mm/kasan/report.c:521 kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] rb_erase+0x157c/0x1b10 lib/rbtree.c:443 rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] drm_sched_rq_remove_fifo_locked 
drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x841/0xf60 kernel/exit.c:938 do_group_exit+0xda/0x2b0 kernel/exit.c:1087 get_signal+0x171f/0x19d0 kernel/signal.c:3036 arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f2d90da36ed Code: Unable to access opcode bytes at 0x7f2d90da36c3. RSP: 002b:00007f2d91b710d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: 0000000000000000 RBX: 00007f2d90fe6088 RCX: 00007f2d90da36ed RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f2d90fe6088 RBP: 00007f2d90fe6080 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d90fe608c R13: 0000000000000000 R14: 0000000000000002 R15: 00007ffc34a67bd0 </TASK> Allocated by task 12381: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 do_dentry_open+0x743/0x1bf0 fs/open.c:956 vfs_open+0x87/0x3f0 fs/open.c:1086 do_open+0x72f/0xf80 fs/namei.c:3830 path_openat+0x2ec/0x770 fs/namei.c:3989 do_filp_open+0x1ff/0x420 fs/namei.c:4016 do_sys_openat2+0x181/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x149/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 12381: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 poison_slab_object 
mm/kasan/common.c:247 [inline] __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4609 [inline] kfree+0x14f/0x4d0 mm/slub.c:4757 amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 get_signal+0x1be/0x19d0 kernel/signal.c:2809 arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e The buggy address belongs to the object at ffff88816414c000 The buggy address is located 1488 bytes inside of The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x164148 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 head: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 head: 0017ffffc0000003 ffffea0005905201 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88816414c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88816414c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88816414c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88816414c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88816414c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== ================================================================== BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] BUG: KASAN: slab-use-after-free in rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 Read of size 8 at addr ffff88812ebcc5e0 by task syz.1.814/6553 CPU: 0 UID: 65534 PID: 6553 Comm: syz.1.814 Not tainted 6.14.0-flowejam-+ #1 Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 print_report+0xfc/0x1ff mm/kasan/report.c:521 kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] 
drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fd23eba36ed Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc2943a358 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 00007ffc2943a428 RCX: 00007fd23eba36ed RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007fd23ede7ba0 R08: 0000000000000001 R09: 0000000c00000000 R10: 00007fd23ea00000 R11: 0000000000000246 R12: 00007fd23ede5fac R13: 00007fd23ede5fa0 R14: 0000000000059ad1 R15: 0000000000059a8e </TASK> Allocated by task 6559: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 do_dentry_open+0x743/0x1bf0 fs/open.c:956 vfs_open+0x87/0x3f0 fs/open.c:1086 do_open+0x72f/0xf80 fs/namei.c:3830 path_openat+0x2ec/0x770 fs/namei.c:3989 do_filp_open+0x1ff/0x420 fs/namei.c:4016 do_sys_openat2+0x181/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x149/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 6559: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4609 [inline] 
kfree+0x14f/0x4d0 mm/slub.c:4757 amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 get_signal+0x1be/0x19d0 kernel/signal.c:2809 arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e The buggy address belongs to the object at ffff88812ebcc000 The buggy address is located 1504 bytes inside of The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12ebc8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 head: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 head: 0017ffffc0000003 ffffea0004baf201 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88812ebcc480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812ebcc500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88812ebcc580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88812ebcc600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812ebcc680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== ================================================================== BUG: KASAN: slab-use-after-free in drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] BUG: KASAN: slab-use-after-free in rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] BUG: KASAN: slab-use-after-free in drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] Read of size 8 at addr ffff8881208445c8 by task syz.1.49115/146644 CPU: 7 UID: 65534 PID: 146644 Comm: syz.1.49115 Not tainted 6.14.0-flowejam-+ #1 Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 print_report+0xfc/0x1ff mm/kasan/report.c:521 kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] drm_sched_entity_push_job+0x509/0x5d0 drivers/gpu/drm/scheduler/sched_entity.c:623 [gpu_sched] amdgpu_job_submit+0x1a4/0x270 
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:314 [amdgpu] amdgpu_vm_sdma_commit+0x1f9/0x7d0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c:122 [amdgpu] amdgpu_vm_pt_clear+0x540/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:422 [amdgpu] amdgpu_vm_init+0x9c2/0x12f0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2609 [amdgpu] amdgpu_driver_open_kms+0x274/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1418 [amdgpu] drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 do_dentry_open+0x743/0x1bf0 fs/open.c:956 vfs_open+0x87/0x3f0 fs/open.c:1086 do_open+0x72f/0xf80 fs/namei.c:3830 path_openat+0x2ec/0x770 fs/namei.c:3989 do_filp_open+0x1ff/0x420 fs/namei.c:4016 do_sys_openat2+0x181/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x149/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7feb303a36ed Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007feb3123c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 00007feb305e5fa0 RCX: 00007feb303a36ed RDX: 0000000000000002 RSI: 0000200000000140 RDI: ffffffffffffff9c RBP: 00007feb30447722 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007feb305e5fa0 R15: 00007ffcfd0a3460 </TASK> Allocated by task 146638: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 do_dentry_open+0x743/0x1bf0 fs/open.c:956 vfs_open+0x87/0x3f0 fs/open.c:1086 do_open+0x72f/0xf80 fs/namei.c:3830 path_openat+0x2ec/0x770 fs/namei.c:3989 do_filp_open+0x1ff/0x420 fs/namei.c:4016 do_sys_openat2+0x181/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x149/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 146638: kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 kasan_save_track+0x14/0x30 mm/kasan/common.c:68 kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4609 [inline] kfree+0x14f/0x4d0 mm/slub.c:4757 amdgpu_driver_postclose_kms+0x43d/0x6b0 
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 __fput+0x402/0xb50 fs/file_table.c:464 task_work_run+0x155/0x250 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x76/0x7e The buggy address belongs to the object at ffff888120844000 The buggy address is located 1480 bytes inside of The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120840 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 head: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 head: 0017ffffc0000003 ffffea0004821001 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888120844480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888120844500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888120844580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888120844600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888120844680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index bfea608a7106..997a2cc1a635 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, entity->oldest_job_waiting = ts; - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, - drm_sched_entity_compare_before); + if (!entity->stopped) { + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, + drm_sched_entity_compare_before); + } } /** -- 2.49.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-20 23:56 [PATCH] drm/sched: Prevent stopped entities from being added to the run queue James Flowers @ 2025-07-21 7:52 ` Philipp Stanner 2025-07-21 8:16 ` Philipp Stanner 2025-08-14 10:42 ` Tvrtko Ursulin 0 siblings, 2 replies; 17+ messages in thread From: Philipp Stanner @ 2025-07-21 7:52 UTC (permalink / raw) To: James Flowers, matthew.brost, dakr, phasta, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin +Cc Tvrtko, who's currently reworking FIFO and RR. On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > Fixes an issue where entities are added to the run queue in > drm_sched_rq_update_fifo_locked after being killed, causing a > slab-use-after-free error. > > Signed-off-by: James Flowers <bold.zone2373@fastmail.com> > --- > This issue was detected by syzkaller running on a Steam Deck OLED. > Unfortunately I don't have a reproducer for it. I've Well, now that's kind of an issue – if you don't have a reproducer, how can you know that your patch is correct? How can we? It would certainly be good to know what the fuzz testing framework does. > included the KASAN reports below: Anyways, KASAN reports look interesting. But those might be many different issues. Again, would be good to know what the fuzzer has been testing. Can you maybe split this fuzz test into sub-tests? I suspsect those might be different faults. Anyways, taking a first look… > > ================================================================== > BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 > Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 > CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0-flowejam-+ #1 > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > print_report+0xfc/0x1ff mm/kasan/report.c:521 > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > rb_next+0xda/0x160 lib/rbtree.c:505 > drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/sched_main.c:332 [inline] [gpu_sched] > drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/sched_main.c:1081 [gpu_sched] > drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/sched_main.c:1206 [gpu_sched] > process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 > process_scheduled_works kernel/workqueue.c:3319 [inline] > worker_thread+0x734/0x1060 kernel/workqueue.c:3400 > kthread+0x3fd/0x810 kernel/kthread.c:464 > ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 > </TASK> > Allocated by task 73472: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > 
drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > vfs_open+0x87/0x3f0 fs/open.c:1086 > do_open+0x72f/0xf80 fs/namei.c:3830 > path_openat+0x2ec/0x770 fs/namei.c:3989 > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > do_sys_open fs/open.c:1443 [inline] > __do_sys_openat fs/open.c:1459 [inline] > __se_sys_openat fs/open.c:1454 [inline] > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > Freed by task 73472: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > poison_slab_object mm/kasan/common.c:247 [inline] > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > kasan_slab_free include/linux/kasan.h:233 [inline] > slab_free_hook mm/slub.c:2353 [inline] > slab_free mm/slub.c:4609 [inline] > kfree+0x14f/0x4d0 mm/slub.c:4757 > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > The buggy address belongs to the object at ffff888180508000 > The buggy address is located 1504 bytes inside of > The buggy address belongs to the physical page: > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180508 > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 > raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000003 ffffea0006014201 ffffffffffffffff 0000000000000000 > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > Memory state around the buggy address: > ffff888180508480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888180508500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888180508580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff888180508600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888180508680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ================================================================== > ================================================================== > BUG: KASAN: slab-use-after-free in rb_set_parent_color 
include/linux/rbtree_augmented.h:191 [inline] > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] > BUG: KASAN: slab-use-after-free in rb_erase+0x157c/0x1b10 lib/rbtree.c:443 > Write of size 8 at addr ffff88816414c5d0 by task syz.2.3004/12376 > CPU: 7 UID: 65534 PID: 12376 Comm: syz.2.3004 Not tainted 6.14.0-flowejam-+ #1 > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > print_report+0xfc/0x1ff mm/kasan/report.c:521 > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] > __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] > rb_erase+0x157c/0x1b10 lib/rbtree.c:443 > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > exit_task_work include/linux/task_work.h:40 [inline] > do_exit+0x841/0xf60 kernel/exit.c:938 > do_group_exit+0xda/0x2b0 kernel/exit.c:1087 > get_signal+0x171f/0x19d0 kernel/signal.c:3036 > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7f2d90da36ed > Code: Unable to access opcode bytes at 0x7f2d90da36c3. 
> RSP: 002b:00007f2d91b710d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca > RAX: 0000000000000000 RBX: 00007f2d90fe6088 RCX: 00007f2d90da36ed > RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f2d90fe6088 > RBP: 00007f2d90fe6080 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d90fe608c > R13: 0000000000000000 R14: 0000000000000002 R15: 00007ffc34a67bd0 > </TASK> > Allocated by task 12381: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > vfs_open+0x87/0x3f0 fs/open.c:1086 > do_open+0x72f/0xf80 fs/namei.c:3830 > path_openat+0x2ec/0x770 fs/namei.c:3989 > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > do_sys_open fs/open.c:1443 [inline] > __do_sys_openat fs/open.c:1459 [inline] > __se_sys_openat fs/open.c:1454 [inline] > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > Freed by task 12381: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > poison_slab_object mm/kasan/common.c:247 [inline] > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > kasan_slab_free include/linux/kasan.h:233 [inline] > slab_free_hook mm/slub.c:2353 [inline] > slab_free mm/slub.c:4609 [inline] > kfree+0x14f/0x4d0 mm/slub.c:4757 > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > The buggy address belongs to the object at ffff88816414c000 > The buggy address is located 1488 bytes inside of > The buggy address belongs to the physical page: > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x164148 > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 > 
raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000003 ffffea0005905201 ffffffffffffffff 0000000000000000 > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > Memory state around the buggy address: > ffff88816414c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88816414c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88816414c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff88816414c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88816414c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ================================================================== > ================================================================== > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] > BUG: KASAN: slab-use-after-free in rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 > Read of size 8 at addr ffff88812ebcc5e0 by task syz.1.814/6553 > CPU: 0 UID: 65534 PID: 6553 Comm: syz.1.814 Not tainted 6.14.0-flowejam-+ #1 > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > print_report+0xfc/0x1ff mm/kasan/report.c:521 > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] > rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7fd23eba36ed > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > RSP: 
002b:00007ffc2943a358 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 > RAX: 0000000000000000 RBX: 00007ffc2943a428 RCX: 00007fd23eba36ed > RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 > RBP: 00007fd23ede7ba0 R08: 0000000000000001 R09: 0000000c00000000 > R10: 00007fd23ea00000 R11: 0000000000000246 R12: 00007fd23ede5fac > R13: 00007fd23ede5fa0 R14: 0000000000059ad1 R15: 0000000000059a8e > </TASK> > Allocated by task 6559: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > vfs_open+0x87/0x3f0 fs/open.c:1086 > do_open+0x72f/0xf80 fs/namei.c:3830 > path_openat+0x2ec/0x770 fs/namei.c:3989 > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > do_sys_open fs/open.c:1443 [inline] > __do_sys_openat fs/open.c:1459 [inline] > __se_sys_openat fs/open.c:1454 [inline] > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > Freed by task 6559: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > poison_slab_object mm/kasan/common.c:247 [inline] > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > kasan_slab_free include/linux/kasan.h:233 [inline] > slab_free_hook mm/slub.c:2353 [inline] > slab_free mm/slub.c:4609 [inline] > kfree+0x14f/0x4d0 mm/slub.c:4757 > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > The buggy address belongs to the object at ffff88812ebcc000 > The buggy address is located 1504 bytes inside of > The buggy address belongs to the physical page: > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12ebc8 > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 > raw: 
0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000003 ffffea0004baf201 ffffffffffffffff 0000000000000000 > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > Memory state around the buggy address: > ffff88812ebcc480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88812ebcc500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88812ebcc580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff88812ebcc600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88812ebcc680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ================================================================== > ================================================================== > BUG: KASAN: slab-use-after-free in drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] > BUG: KASAN: slab-use-after-free in rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] > BUG: KASAN: slab-use-after-free in drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] > Read of size 8 at addr ffff8881208445c8 by task syz.1.49115/146644 > CPU: 7 UID: 65534 PID: 146644 Comm: syz.1.49115 Not tainted 6.14.0-flowejam-+ #1 > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > print_report+0xfc/0x1ff mm/kasan/report.c:521 > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] > rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] > drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] > drm_sched_entity_push_job+0x509/0x5d0 drivers/gpu/drm/scheduler/sched_entity.c:623 [gpu_sched] This might be a race between entity killing and the push_job. 
Let's look at your patch below… > amdgpu_job_submit+0x1a4/0x270 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:314 [amdgpu] > amdgpu_vm_sdma_commit+0x1f9/0x7d0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c:122 [amdgpu] > amdgpu_vm_pt_clear+0x540/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:422 [amdgpu] > amdgpu_vm_init+0x9c2/0x12f0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2609 [amdgpu] > amdgpu_driver_open_kms+0x274/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1418 [amdgpu] > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > vfs_open+0x87/0x3f0 fs/open.c:1086 > do_open+0x72f/0xf80 fs/namei.c:3830 > path_openat+0x2ec/0x770 fs/namei.c:3989 > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > do_sys_open fs/open.c:1443 [inline] > __do_sys_openat fs/open.c:1459 [inline] > __se_sys_openat fs/open.c:1454 [inline] > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7feb303a36ed > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007feb3123c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > RAX: ffffffffffffffda RBX: 00007feb305e5fa0 RCX: 00007feb303a36ed > RDX: 0000000000000002 RSI: 0000200000000140 RDI: ffffffffffffff9c > RBP: 00007feb30447722 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > R13: 0000000000000001 R14: 00007feb305e5fa0 R15: 00007ffcfd0a3460 > </TASK> > Allocated by task 146638: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > vfs_open+0x87/0x3f0 fs/open.c:1086 > do_open+0x72f/0xf80 fs/namei.c:3830 > path_openat+0x2ec/0x770 fs/namei.c:3989 > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > do_sys_open fs/open.c:1443 [inline] > __do_sys_openat fs/open.c:1459 [inline] > __se_sys_openat fs/open.c:1454 [inline] > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > Freed by task 146638: > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > poison_slab_object mm/kasan/common.c:247 [inline] > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > 
kasan_slab_free include/linux/kasan.h:233 [inline] > slab_free_hook mm/slub.c:2353 [inline] > slab_free mm/slub.c:4609 [inline] > kfree+0x14f/0x4d0 mm/slub.c:4757 > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > __fput+0x402/0xb50 fs/file_table.c:464 > task_work_run+0x155/0x250 kernel/task_work.c:227 > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > The buggy address belongs to the object at ffff888120844000 > The buggy address is located 1480 bytes inside of > The buggy address belongs to the physical page: > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120840 > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 > raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > head: 0017ffffc0000003 ffffea0004821001 ffffffffffffffff 0000000000000000 > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > Memory state around the buggy address: > ffff888120844480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888120844500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888120844580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff888120844600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888120844680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ================================================================== > > drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index bfea608a7106..997a2cc1a635 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > entity->oldest_job_waiting = ts; > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > - drm_sched_entity_compare_before); > + if (!entity->stopped) { > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > + drm_sched_entity_compare_before); > + } If this is a race, then this patch here is broken, too, because you're checking the 'stopped' boolean as the callers of that function do, too – just later. :O Could still race, just less likely. The proper way to fix it would then be to address the issue where the locking is supposed to happen. 
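To spell the interleaving out, a deliberately simplified sketch (not the
actual call chain; "L" below is just a stand-in for whatever lock is meant
to serialize the two sides -- all other names are as in sched_main.c):

	/* pusher side: L held across the check *and* the insert */
	spin_lock(L);
	if (!entity->stopped)
		rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
			      drm_sched_entity_compare_before);
	spin_unlock(L);

	/* stopper side: only closes the race if it takes the same L */
	spin_lock(L);
	entity->stopped = true;
	spin_unlock(L);

If the stopper flips ->stopped (and later frees the entity) without holding
L, it can still run between the check and rb_add_cached() above, leaving a
freed rb node linked into rq->rb_tree_root -- which is what the UAF reports
show. Moving the check into drm_sched_rq_update_fifo_locked() only narrows
that window.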
Let's look at, for example, drm_sched_entity_push_job(): void drm_sched_entity_push_job(struct drm_sched_job *sched_job) { (Bla bla bla) ………… /* first job wakes up scheduler */ if (first) { struct drm_gpu_scheduler *sched; struct drm_sched_rq *rq; /* Add the entity to the run queue */ spin_lock(&entity->lock); if (entity->stopped) { <---- Aha! spin_unlock(&entity->lock); DRM_ERROR("Trying to push to a killed entity\n"); return; } rq = entity->rq; sched = rq->sched; spin_lock(&rq->lock); drm_sched_rq_add_entity(rq, entity); if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! spin_unlock(&rq->lock); spin_unlock(&entity->lock); But the locks are still being held. So that "shouldn't be happening"(tm). Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() stop entities. The former holds appropriate locks, but drm_sched_fini() doesn't. So that looks like a hot candidate to me. Opinions? On the other hand, aren't drivers prohibited from calling drm_sched_entity_push_job() after calling drm_sched_fini()? If the fuzzer does that, then it's not the scheduler's fault. Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? Would be cool if Tvrtko and Christian take a look. Maybe we even have a fundamental design issue. Regards P. > } > > /** ^ permalink raw reply [flat|nested] 17+ messages in thread
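For concreteness, here is a rough sketch of the experiment suggested above, i.e. taking the entity lock in drm_sched_fini()'s teardown loop the way drm_sched_entity_kill() does. This is purely hypothetical (not a proposed patch), and the follow-up message below explains why this particular nesting cannot work as-is:

/* Hypothetical change to the drm_sched_fini() teardown loop; sketch only.
 * Note the nesting: entity->lock would be taken inside rq->lock here. */
for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
	struct drm_sched_rq *rq = sched->sched_rq[i];

	spin_lock(&rq->lock);
	list_for_each_entry(s_entity, &rq->entities, list) {
		spin_lock(&s_entity->lock);	/* new: serialize with push_job */
		s_entity->stopped = true;
		spin_unlock(&s_entity->lock);
	}
	spin_unlock(&rq->lock);
	kfree(sched->sched_rq[i]);
}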
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 7:52 ` Philipp Stanner @ 2025-07-21 8:16 ` Philipp Stanner 2025-07-21 10:14 ` Danilo Krummrich 2025-07-22 20:05 ` James 2025-08-14 10:42 ` Tvrtko Ursulin 1 sibling, 2 replies; 17+ messages in thread From: Philipp Stanner @ 2025-07-21 8:16 UTC (permalink / raw) To: phasta, James Flowers, matthew.brost, dakr, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > +Cc Tvrtko, who's currently reworking FIFO and RR. > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > Fixes an issue where entities are added to the run queue in > > drm_sched_rq_update_fifo_locked after being killed, causing a > > slab-use-after-free error. > > > > Signed-off-by: James Flowers <bold.zone2373@fastmail.com> > > --- > > This issue was detected by syzkaller running on a Steam Deck OLED. > > Unfortunately I don't have a reproducer for it. I've > > Well, now that's kind of an issue – if you don't have a reproducer, how > can you know that your patch is correct? How can we? > > It would certainly be good to know what the fuzz testing framework > does. > > > included the KASAN reports below: > > > Anyways, KASAN reports look interesting. But those might be many > different issues. Again, would be good to know what the fuzzer has been > testing. Can you maybe split this fuzz test into sub-tests? I suspsect > those might be different faults. > > > Anyways, taking a first look… > > > > > > ================================================================== > > BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 > > Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 > > CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0-flowejam-+ #1 > > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > > Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:94 [inline] > > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > > print_report+0xfc/0x1ff mm/kasan/report.c:521 > > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > > rb_next+0xda/0x160 lib/rbtree.c:505 > > drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/sched_main.c:332 [inline] [gpu_sched] > > drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/sched_main.c:1081 [gpu_sched] > > drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/sched_main.c:1206 [gpu_sched] > > process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 > > process_scheduled_works kernel/workqueue.c:3319 [inline] > > worker_thread+0x734/0x1060 kernel/workqueue.c:3400 > > kthread+0x3fd/0x810 kernel/kthread.c:464 > > ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 > > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 > > </TASK> > > Allocated by task 73472: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > > drm_file_alloc+0x5d0/0xa00 
drivers/gpu/drm/drm_file.c:171 > > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > > vfs_open+0x87/0x3f0 fs/open.c:1086 > > do_open+0x72f/0xf80 fs/namei.c:3830 > > path_openat+0x2ec/0x770 fs/namei.c:3989 > > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > > do_sys_open fs/open.c:1443 [inline] > > __do_sys_openat fs/open.c:1459 [inline] > > __se_sys_openat fs/open.c:1454 [inline] > > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 73472: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > > poison_slab_object mm/kasan/common.c:247 [inline] > > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > > kasan_slab_free include/linux/kasan.h:233 [inline] > > slab_free_hook mm/slub.c:2353 [inline] > > slab_free mm/slub.c:4609 [inline] > > kfree+0x14f/0x4d0 mm/slub.c:4757 > > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff888180508000 > > The buggy address is located 1504 bytes inside of > > The buggy address belongs to the physical page: > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180508 > > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > > page_type: f5(slab) > > raw: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 > > raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 > > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000003 ffffea0006014201 ffffffffffffffff 0000000000000000 > > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > > ffff888180508480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888180508500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > ffff888180508580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff888180508600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888180508680: fb 
fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ================================================================== > > ================================================================== > > BUG: KASAN: slab-use-after-free in rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] > > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] > > BUG: KASAN: slab-use-after-free in rb_erase+0x157c/0x1b10 lib/rbtree.c:443 > > Write of size 8 at addr ffff88816414c5d0 by task syz.2.3004/12376 > > CPU: 7 UID: 65534 PID: 12376 Comm: syz.2.3004 Not tainted 6.14.0-flowejam-+ #1 > > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:94 [inline] > > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > > print_report+0xfc/0x1ff mm/kasan/report.c:521 > > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > > rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] > > __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] > > rb_erase+0x157c/0x1b10 lib/rbtree.c:443 > > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] > > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] > > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] > > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] > > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] > > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] > > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] > > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] > > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > exit_task_work include/linux/task_work.h:40 [inline] > > do_exit+0x841/0xf60 kernel/exit.c:938 > > do_group_exit+0xda/0x2b0 kernel/exit.c:1087 > > get_signal+0x171f/0x19d0 kernel/signal.c:3036 > > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7f2d90da36ed > > Code: Unable to access opcode bytes at 0x7f2d90da36c3. 
> > RSP: 002b:00007f2d91b710d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca > > RAX: 0000000000000000 RBX: 00007f2d90fe6088 RCX: 00007f2d90da36ed > > RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f2d90fe6088 > > RBP: 00007f2d90fe6080 R08: 0000000000000000 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d90fe608c > > R13: 0000000000000000 R14: 0000000000000002 R15: 00007ffc34a67bd0 > > </TASK> > > Allocated by task 12381: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > > vfs_open+0x87/0x3f0 fs/open.c:1086 > > do_open+0x72f/0xf80 fs/namei.c:3830 > > path_openat+0x2ec/0x770 fs/namei.c:3989 > > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > > do_sys_open fs/open.c:1443 [inline] > > __do_sys_openat fs/open.c:1459 [inline] > > __se_sys_openat fs/open.c:1454 [inline] > > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 12381: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > > poison_slab_object mm/kasan/common.c:247 [inline] > > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > > kasan_slab_free include/linux/kasan.h:233 [inline] > > slab_free_hook mm/slub.c:2353 [inline] > > slab_free mm/slub.c:4609 [inline] > > kfree+0x14f/0x4d0 mm/slub.c:4757 > > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff88816414c000 > > The buggy address is located 1488 bytes inside of > > The buggy address belongs to the physical page: > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x164148 > > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > > flags: 
0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > > page_type: f5(slab) > > raw: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 > > raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 > > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000003 ffffea0005905201 ffffffffffffffff 0000000000000000 > > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > > ffff88816414c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88816414c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > ffff88816414c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff88816414c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88816414c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ================================================================== > > ================================================================== > > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] > > BUG: KASAN: slab-use-after-free in rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 > > Read of size 8 at addr ffff88812ebcc5e0 by task syz.1.814/6553 > > CPU: 0 UID: 65534 PID: 6553 Comm: syz.1.814 Not tainted 6.14.0-flowejam-+ #1 > > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:94 [inline] > > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > > print_report+0xfc/0x1ff mm/kasan/report.c:521 > > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > > __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] > > rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 > > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] > > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] > > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] > > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] > > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] > > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] > > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] > > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] > > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] > > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > 
entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7fd23eba36ed > > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > > RSP: 002b:00007ffc2943a358 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 > > RAX: 0000000000000000 RBX: 00007ffc2943a428 RCX: 00007fd23eba36ed > > RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 > > RBP: 00007fd23ede7ba0 R08: 0000000000000001 R09: 0000000c00000000 > > R10: 00007fd23ea00000 R11: 0000000000000246 R12: 00007fd23ede5fac > > R13: 00007fd23ede5fa0 R14: 0000000000059ad1 R15: 0000000000059a8e > > </TASK> > > Allocated by task 6559: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > > vfs_open+0x87/0x3f0 fs/open.c:1086 > > do_open+0x72f/0xf80 fs/namei.c:3830 > > path_openat+0x2ec/0x770 fs/namei.c:3989 > > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > > do_sys_open fs/open.c:1443 [inline] > > __do_sys_openat fs/open.c:1459 [inline] > > __se_sys_openat fs/open.c:1454 [inline] > > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 6559: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 > > poison_slab_object mm/kasan/common.c:247 [inline] > > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > > kasan_slab_free include/linux/kasan.h:233 [inline] > > slab_free_hook mm/slub.c:2353 [inline] > > slab_free mm/slub.c:4609 [inline] > > kfree+0x14f/0x4d0 mm/slub.c:4757 > > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > get_signal+0x1be/0x19d0 kernel/signal.c:2809 > > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 > > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff88812ebcc000 > > The buggy address is 
located 1504 bytes inside of > > The buggy address belongs to the physical page: > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12ebc8 > > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > > page_type: f5(slab) > > raw: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 > > raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 > > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000003 ffffea0004baf201 ffffffffffffffff 0000000000000000 > > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > > ffff88812ebcc480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88812ebcc500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > ffff88812ebcc580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff88812ebcc600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff88812ebcc680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ================================================================== > > ================================================================== > > BUG: KASAN: slab-use-after-free in drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] > > BUG: KASAN: slab-use-after-free in rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] > > BUG: KASAN: slab-use-after-free in drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] > > Read of size 8 at addr ffff8881208445c8 by task syz.1.49115/146644 > > CPU: 7 UID: 65534 PID: 146644 Comm: syz.1.49115 Not tainted 6.14.0-flowejam-+ #1 > > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:94 [inline] > > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 > > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 > > print_report+0xfc/0x1ff mm/kasan/report.c:521 > > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 > > drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] > > rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] > > drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] > > drm_sched_entity_push_job+0x509/0x5d0 drivers/gpu/drm/scheduler/sched_entity.c:623 [gpu_sched] > > This might be a race between entity killing and the push_job. 
Let's > look at your patch below… > > > amdgpu_job_submit+0x1a4/0x270 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:314 [amdgpu] > > amdgpu_vm_sdma_commit+0x1f9/0x7d0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c:122 [amdgpu] > > amdgpu_vm_pt_clear+0x540/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:422 [amdgpu] > > amdgpu_vm_init+0x9c2/0x12f0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2609 [amdgpu] > > amdgpu_driver_open_kms+0x274/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1418 [amdgpu] > > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > > vfs_open+0x87/0x3f0 fs/open.c:1086 > > do_open+0x72f/0xf80 fs/namei.c:3830 > > path_openat+0x2ec/0x770 fs/namei.c:3989 > > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > > do_sys_open fs/open.c:1443 [inline] > > __do_sys_openat fs/open.c:1459 [inline] > > __se_sys_openat fs/open.c:1454 [inline] > > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > RIP: 0033:0x7feb303a36ed > > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > > RSP: 002b:00007feb3123c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > > RAX: ffffffffffffffda RBX: 00007feb305e5fa0 RCX: 00007feb303a36ed > > RDX: 0000000000000002 RSI: 0000200000000140 RDI: ffffffffffffff9c > > RBP: 00007feb30447722 R08: 0000000000000000 R09: 0000000000000000 > > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > > R13: 0000000000000001 R14: 00007feb305e5fa0 R15: 00007ffcfd0a3460 > > </TASK> > > Allocated by task 146638: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 > > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] > > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] > > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] > > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 > > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 > > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 > > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 > > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 > > do_dentry_open+0x743/0x1bf0 fs/open.c:956 > > vfs_open+0x87/0x3f0 fs/open.c:1086 > > do_open+0x72f/0xf80 fs/namei.c:3830 > > path_openat+0x2ec/0x770 fs/namei.c:3989 > > do_filp_open+0x1ff/0x420 fs/namei.c:4016 > > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 > > do_sys_open fs/open.c:1443 [inline] > > __do_sys_openat fs/open.c:1459 [inline] > > __se_sys_openat fs/open.c:1454 [inline] > > __x64_sys_openat+0x149/0x210 fs/open.c:1454 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 146638: > > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 > > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 > > kasan_save_free_info+0x3b/0x70 
mm/kasan/generic.c:576 > > poison_slab_object mm/kasan/common.c:247 [inline] > > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 > > kasan_slab_free include/linux/kasan.h:233 [inline] > > slab_free_hook mm/slub.c:2353 [inline] > > slab_free mm/slub.c:4609 [inline] > > kfree+0x14f/0x4d0 mm/slub.c:4757 > > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] > > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 > > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] > > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 > > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 > > __fput+0x402/0xb50 fs/file_table.c:464 > > task_work_run+0x155/0x250 kernel/task_work.c:227 > > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] > > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] > > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] > > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] > > syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 > > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff888120844000 > > The buggy address is located 1480 bytes inside of > > The buggy address belongs to the physical page: > > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120840 > > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) > > page_type: f5(slab) > > raw: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 > > raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 > > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 > > head: 0017ffffc0000003 ffffea0004821001 ffffffffffffffff 0000000000000000 > > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > > ffff888120844480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888120844500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > ffff888120844580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff888120844600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ffff888120844680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ================================================================== > > > > drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index bfea608a7106..997a2cc1a635 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > entity->oldest_job_waiting = ts; > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > - drm_sched_entity_compare_before); > > + if (!entity->stopped) { > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > + drm_sched_entity_compare_before); > > + } > > If this is a race, then this patch here is broken, too, because you're > checking the 'stopped' boolean as the callers of that function do, too > – just later. :O > > Could still race, just less likely. 
> > The proper way to fix it would then be to address the issue where the > locking is supposed to happen. Let's look at, for example, > drm_sched_entity_push_job(): > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > { > (Bla bla bla) > > ………… > > /* first job wakes up scheduler */ > if (first) { > struct drm_gpu_scheduler *sched; > struct drm_sched_rq *rq; > > /* Add the entity to the run queue */ > spin_lock(&entity->lock); > if (entity->stopped) { <---- Aha! > spin_unlock(&entity->lock); > > DRM_ERROR("Trying to push to a killed entity\n"); > return; > } > > rq = entity->rq; > sched = rq->sched; > > spin_lock(&rq->lock); > drm_sched_rq_add_entity(rq, entity); > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > spin_unlock(&rq->lock); > spin_unlock(&entity->lock); > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > stop entities. The former holds appropriate locks, but drm_sched_fini() > doesn't. So that looks like a hot candidate to me. Opinions? > > On the other hand, aren't drivers prohibited from calling > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > fuzzer does that, then it's not the scheduler's fault. > > Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? Ah no, forget about that. In drm_sched_fini(), you'd have to take the locks in reverse order as in drm_sched_entity_push/pop_job(), thereby replacing race with deadlock. I suspect that this is an issue in amdgpu. But let's wait for Christian. P. > > Would be cool if Tvrtko and Christian take a look. Maybe we even have a > fundamental design issue. > > > Regards > P. > > > > } > > > > /** > ^ permalink raw reply [flat|nested] 17+ messages in thread
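To spell out the inversion described above, a sketch of the two lock orders (the ordering only, not the literal code paths):

/*
 * drm_sched_entity_push_job():          drm_sched_fini() + entity locking:
 *   spin_lock(&entity->lock);             spin_lock(&rq->lock);
 *   spin_lock(&rq->lock);                 spin_lock(&entity->lock);   <-- reversed
 *   ...                                   ...
 *   spin_unlock(&rq->lock);               spin_unlock(&entity->lock);
 *   spin_unlock(&entity->lock);           spin_unlock(&rq->lock);
 *
 * With the orders reversed, a thread in push_job() and a thread in fini()
 * can each end up holding the lock the other one is waiting for -- the
 * classic ABBA deadlock, which is why adding entity locking to
 * drm_sched_fini() would merely trade the race for a deadlock.
 */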
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 8:16 ` Philipp Stanner @ 2025-07-21 10:14 ` Danilo Krummrich 2025-07-21 18:07 ` Matthew Brost 2025-07-22 20:05 ` James 1 sibling, 1 reply; 17+ messages in thread From: Danilo Krummrich @ 2025-07-21 10:14 UTC (permalink / raw) To: Philipp Stanner Cc: phasta, James Flowers, matthew.brost, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: >> On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: >> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> > index bfea608a7106..997a2cc1a635 100644 >> > --- a/drivers/gpu/drm/scheduler/sched_main.c >> > +++ b/drivers/gpu/drm/scheduler/sched_main.c >> > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, >> > >> > entity->oldest_job_waiting = ts; >> > >> > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> > - drm_sched_entity_compare_before); >> > + if (!entity->stopped) { >> > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> > + drm_sched_entity_compare_before); >> > + } >> >> If this is a race, then this patch here is broken, too, because you're >> checking the 'stopped' boolean as the callers of that function do, too >> – just later. :O >> >> Could still race, just less likely. >> >> The proper way to fix it would then be to address the issue where the >> locking is supposed to happen. Let's look at, for example, >> drm_sched_entity_push_job(): >> >> >> void drm_sched_entity_push_job(struct drm_sched_job *sched_job) >> { >> (Bla bla bla) >> >> ………… >> >> /* first job wakes up scheduler */ >> if (first) { >> struct drm_gpu_scheduler *sched; >> struct drm_sched_rq *rq; >> >> /* Add the entity to the run queue */ >> spin_lock(&entity->lock); >> if (entity->stopped) { <---- Aha! >> spin_unlock(&entity->lock); >> >> DRM_ERROR("Trying to push to a killed entity\n"); >> return; >> } >> >> rq = entity->rq; >> sched = rq->sched; >> >> spin_lock(&rq->lock); >> drm_sched_rq_add_entity(rq, entity); >> >> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) >> drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! >> >> spin_unlock(&rq->lock); >> spin_unlock(&entity->lock); >> >> But the locks are still being hold. So that "shouldn't be happening"(tm). >> >> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() >> stop entities. The former holds appropriate locks, but drm_sched_fini() >> doesn't. So that looks like a hot candidate to me. Opinions? >> >> On the other hand, aren't drivers prohibited from calling >> drm_sched_entity_push_job() after calling drm_sched_fini()? If the >> fuzzer does that, then it's not the scheduler's fault. Exactly, this is the first question to ask. 
And I think it's even more restrictive: In drm_sched_fini() for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { struct drm_sched_rq *rq = sched->sched_rq[i]; spin_lock(&rq->lock); list_for_each_entry(s_entity, &rq->entities, list) /* * Prevents reinsertion and marks job_queue as idle, * it will be removed from the rq in drm_sched_entity_fini() * eventually */ s_entity->stopped = true; spin_unlock(&rq->lock); kfree(sched->sched_rq[i]); } In drm_sched_entity_kill() static void drm_sched_entity_kill(struct drm_sched_entity *entity) { struct drm_sched_job *job; struct dma_fence *prev; if (!entity->rq) return; spin_lock(&entity->lock); entity->stopped = true; drm_sched_rq_remove_entity(entity->rq, entity); spin_unlock(&entity->lock); [...] } If this runs concurrently, this is a UAF as well. Personally, I have always been working with the assumption that entities have to be torn down *before* the scheduler, but those lifetimes are not documented properly. There are two solutions: (1) Strictly require all entities to be torn down before drm_sched_fini(), i.e. stick to the natural ownership and lifetime rules here (see below). (2) Actually protect *any* changes of the relevant fields of the entity structure with the entity lock. While (2) seems rather obvious, we run into lock inversion with this approach, as you note below as well. And I think drm_sched_fini() should not mess with entities anyways. The ownership here seems obvious: The scheduler *owns* a resource that is used by entities. Consequently, entities are not allowed to out-live the scheduler. Surely, the current implementation to just take the resource away from the entity under the hood can work as well with appropriate locking, but that's a mess. If the resource *really* needs to be shared for some reason (which I don't see), shared ownership, i.e. reference counting, is much less error prone. ^ permalink raw reply [flat|nested] 17+ messages in thread
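A minimal sketch of what option (1) means on the driver side; the struct names my_file and my_device are invented for illustration, while drm_sched_entity_fini() and drm_sched_fini() are the existing API entry points:

/* Per-open-file teardown: entities go first. */
static void my_file_close(struct my_file *f)
{
	/* An entity must not outlive the scheduler it is attached to. */
	drm_sched_entity_fini(&f->entity);
}

/* Device teardown: only after every entity (and job) is gone. */
static void my_device_fini(struct my_device *dev)
{
	drm_sched_fini(&dev->sched);
}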
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 10:14 ` Danilo Krummrich @ 2025-07-21 18:07 ` Matthew Brost 2025-07-22 7:37 ` Philipp Stanner 0 siblings, 1 reply; 17+ messages in thread From: Matthew Brost @ 2025-07-21 18:07 UTC (permalink / raw) To: Danilo Krummrich Cc: Philipp Stanner, phasta, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > >> On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > >> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > >> > index bfea608a7106..997a2cc1a635 100644 > >> > --- a/drivers/gpu/drm/scheduler/sched_main.c > >> > +++ b/drivers/gpu/drm/scheduler/sched_main.c > >> > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > >> > > >> > entity->oldest_job_waiting = ts; > >> > > >> > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > >> > - drm_sched_entity_compare_before); > >> > + if (!entity->stopped) { > >> > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > >> > + drm_sched_entity_compare_before); > >> > + } > >> > >> If this is a race, then this patch here is broken, too, because you're > >> checking the 'stopped' boolean as the callers of that function do, too > >> – just later. :O > >> > >> Could still race, just less likely. > >> > >> The proper way to fix it would then be to address the issue where the > >> locking is supposed to happen. Let's look at, for example, > >> drm_sched_entity_push_job(): > >> > >> > >> void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > >> { > >> (Bla bla bla) > >> > >> ………… > >> > >> /* first job wakes up scheduler */ > >> if (first) { > >> struct drm_gpu_scheduler *sched; > >> struct drm_sched_rq *rq; > >> > >> /* Add the entity to the run queue */ > >> spin_lock(&entity->lock); > >> if (entity->stopped) { <---- Aha! > >> spin_unlock(&entity->lock); > >> > >> DRM_ERROR("Trying to push to a killed entity\n"); > >> return; > >> } > >> > >> rq = entity->rq; > >> sched = rq->sched; > >> > >> spin_lock(&rq->lock); > >> drm_sched_rq_add_entity(rq, entity); > >> > >> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > >> drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > >> > >> spin_unlock(&rq->lock); > >> spin_unlock(&entity->lock); > >> > >> But the locks are still being hold. So that "shouldn't be happening"(tm). > >> > >> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > >> stop entities. The former holds appropriate locks, but drm_sched_fini() > >> doesn't. So that looks like a hot candidate to me. Opinions? > >> > >> On the other hand, aren't drivers prohibited from calling > >> drm_sched_entity_push_job() after calling drm_sched_fini()? If the > >> fuzzer does that, then it's not the scheduler's fault. > > Exactly, this is the first question to ask. 
> > And I think it's even more restrictive: > > In drm_sched_fini() > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > struct drm_sched_rq *rq = sched->sched_rq[i]; > > spin_lock(&rq->lock); > list_for_each_entry(s_entity, &rq->entities, list) > /* > * Prevents reinsertion and marks job_queue as idle, > * it will be removed from the rq in drm_sched_entity_fini() > * eventually > */ > s_entity->stopped = true; > spin_unlock(&rq->lock); > kfree(sched->sched_rq[i]); > } > > In drm_sched_entity_kill() > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > { > struct drm_sched_job *job; > struct dma_fence *prev; > > if (!entity->rq) > return; > > spin_lock(&entity->lock); > entity->stopped = true; > drm_sched_rq_remove_entity(entity->rq, entity); > spin_unlock(&entity->lock); > > [...] > } > > If this runs concurrently, this is a UAF as well. > > Personally, I have always been working with the assupmtion that entites have to > be torn down *before* the scheduler, but those lifetimes are not documented > properly. Yes, this is my assumption too. I would even take it further: an entity shouldn't be torn down until all jobs associated with it are freed as well. I think this would solve a lot of issues I've seen on the list related to UAF, teardown, etc. > > There are two solutions: > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > i.e. stick to the natural ownership and lifetime rules here (see below). > > (2) Actually protect *any* changes of the relevent fields of the entity > structure with the entity lock. > > While (2) seems rather obvious, we run into lock inversion with this approach, > as you note below as well. And I think drm_sched_fini() should not mess with > entities anyways. > > The ownership here seems obvious: > > The scheduler *owns* a resource that is used by entities. Consequently, entities > are not allowed to out-live the scheduler. > > Surely, the current implementation to just take the resource away from the > entity under the hood can work as well with appropriate locking, but that's a > mess. > > If the resource *really* needs to be shared for some reason (which I don't see), > shared ownership, i.e. reference counting, is much less error prone. Yes, Xe solves all of this via reference counting (jobs refcount the entity). It's a bit easier in Xe since the scheduler and entities are the same object due to their 1:1 relationship. But even in non-1:1 relationships, an entity could refcount the scheduler. The teardown sequence would then be: all jobs complete on the entity → teardown the entity → all entities torn down → teardown the scheduler. Matt ^ permalink raw reply [flat|nested] 17+ messages in thread
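A sketch of that reference-counting chain, with made-up structure and function names rather than the actual Xe code; each object pins the object it depends on, so release order is forced to be jobs, then entities, then the scheduler:

struct my_sched {
	struct kref ref;
	struct drm_gpu_scheduler base;
};

struct my_entity {
	struct kref ref;
	struct drm_sched_entity base;
	struct my_sched *sched;		/* counted reference */
};

struct my_job {
	struct drm_sched_job base;
	struct my_entity *entity;	/* counted reference */
};

static void my_sched_release(struct kref *ref)
{
	struct my_sched *s = container_of(ref, struct my_sched, ref);

	drm_sched_fini(&s->base);	/* no entity can reach it anymore */
	kfree(s);
}

static void my_entity_release(struct kref *ref)
{
	struct my_entity *e = container_of(ref, struct my_entity, ref);

	drm_sched_entity_fini(&e->base);
	kref_put(&e->sched->ref, my_sched_release);	/* entity -> scheduler */
	kfree(e);
}

static void my_job_free(struct my_job *job)
{
	drm_sched_job_cleanup(&job->base);
	kref_put(&job->entity->ref, my_entity_release);	/* job -> entity */
	kfree(job);
}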
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 18:07 ` Matthew Brost @ 2025-07-22 7:37 ` Philipp Stanner 2025-07-22 8:07 ` Matthew Brost 0 siblings, 1 reply; 17+ messages in thread From: Philipp Stanner @ 2025-07-22 7:37 UTC (permalink / raw) To: Matthew Brost, Danilo Krummrich Cc: phasta, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > - drm_sched_entity_compare_before); > > > > > + if (!entity->stopped) { > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > + drm_sched_entity_compare_before); > > > > > + } > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > – just later. :O > > > > > > > > Could still race, just less likely. > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > locking is supposed to happen. Let's look at, for example, > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > { > > > > (Bla bla bla) > > > > > > > > ………… > > > > > > > > /* first job wakes up scheduler */ > > > > if (first) { > > > > struct drm_gpu_scheduler *sched; > > > > struct drm_sched_rq *rq; > > > > > > > > /* Add the entity to the run queue */ > > > > spin_lock(&entity->lock); > > > > if (entity->stopped) { <---- Aha! > > > > spin_unlock(&entity->lock); > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > return; > > > > } > > > > > > > > rq = entity->rq; > > > > sched = rq->sched; > > > > > > > > spin_lock(&rq->lock); > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > > > spin_unlock(&rq->lock); > > > > spin_unlock(&entity->lock); > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > > fuzzer does that, then it's not the scheduler's fault. > > > > Exactly, this is the first question to ask. 
> > > > And I think it's even more restrictive: > > > > In drm_sched_fini() > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > spin_lock(&rq->lock); > > list_for_each_entry(s_entity, &rq->entities, list) > > /* > > * Prevents reinsertion and marks job_queue as idle, > > * it will be removed from the rq in drm_sched_entity_fini() > > * eventually > > */ > > s_entity->stopped = true; > > spin_unlock(&rq->lock); > > kfree(sched->sched_rq[i]); > > } > > > > In drm_sched_entity_kill() > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > { > > struct drm_sched_job *job; > > struct dma_fence *prev; > > > > if (!entity->rq) > > return; > > > > spin_lock(&entity->lock); > > entity->stopped = true; > > drm_sched_rq_remove_entity(entity->rq, entity); > > spin_unlock(&entity->lock); > > > > [...] > > } > > > > If this runs concurrently, this is a UAF as well. > > > > Personally, I have always been working with the assupmtion that entites have to > > be torn down *before* the scheduler, but those lifetimes are not documented > > properly. > > Yes, this is my assumption too. I would even take it further: an entity > shouldn't be torn down until all jobs associated with it are freed as > well. I think this would solve a lot of issues I've seen on the list > related to UAF, teardown, etc. That's kind of impossible with the new tear down design, because drm_sched_fini() ensures that all jobs are freed on teardown. And drm_sched_fini() wouldn't be called before all jobs are gone, effectively resulting in a chicken-egg-problem, or rather: the driver implementing its own solution for teardown. P. > > > > > There are two solutions: > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > structure with the entity lock. > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > as you note below as well. And I think drm_sched_fini() should not mess with > > entities anyways. > > > > The ownership here seems obvious: > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > are not allowed to out-live the scheduler. > > > > Surely, the current implementation to just take the resource away from the > > entity under the hood can work as well with appropriate locking, but that's a > > mess. > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > shared ownership, i.e. reference counting, is much less error prone. > > Yes, Xe solves all of this via reference counting (jobs refcount the > entity). It's a bit easier in Xe since the scheduler and entities are > the same object due to their 1:1 relationship. But even in non-1:1 > relationships, an entity could refcount the scheduler. The teardown > sequence would then be: all jobs complete on the entity → teardown the > entity → all entities torn down → teardown the scheduler. > > Matt ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-22 7:37 ` Philipp Stanner @ 2025-07-22 8:07 ` Matthew Brost 2025-07-22 8:45 ` Matthew Brost 0 siblings, 1 reply; 17+ messages in thread From: Matthew Brost @ 2025-07-22 8:07 UTC (permalink / raw) To: phasta Cc: Danilo Krummrich, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > - drm_sched_entity_compare_before); > > > > > > + if (!entity->stopped) { > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > + drm_sched_entity_compare_before); > > > > > > + } > > > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > > – just later. :O > > > > > > > > > > Could still race, just less likely. > > > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > > locking is supposed to happen. Let's look at, for example, > > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > > { > > > > > (Bla bla bla) > > > > > > > > > > ………… > > > > > > > > > > /* first job wakes up scheduler */ > > > > > if (first) { > > > > > struct drm_gpu_scheduler *sched; > > > > > struct drm_sched_rq *rq; > > > > > > > > > > /* Add the entity to the run queue */ > > > > > spin_lock(&entity->lock); > > > > > if (entity->stopped) { <---- Aha! > > > > > spin_unlock(&entity->lock); > > > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > > return; > > > > > } > > > > > > > > > > rq = entity->rq; > > > > > sched = rq->sched; > > > > > > > > > > spin_lock(&rq->lock); > > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > > > > > spin_unlock(&rq->lock); > > > > > spin_unlock(&entity->lock); > > > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? 
If the > > > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > Exactly, this is the first question to ask. > > > > > > And I think it's even more restrictive: > > > > > > In drm_sched_fini() > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > spin_lock(&rq->lock); > > > list_for_each_entry(s_entity, &rq->entities, list) > > > /* > > > * Prevents reinsertion and marks job_queue as idle, > > > * it will be removed from the rq in drm_sched_entity_fini() > > > * eventually > > > */ > > > s_entity->stopped = true; > > > spin_unlock(&rq->lock); > > > kfree(sched->sched_rq[i]); > > > } > > > > > > In drm_sched_entity_kill() > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > > { > > > struct drm_sched_job *job; > > > struct dma_fence *prev; > > > > > > if (!entity->rq) > > > return; > > > > > > spin_lock(&entity->lock); > > > entity->stopped = true; > > > drm_sched_rq_remove_entity(entity->rq, entity); > > > spin_unlock(&entity->lock); > > > > > > [...] > > > } > > > > > > If this runs concurrently, this is a UAF as well. > > > > > > Personally, I have always been working with the assupmtion that entites have to > > > be torn down *before* the scheduler, but those lifetimes are not documented > > > properly. > > > > Yes, this is my assumption too. I would even take it further: an entity > > shouldn't be torn down until all jobs associated with it are freed as > > well. I think this would solve a lot of issues I've seen on the list > > related to UAF, teardown, etc. > > That's kind of impossible with the new tear down design, because > drm_sched_fini() ensures that all jobs are freed on teardown. And > drm_sched_fini() wouldn't be called before all jobs are gone, > effectively resulting in a chicken-egg-problem, or rather: the driver > implementing its own solution for teardown. > I've read this four times and I'm still generally confused. "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, that's how a refcounting-based solution works. drm_sched_fini would never be called if there were pending jobs. "drm_sched_fini() wouldn't be called before all jobs are gone" — See above. "effectively resulting in a chicken-and-egg problem" — A job is created after the scheduler, and it holds a reference to the scheduler until it's freed. I don't see how this idiom applies. "the driver implementing its own solution for teardown" — It’s just following the basic lifetime rules I outlined below. Perhaps Xe was ahead of its time, but the number of DRM scheduler blowups we've had is zero — maybe a strong indication that this design is correct. Matt > P. > > > > > > > > > > There are two solutions: > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > > structure with the entity lock. > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > > as you note below as well. And I think drm_sched_fini() should not mess with > > > entities anyways. > > > > > > The ownership here seems obvious: > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > > are not allowed to out-live the scheduler. 
> > > > > > Surely, the current implementation to just take the resource away from the > > > entity under the hood can work as well with appropriate locking, but that's a > > > mess. > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > > shared ownership, i.e. reference counting, is much less error prone. > > > > Yes, Xe solves all of this via reference counting (jobs refcount the > > entity). It's a bit easier in Xe since the scheduler and entities are > > the same object due to their 1:1 relationship. But even in non-1:1 > > relationships, an entity could refcount the scheduler. The teardown > > sequence would then be: all jobs complete on the entity → teardown the > > entity → all entities torn down → teardown the scheduler. > > > > Matt > ^ permalink raw reply [flat|nested] 17+ messages in thread
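
A minimal sketch of the refcounting model Matt describes above (jobs hold a reference on their entity, entities hold a reference on the scheduler); the wrapper types and helper names are hypothetical, this is an illustration of the lifetime rules, not the actual Xe code:

    #include <linux/kref.h>
    #include <linux/slab.h>
    #include <drm/gpu_scheduler.h>

    /* Illustration only: hypothetical driver-side wrappers. */
    struct my_sched {
            struct drm_gpu_scheduler base;
            struct kref refcount;
    };

    struct my_entity {
            struct drm_sched_entity base;
            struct kref refcount;
            struct my_sched *sched;         /* entity holds a scheduler reference */
    };

    struct my_job {
            struct drm_sched_job base;
            struct my_entity *entity;       /* job holds an entity reference */
    };

    static void my_sched_release(struct kref *kref)
    {
            struct my_sched *s = container_of(kref, struct my_sched, refcount);

            /* Last entity reference is gone: no entity outlives the scheduler. */
            drm_sched_fini(&s->base);
            kfree(s);
    }

    static void my_entity_release(struct kref *kref)
    {
            struct my_entity *e = container_of(kref, struct my_entity, refcount);

            /* Last job reference is gone: no job outlives the entity. */
            drm_sched_entity_fini(&e->base);
            kref_put(&e->sched->refcount, my_sched_release);
            kfree(e);
    }

    /* Called from the driver's free_job path, after drm_sched_job_cleanup(). */
    static void my_job_free(struct my_job *job)
    {
            kref_put(&job->entity->refcount, my_entity_release);
            kfree(job);
    }

With this shape, the teardown order quoted above falls out automatically: the driver drops its own entity references once it stops submitting, the last freed job drops the last entity reference, and the last finalized entity drops the last scheduler reference.
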
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-22 8:07 ` Matthew Brost @ 2025-07-22 8:45 ` Matthew Brost 2025-07-23 6:56 ` Philipp Stanner 0 siblings, 1 reply; 17+ messages in thread From: Matthew Brost @ 2025-07-22 8:45 UTC (permalink / raw) To: phasta Cc: Danilo Krummrich, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote: > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > - drm_sched_entity_compare_before); > > > > > > > + if (!entity->stopped) { > > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > + drm_sched_entity_compare_before); > > > > > > > + } > > > > > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > > > – just later. :O > > > > > > > > > > > > Could still race, just less likely. > > > > > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > > > locking is supposed to happen. Let's look at, for example, > > > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > > > { > > > > > > (Bla bla bla) > > > > > > > > > > > > ………… > > > > > > > > > > > > /* first job wakes up scheduler */ > > > > > > if (first) { > > > > > > struct drm_gpu_scheduler *sched; > > > > > > struct drm_sched_rq *rq; > > > > > > > > > > > > /* Add the entity to the run queue */ > > > > > > spin_lock(&entity->lock); > > > > > > if (entity->stopped) { <---- Aha! > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > > > return; > > > > > > } > > > > > > > > > > > > rq = entity->rq; > > > > > > sched = rq->sched; > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > > > doesn't. 
So that looks like a hot candidate to me. Opinions? > > > > > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > > > Exactly, this is the first question to ask. > > > > > > > > And I think it's even more restrictive: > > > > > > > > In drm_sched_fini() > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > > > spin_lock(&rq->lock); > > > > list_for_each_entry(s_entity, &rq->entities, list) > > > > /* > > > > * Prevents reinsertion and marks job_queue as idle, > > > > * it will be removed from the rq in drm_sched_entity_fini() > > > > * eventually > > > > */ > > > > s_entity->stopped = true; > > > > spin_unlock(&rq->lock); > > > > kfree(sched->sched_rq[i]); > > > > } > > > > > > > > In drm_sched_entity_kill() > > > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > > > { > > > > struct drm_sched_job *job; > > > > struct dma_fence *prev; > > > > > > > > if (!entity->rq) > > > > return; > > > > > > > > spin_lock(&entity->lock); > > > > entity->stopped = true; > > > > drm_sched_rq_remove_entity(entity->rq, entity); > > > > spin_unlock(&entity->lock); > > > > > > > > [...] > > > > } > > > > > > > > If this runs concurrently, this is a UAF as well. > > > > > > > > Personally, I have always been working with the assupmtion that entites have to > > > > be torn down *before* the scheduler, but those lifetimes are not documented > > > > properly. > > > > > > Yes, this is my assumption too. I would even take it further: an entity > > > shouldn't be torn down until all jobs associated with it are freed as > > > well. I think this would solve a lot of issues I've seen on the list > > > related to UAF, teardown, etc. > > > > That's kind of impossible with the new tear down design, because > > drm_sched_fini() ensures that all jobs are freed on teardown. And > > drm_sched_fini() wouldn't be called before all jobs are gone, > > effectively resulting in a chicken-egg-problem, or rather: the driver > > implementing its own solution for teardown. > > > > I've read this four times and I'm still generally confused. > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, > that's how a refcounting-based solution works. drm_sched_fini would > never be called if there were pending jobs. > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See > above. > > "effectively resulting in a chicken-and-egg problem" — A job is created > after the scheduler, and it holds a reference to the scheduler until > it's freed. I don't see how this idiom applies. > > "the driver implementing its own solution for teardown" — It’s just > following the basic lifetime rules I outlined below. Perhaps Xe was > ahead of its time, but the number of DRM scheduler blowups we've had is > zero — maybe a strong indication that this design is correct. > Sorry—self-reply. To expand on this: the reason Xe implemented a refcount-based teardown solution is because the internals of the DRM scheduler during teardown looked wildly scary. A lower layer should not impose its will on upper layers. I think that’s the root cause of all the problems I've listed. In my opinion, we should document the lifetime rules I’ve outlined, fix all drivers accordingly, and assert these rules in the scheduler layer. Matt > Matt > > > P. 
> > > > > > > > > > > > > > > There are two solutions: > > > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > > > structure with the entity lock. > > > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > > > as you note below as well. And I think drm_sched_fini() should not mess with > > > > entities anyways. > > > > > > > > The ownership here seems obvious: > > > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > > > are not allowed to out-live the scheduler. > > > > > > > > Surely, the current implementation to just take the resource away from the > > > > entity under the hood can work as well with appropriate locking, but that's a > > > > mess. > > > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > > > shared ownership, i.e. reference counting, is much less error prone. > > > > > > Yes, Xe solves all of this via reference counting (jobs refcount the > > > entity). It's a bit easier in Xe since the scheduler and entities are > > > the same object due to their 1:1 relationship. But even in non-1:1 > > > relationships, an entity could refcount the scheduler. The teardown > > > sequence would then be: all jobs complete on the entity → teardown the > > > entity → all entities torn down → teardown the scheduler. > > > > > > Matt > > ^ permalink raw reply [flat|nested] 17+ messages in thread
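
As an illustration of "assert these rules in the scheduler layer": if entities must be torn down before the scheduler, drm_sched_fini() could warn instead of silently stopping leftover entities. A sketch of such a check, based on the loop quoted in this thread, not a tested patch:

    /* Sketch only: assumes the rule "all entities are finalized before
     * drm_sched_fini()" is documented and enforced in the drivers.
     */
    void drm_sched_fini(struct drm_gpu_scheduler *sched)
    {
            unsigned int i;

            for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
                    struct drm_sched_rq *rq = sched->sched_rq[i];

                    spin_lock(&rq->lock);
                    /* With the lifetime rules above, this list is already empty;
                     * a non-empty list means a driver bug, not something to
                     * paper over by marking entities stopped here.
                     */
                    WARN_ON(!list_empty(&rq->entities));
                    spin_unlock(&rq->lock);

                    kfree(sched->sched_rq[i]);
            }
            /* ... rest of teardown ... */
    }
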
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-22 8:45 ` Matthew Brost @ 2025-07-23 6:56 ` Philipp Stanner 2025-07-24 4:13 ` Matthew Brost 0 siblings, 1 reply; 17+ messages in thread From: Philipp Stanner @ 2025-07-23 6:56 UTC (permalink / raw) To: Matthew Brost, phasta Cc: Danilo Krummrich, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote: > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote: > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > - drm_sched_entity_compare_before); > > > > > > > > + if (!entity->stopped) { > > > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > + drm_sched_entity_compare_before); > > > > > > > > + } > > > > > > > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > > > > – just later. :O > > > > > > > > > > > > > > Could still race, just less likely. > > > > > > > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > > > > locking is supposed to happen. Let's look at, for example, > > > > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > > > > { > > > > > > > (Bla bla bla) > > > > > > > > > > > > > > ………… > > > > > > > > > > > > > > /* first job wakes up scheduler */ > > > > > > > if (first) { > > > > > > > struct drm_gpu_scheduler *sched; > > > > > > > struct drm_sched_rq *rq; > > > > > > > > > > > > > > /* Add the entity to the run queue */ > > > > > > > spin_lock(&entity->lock); > > > > > > > if (entity->stopped) { <---- Aha! > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > > > > return; > > > > > > > } > > > > > > > > > > > > > > rq = entity->rq; > > > > > > > sched = rq->sched; > > > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). 
> > > > > > > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > > > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > > > > > Exactly, this is the first question to ask. > > > > > > > > > > And I think it's even more restrictive: > > > > > > > > > > In drm_sched_fini() > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > > > > > spin_lock(&rq->lock); > > > > > list_for_each_entry(s_entity, &rq->entities, list) > > > > > /* > > > > > * Prevents reinsertion and marks job_queue as idle, > > > > > * it will be removed from the rq in drm_sched_entity_fini() > > > > > * eventually > > > > > */ > > > > > s_entity->stopped = true; > > > > > spin_unlock(&rq->lock); > > > > > kfree(sched->sched_rq[i]); > > > > > } > > > > > > > > > > In drm_sched_entity_kill() > > > > > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > > > > { > > > > > struct drm_sched_job *job; > > > > > struct dma_fence *prev; > > > > > > > > > > if (!entity->rq) > > > > > return; > > > > > > > > > > spin_lock(&entity->lock); > > > > > entity->stopped = true; > > > > > drm_sched_rq_remove_entity(entity->rq, entity); > > > > > spin_unlock(&entity->lock); > > > > > > > > > > [...] > > > > > } > > > > > > > > > > If this runs concurrently, this is a UAF as well. > > > > > > > > > > Personally, I have always been working with the assupmtion that entites have to > > > > > be torn down *before* the scheduler, but those lifetimes are not documented > > > > > properly. > > > > > > > > Yes, this is my assumption too. I would even take it further: an entity > > > > shouldn't be torn down until all jobs associated with it are freed as > > > > well. I think this would solve a lot of issues I've seen on the list > > > > related to UAF, teardown, etc. > > > > > > That's kind of impossible with the new tear down design, because > > > drm_sched_fini() ensures that all jobs are freed on teardown. And > > > drm_sched_fini() wouldn't be called before all jobs are gone, > > > effectively resulting in a chicken-egg-problem, or rather: the driver > > > implementing its own solution for teardown. > > > > > > > I've read this four times and I'm still generally confused. > > > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, > > that's how a refcounting-based solution works. drm_sched_fini would > > never be called if there were pending jobs. > > > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See > > above. > > > > "effectively resulting in a chicken-and-egg problem" — A job is created > > after the scheduler, and it holds a reference to the scheduler until > > it's freed. I don't see how this idiom applies. > > > > "the driver implementing its own solution for teardown" — It’s just > > following the basic lifetime rules I outlined below. Perhaps Xe was > > ahead of its time, but the number of DRM scheduler blowups we've had is > > zero — maybe a strong indication that this design is correct. > > > > Sorry—self-reply. 
> > To expand on this: the reason Xe implemented a refcount-based teardown > solution is because the internals of the DRM scheduler during teardown > looked wildly scary. A lower layer should not impose its will on upper > layers. I think that’s the root cause of all the problems I've listed. > > In my opinion, we should document the lifetime rules I’ve outlined, fix > all drivers accordingly, and assert these rules in the scheduler layer. Everyone had a separate solution for that. Nouveau used a waitqueue. That's what happens when there's no centralized mechanism for solving a problem. Did you see the series we recently merged which repairs the memory leaks of drm/sched? It had been around for quite some time. https://lore.kernel.org/dri-devel/20250701132142.76899-3-phasta@kernel.org/ P. > > Matt > > > Matt > > > > > P. > > > > > > > > > > > > > > > > > > > > There are two solutions: > > > > > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > > > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > > > > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > > > > structure with the entity lock. > > > > > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > > > > as you note below as well. And I think drm_sched_fini() should not mess with > > > > > entities anyways. > > > > > > > > > > The ownership here seems obvious: > > > > > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > > > > are not allowed to out-live the scheduler. > > > > > > > > > > Surely, the current implementation to just take the resource away from the > > > > > entity under the hood can work as well with appropriate locking, but that's a > > > > > mess. > > > > > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > > > > shared ownership, i.e. reference counting, is much less error prone. > > > > > > > > Yes, Xe solves all of this via reference counting (jobs refcount the > > > > entity). It's a bit easier in Xe since the scheduler and entities are > > > > the same object due to their 1:1 relationship. But even in non-1:1 > > > > relationships, an entity could refcount the scheduler. The teardown > > > > sequence would then be: all jobs complete on the entity → teardown the > > > > entity → all entities torn down → teardown the scheduler. > > > > > > > > Matt > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
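
For reference, the waitqueue approach mentioned here amounts to the driver gating drm_sched_fini() on its own pending-job count; a generic sketch with made-up names, not Nouveau's actual implementation:

    #include <linux/atomic.h>
    #include <linux/wait.h>
    #include <drm/gpu_scheduler.h>

    /* Hypothetical driver-side teardown gate. */
    struct my_sched {
            struct drm_gpu_scheduler base;
            atomic_t pending_job_count;
            wait_queue_head_t job_done_wq;
    };

    /* Called from the driver's free_job callback for every job. */
    static void my_job_freed(struct my_sched *s)
    {
            if (atomic_dec_and_test(&s->pending_job_count))
                    wake_up(&s->job_done_wq);
    }

    static void my_sched_teardown(struct my_sched *s)
    {
            /* Block until every job has gone through free_job ... */
            wait_event(s->job_done_wq, atomic_read(&s->pending_job_count) == 0);
            /* ... so drm_sched_fini() never sees a pending job. */
            drm_sched_fini(&s->base);
    }
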
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-23 6:56 ` Philipp Stanner @ 2025-07-24 4:13 ` Matthew Brost 2025-07-24 4:17 ` Matthew Brost 0 siblings, 1 reply; 17+ messages in thread From: Matthew Brost @ 2025-07-24 4:13 UTC (permalink / raw) To: phasta Cc: Danilo Krummrich, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Wed, Jul 23, 2025 at 08:56:01AM +0200, Philipp Stanner wrote: > On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote: > > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote: > > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > > - drm_sched_entity_compare_before); > > > > > > > > > + if (!entity->stopped) { > > > > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > > + drm_sched_entity_compare_before); > > > > > > > > > + } > > > > > > > > > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > > > > > – just later. :O > > > > > > > > > > > > > > > > Could still race, just less likely. > > > > > > > > > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > > > > > locking is supposed to happen. Let's look at, for example, > > > > > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > > > > > { > > > > > > > > (Bla bla bla) > > > > > > > > > > > > > > > > ………… > > > > > > > > > > > > > > > > /* first job wakes up scheduler */ > > > > > > > > if (first) { > > > > > > > > struct drm_gpu_scheduler *sched; > > > > > > > > struct drm_sched_rq *rq; > > > > > > > > > > > > > > > > /* Add the entity to the run queue */ > > > > > > > > spin_lock(&entity->lock); > > > > > > > > if (entity->stopped) { <---- Aha! > > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > > > > > return; > > > > > > > > } > > > > > > > > > > > > > > > > rq = entity->rq; > > > > > > > > sched = rq->sched; > > > > > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! 
> > > > > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > > > > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > > > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > > > > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > > > > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > > > > > > > Exactly, this is the first question to ask. > > > > > > > > > > > > And I think it's even more restrictive: > > > > > > > > > > > > In drm_sched_fini() > > > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > > > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > list_for_each_entry(s_entity, &rq->entities, list) > > > > > > /* > > > > > > * Prevents reinsertion and marks job_queue as idle, > > > > > > * it will be removed from the rq in drm_sched_entity_fini() > > > > > > * eventually > > > > > > */ > > > > > > s_entity->stopped = true; > > > > > > spin_unlock(&rq->lock); > > > > > > kfree(sched->sched_rq[i]); > > > > > > } > > > > > > > > > > > > In drm_sched_entity_kill() > > > > > > > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > > > > > { > > > > > > struct drm_sched_job *job; > > > > > > struct dma_fence *prev; > > > > > > > > > > > > if (!entity->rq) > > > > > > return; > > > > > > > > > > > > spin_lock(&entity->lock); > > > > > > entity->stopped = true; > > > > > > drm_sched_rq_remove_entity(entity->rq, entity); > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > [...] > > > > > > } > > > > > > > > > > > > If this runs concurrently, this is a UAF as well. > > > > > > > > > > > > Personally, I have always been working with the assupmtion that entites have to > > > > > > be torn down *before* the scheduler, but those lifetimes are not documented > > > > > > properly. > > > > > > > > > > Yes, this is my assumption too. I would even take it further: an entity > > > > > shouldn't be torn down until all jobs associated with it are freed as > > > > > well. I think this would solve a lot of issues I've seen on the list > > > > > related to UAF, teardown, etc. > > > > > > > > That's kind of impossible with the new tear down design, because > > > > drm_sched_fini() ensures that all jobs are freed on teardown. And > > > > drm_sched_fini() wouldn't be called before all jobs are gone, > > > > effectively resulting in a chicken-egg-problem, or rather: the driver > > > > implementing its own solution for teardown. > > > > > > > > > > I've read this four times and I'm still generally confused. > > > > > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, > > > that's how a refcounting-based solution works. drm_sched_fini would > > > never be called if there were pending jobs. > > > > > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See > > > above. > > > > > > "effectively resulting in a chicken-and-egg problem" — A job is created > > > after the scheduler, and it holds a reference to the scheduler until > > > it's freed. I don't see how this idiom applies. 
> > > > > > "the driver implementing its own solution for teardown" — It’s just > > > following the basic lifetime rules I outlined below. Perhaps Xe was > > > ahead of its time, but the number of DRM scheduler blowups we've had is > > > zero — maybe a strong indication that this design is correct. > > > > > > > Sorry—self-reply. > > > > To expand on this: the reason Xe implemented a refcount-based teardown > > solution is because the internals of the DRM scheduler during teardown > > looked wildly scary. A lower layer should not impose its will on upper > > layers. I think that’s the root cause of all the problems I've listed. > > > > In my opinion, we should document the lifetime rules I’ve outlined, fix > > all drivers accordingly, and assert these rules in the scheduler layer. > > > Everyone had a separate solution for that. Nouveau used a waitqueue. > That's what happens when there's no centralized mechanism for solving a > problem. > Right, this is essentially my point — I think refcounting on the driver side is what the long-term solution really needs to be. To recap the basic rules: - Entities should not be finalized or freed until all jobs associated with them are freed. - Schedulers should not be finalized or freed until all associated entities are finalized. - Jobs should hold a reference to the entity. - Entities should hold a reference to the scheduler. I understand this won’t happen overnight — or perhaps ever — but adopting this model would solve a lot of problems across the subsystem and reduce a significant amount of complexity in the DRM scheduler. I’ll also acknowledge that part of this is my fault — years ago, I worked around problems (implemented above ref count model) in the scheduler related to teardown rather than proposing a common, unified solution, and clear lifetime rules. For drivers with a 1:1 entity-to-scheduler relationship, teardown becomes fairly simple: set the TDR timeout to zero and naturally let the remaining jobs flush out via TDR + the timedout_job callback, which signals the job’s fence. Free job, is called after that. For non-1:1 setups, we could introduce something like drm_sched_entity_kill, which would move all jobs on the pending list of a given entity to a kill list. A worker could then process that kill list — calling timedout_job and signaling the associated fences. Similarly, any jobs that had unresolved dependencies could be immediately added to the kill list. The kill list would have to be checked in drm_sched_free_job_work too. This would ensure that all jobs submitted would go through the full lifecycle: - run_job is called - free_job is called - If the fence returned from run_job needs to be artificially signaled, timedout_job is called We can add the infrastructure for this and once all driver adhere this model, clean up ugliness in the scheduler related to teardown and all races here. > Did you see the series we recently merged which repairs the memory > leaks of drm/sched? It had been around for quite some time. > > https://lore.kernel.org/dri-devel/20250701132142.76899-3-phasta@kernel.org/ > I would say this is just hacking around the fundamental issues with the lifetime of these objects. Do you see anything in Nouveau that would prevent the approach I described above from working? Also, what if jobs have dependencies that aren't even on the pending list yet? This further illustrates the problems with trying to finalize objects while child objects (entities, job) are still around. Matt > > P. 
> > > > > Matt > > > > > Matt > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > There are two solutions: > > > > > > > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > > > > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > > > > > > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > > > > > structure with the entity lock. > > > > > > > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > > > > > as you note below as well. And I think drm_sched_fini() should not mess with > > > > > > entities anyways. > > > > > > > > > > > > The ownership here seems obvious: > > > > > > > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > > > > > are not allowed to out-live the scheduler. > > > > > > > > > > > > Surely, the current implementation to just take the resource away from the > > > > > > entity under the hood can work as well with appropriate locking, but that's a > > > > > > mess. > > > > > > > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > > > > > shared ownership, i.e. reference counting, is much less error prone. > > > > > > > > > > Yes, Xe solves all of this via reference counting (jobs refcount the > > > > > entity). It's a bit easier in Xe since the scheduler and entities are > > > > > the same object due to their 1:1 relationship. But even in non-1:1 > > > > > relationships, an entity could refcount the scheduler. The teardown > > > > > sequence would then be: all jobs complete on the entity → teardown the > > > > > entity → all entities torn down → teardown the scheduler. > > > > > > > > > > Matt > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
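
A rough sketch of the kill-list idea for non-1:1 setups described above. The kill_list and kill_work members and both function names are hypothetical additions, not existing drm_sched code; kill-list locking is elided, and per Matt's follow-up, jobs whose dependencies never resolved would first get run_job() called and then be moved to the kill list as well (only the pending-list half is shown):

    /* Move an entity's pending jobs onto a (hypothetical) kill list. */
    static void drm_sched_entity_kill_pending_jobs(struct drm_sched_entity *entity)
    {
            struct drm_gpu_scheduler *sched = entity->rq->sched;
            struct drm_sched_job *job, *tmp;

            spin_lock(&sched->job_list_lock);
            list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
                    if (job->entity != entity)
                            continue;
                    /* Hand the entity's pending jobs over to the kill worker. */
                    list_move_tail(&job->list, &sched->kill_list);
            }
            spin_unlock(&sched->job_list_lock);

            queue_work(sched->submit_wq, &sched->kill_work);
    }

    static void drm_sched_kill_work_fn(struct work_struct *w)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(w, struct drm_gpu_scheduler, kill_work);
            struct drm_sched_job *job;

            /* Artificially complete each killed job: timedout_job() signals the
             * job's fence, after which drm_sched_free_job_work() -- which would
             * also need to scan the kill list -- calls free_job() as usual.
             */
            list_for_each_entry(job, &sched->kill_list, list)
                    sched->ops->timedout_job(job);
    }

This keeps the full lifecycle intact for every submitted job: run_job is called, the fence is signaled (artificially via timedout_job if needed), and free_job runs last, so neither the entity nor the scheduler has to be finalized with jobs still in flight.
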
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-24 4:13 ` Matthew Brost @ 2025-07-24 4:17 ` Matthew Brost 0 siblings, 0 replies; 17+ messages in thread From: Matthew Brost @ 2025-07-24 4:17 UTC (permalink / raw) To: phasta Cc: Danilo Krummrich, James Flowers, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan, dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Wed, Jul 23, 2025 at 09:13:34PM -0700, Matthew Brost wrote: > On Wed, Jul 23, 2025 at 08:56:01AM +0200, Philipp Stanner wrote: > > On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote: > > > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote: > > > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote: > > > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote: > > > > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote: > > > > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote: > > > > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > index bfea608a7106..997a2cc1a635 100644 > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > > > > > > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > > > > > > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > > > - drm_sched_entity_compare_before); > > > > > > > > > > + if (!entity->stopped) { > > > > > > > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > > > > > > > + drm_sched_entity_compare_before); > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > If this is a race, then this patch here is broken, too, because you're > > > > > > > > > checking the 'stopped' boolean as the callers of that function do, too > > > > > > > > > – just later. :O > > > > > > > > > > > > > > > > > > Could still race, just less likely. > > > > > > > > > > > > > > > > > > The proper way to fix it would then be to address the issue where the > > > > > > > > > locking is supposed to happen. Let's look at, for example, > > > > > > > > > drm_sched_entity_push_job(): > > > > > > > > > > > > > > > > > > > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > > > > > > > { > > > > > > > > > (Bla bla bla) > > > > > > > > > > > > > > > > > > ………… > > > > > > > > > > > > > > > > > > /* first job wakes up scheduler */ > > > > > > > > > if (first) { > > > > > > > > > struct drm_gpu_scheduler *sched; > > > > > > > > > struct drm_sched_rq *rq; > > > > > > > > > > > > > > > > > > /* Add the entity to the run queue */ > > > > > > > > > spin_lock(&entity->lock); > > > > > > > > > if (entity->stopped) { <---- Aha! 
> > > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > > > > > > > return; > > > > > > > > > } > > > > > > > > > > > > > > > > > > rq = entity->rq; > > > > > > > > > sched = rq->sched; > > > > > > > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > > > drm_sched_rq_add_entity(rq, entity); > > > > > > > > > > > > > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > > > > > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > > > > > > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > > > > > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > > > > > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > > > > > > > > > > > > > On the other hand, aren't drivers prohibited from calling > > > > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > > > > > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > > > > > > > > > Exactly, this is the first question to ask. > > > > > > > > > > > > > > And I think it's even more restrictive: > > > > > > > > > > > > > > In drm_sched_fini() > > > > > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) { > > > > > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > list_for_each_entry(s_entity, &rq->entities, list) > > > > > > > /* > > > > > > > * Prevents reinsertion and marks job_queue as idle, > > > > > > > * it will be removed from the rq in drm_sched_entity_fini() > > > > > > > * eventually > > > > > > > */ > > > > > > > s_entity->stopped = true; > > > > > > > spin_unlock(&rq->lock); > > > > > > > kfree(sched->sched_rq[i]); > > > > > > > } > > > > > > > > > > > > > > In drm_sched_entity_kill() > > > > > > > > > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity) > > > > > > > { > > > > > > > struct drm_sched_job *job; > > > > > > > struct dma_fence *prev; > > > > > > > > > > > > > > if (!entity->rq) > > > > > > > return; > > > > > > > > > > > > > > spin_lock(&entity->lock); > > > > > > > entity->stopped = true; > > > > > > > drm_sched_rq_remove_entity(entity->rq, entity); > > > > > > > spin_unlock(&entity->lock); > > > > > > > > > > > > > > [...] > > > > > > > } > > > > > > > > > > > > > > If this runs concurrently, this is a UAF as well. > > > > > > > > > > > > > > Personally, I have always been working with the assupmtion that entites have to > > > > > > > be torn down *before* the scheduler, but those lifetimes are not documented > > > > > > > properly. > > > > > > > > > > > > Yes, this is my assumption too. I would even take it further: an entity > > > > > > shouldn't be torn down until all jobs associated with it are freed as > > > > > > well. I think this would solve a lot of issues I've seen on the list > > > > > > related to UAF, teardown, etc. > > > > > > > > > > That's kind of impossible with the new tear down design, because > > > > > drm_sched_fini() ensures that all jobs are freed on teardown. 
And > > > > > drm_sched_fini() wouldn't be called before all jobs are gone, > > > > > effectively resulting in a chicken-egg-problem, or rather: the driver > > > > > implementing its own solution for teardown. > > > > > > > > > > > > > I've read this four times and I'm still generally confused. > > > > > > > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, > > > > that's how a refcounting-based solution works. drm_sched_fini would > > > > never be called if there were pending jobs. > > > > > > > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See > > > > above. > > > > > > > > "effectively resulting in a chicken-and-egg problem" — A job is created > > > > after the scheduler, and it holds a reference to the scheduler until > > > > it's freed. I don't see how this idiom applies. > > > > > > > > "the driver implementing its own solution for teardown" — It’s just > > > > following the basic lifetime rules I outlined below. Perhaps Xe was > > > > ahead of its time, but the number of DRM scheduler blowups we've had is > > > > zero — maybe a strong indication that this design is correct. > > > > > > > > > > Sorry—self-reply. > > > > > > To expand on this: the reason Xe implemented a refcount-based teardown > > > solution is because the internals of the DRM scheduler during teardown > > > looked wildly scary. A lower layer should not impose its will on upper > > > layers. I think that’s the root cause of all the problems I've listed. > > > > > > In my opinion, we should document the lifetime rules I’ve outlined, fix > > > all drivers accordingly, and assert these rules in the scheduler layer. > > > > > > Everyone had a separate solution for that. Nouveau used a waitqueue. > > That's what happens when there's no centralized mechanism for solving a > > problem. > > > > Right, this is essentially my point — I think refcounting on the driver > side is what the long-term solution really needs to be. > > To recap the basic rules: > > - Entities should not be finalized or freed until all jobs associated > with them are freed. > - Schedulers should not be finalized or freed until all associated > entities are finalized. > - Jobs should hold a reference to the entity. > - Entities should hold a reference to the scheduler. > > I understand this won’t happen overnight — or perhaps ever — but > adopting this model would solve a lot of problems across the subsystem > and reduce a significant amount of complexity in the DRM scheduler. I’ll > also acknowledge that part of this is my fault — years ago, I worked > around problems (implemented above ref count model) in the scheduler > related to teardown rather than proposing a common, unified solution, > and clear lifetime rules. > > For drivers with a 1:1 entity-to-scheduler relationship, teardown > becomes fairly simple: set the TDR timeout to zero and naturally let the > remaining jobs flush out via TDR + the timedout_job callback, which > signals the job’s fence. Free job, is called after that. > > For non-1:1 setups, we could introduce something like > drm_sched_entity_kill, which would move all jobs on the pending list of > a given entity to a kill list. A worker could then process that kill > list — calling timedout_job and signaling the associated fences. > Similarly, any jobs that had unresolved dependencies could be > immediately added to the kill list. The kill list would have to be s/added to the kill list/added to the kill list after calling run_job/ Matt > checked in drm_sched_free_job_work too. 
> > This would ensure that all jobs submitted would go through the full > lifecycle: > > - run_job is called > - free_job is called > - If the fence returned from run_job needs to be artificially signaled, > timedout_job is called > > We can add the infrastructure for this and once all driver adhere this > model, clean up ugliness in the scheduler related to teardown and all > races here. > > > Did you see the series we recently merged which repairs the memory > > leaks of drm/sched? It had been around for quite some time. > > > > https://lore.kernel.org/dri-devel/20250701132142.76899-3-phasta@kernel.org/ > > > > I would say this is just hacking around the fundamental issues with the > lifetime of these objects. Do you see anything in Nouveau that would > prevent the approach I described above from working? > > Also, what if jobs have dependencies that aren't even on the pending > list yet? This further illustrates the problems with trying to finalize > objects while child objects (entities, job) are still around. > > Matt > > > > > P. > > > > > > > > Matt > > > > > > > Matt > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There are two solutions: > > > > > > > > > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), > > > > > > > i.e. stick to the natural ownership and lifetime rules here (see below). > > > > > > > > > > > > > > (2) Actually protect *any* changes of the relevent fields of the entity > > > > > > > structure with the entity lock. > > > > > > > > > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, > > > > > > > as you note below as well. And I think drm_sched_fini() should not mess with > > > > > > > entities anyways. > > > > > > > > > > > > > > The ownership here seems obvious: > > > > > > > > > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities > > > > > > > are not allowed to out-live the scheduler. > > > > > > > > > > > > > > Surely, the current implementation to just take the resource away from the > > > > > > > entity under the hood can work as well with appropriate locking, but that's a > > > > > > > mess. > > > > > > > > > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), > > > > > > > shared ownership, i.e. reference counting, is much less error prone. > > > > > > > > > > > > Yes, Xe solves all of this via reference counting (jobs refcount the > > > > > > entity). It's a bit easier in Xe since the scheduler and entities are > > > > > > the same object due to their 1:1 relationship. But even in non-1:1 > > > > > > relationships, an entity could refcount the scheduler. The teardown > > > > > > sequence would then be: all jobs complete on the entity → teardown the > > > > > > entity → all entities torn down → teardown the scheduler. > > > > > > > > > > > > Matt > > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 8:16 ` Philipp Stanner 2025-07-21 10:14 ` Danilo Krummrich @ 2025-07-22 20:05 ` James 2025-07-23 14:41 ` Philipp Stanner 1 sibling, 1 reply; 17+ messages in thread From: James @ 2025-07-22 20:05 UTC (permalink / raw) To: phasta, matthew.brost, dakr, Christian König, maarten.lankhorst, mripard, tzimmermann, airlied, simona, Shuah Khan Cc: dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin On Mon, Jul 21, 2025, at 1:16 AM, Philipp Stanner wrote: > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: >> +Cc Tvrtko, who's currently reworking FIFO and RR. >> >> On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: >> > Fixes an issue where entities are added to the run queue in >> > drm_sched_rq_update_fifo_locked after being killed, causing a >> > slab-use-after-free error. >> > >> > Signed-off-by: James Flowers <bold.zone2373@fastmail.com> >> > --- >> > This issue was detected by syzkaller running on a Steam Deck OLED. >> > Unfortunately I don't have a reproducer for it. I've >> >> Well, now that's kind of an issue – if you don't have a reproducer, how >> can you know that your patch is correct? How can we? >> >> It would certainly be good to know what the fuzz testing framework >> does. >> >> > included the KASAN reports below: >> >> >> Anyways, KASAN reports look interesting. But those might be many >> different issues. Again, would be good to know what the fuzzer has been >> testing. Can you maybe split this fuzz test into sub-tests? I suspsect >> those might be different faults. >> >> >> Anyways, taking a first look… >> >> >> > >> > ================================================================== >> > BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 >> > Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 >> > CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0-flowejam-+ #1 >> > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> > Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] >> > Call Trace: >> > <TASK> >> > __dump_stack lib/dump_stack.c:94 [inline] >> > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> > print_report+0xfc/0x1ff mm/kasan/report.c:521 >> > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> > rb_next+0xda/0x160 lib/rbtree.c:505 >> > drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/sched_main.c:332 [inline] [gpu_sched] >> > drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/sched_main.c:1081 [gpu_sched] >> > drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/sched_main.c:1206 [gpu_sched] >> > process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 >> > process_scheduled_works kernel/workqueue.c:3319 [inline] >> > worker_thread+0x734/0x1060 kernel/workqueue.c:3400 >> > kthread+0x3fd/0x810 kernel/kthread.c:464 >> > ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 >> > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 >> > </TASK> >> > Allocated by task 73472: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 
[amdgpu] >> > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> > do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> > vfs_open+0x87/0x3f0 fs/open.c:1086 >> > do_open+0x72f/0xf80 fs/namei.c:3830 >> > path_openat+0x2ec/0x770 fs/namei.c:3989 >> > do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> > do_sys_open fs/open.c:1443 [inline] >> > __do_sys_openat fs/open.c:1459 [inline] >> > __se_sys_openat fs/open.c:1454 [inline] >> > __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> > do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > Freed by task 73472: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> > poison_slab_object mm/kasan/common.c:247 [inline] >> > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> > kasan_slab_free include/linux/kasan.h:233 [inline] >> > slab_free_hook mm/slub.c:2353 [inline] >> > slab_free mm/slub.c:4609 [inline] >> > kfree+0x14f/0x4d0 mm/slub.c:4757 >> > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > The buggy address belongs to the object at ffff888180508000 >> > The buggy address is located 1504 bytes inside of >> > The buggy address belongs to the physical page: >> > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180508 >> > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> > page_type: f5(slab) >> > raw: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 >> > raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 >> > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000003 ffffea0006014201 ffffffffffffffff 0000000000000000 >> > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> > page dumped because: kasan: bad access detected >> > Memory state around the buggy address: >> > ffff888180508480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff888180508500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > > ffff888180508580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 
>> > ^ >> > ffff888180508600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff888180508680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ================================================================== >> > ================================================================== >> > BUG: KASAN: slab-use-after-free in rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] >> > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] >> > BUG: KASAN: slab-use-after-free in rb_erase+0x157c/0x1b10 lib/rbtree.c:443 >> > Write of size 8 at addr ffff88816414c5d0 by task syz.2.3004/12376 >> > CPU: 7 UID: 65534 PID: 12376 Comm: syz.2.3004 Not tainted 6.14.0-flowejam-+ #1 >> > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> > Call Trace: >> > <TASK> >> > __dump_stack lib/dump_stack.c:94 [inline] >> > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> > print_report+0xfc/0x1ff mm/kasan/report.c:521 >> > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> > rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] >> > __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] >> > rb_erase+0x157c/0x1b10 lib/rbtree.c:443 >> > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] >> > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] >> > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] >> > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] >> > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] >> > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] >> > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] >> > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] >> > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > exit_task_work include/linux/task_work.h:40 [inline] >> > do_exit+0x841/0xf60 kernel/exit.c:938 >> > do_group_exit+0xda/0x2b0 kernel/exit.c:1087 >> > get_signal+0x171f/0x19d0 kernel/signal.c:3036 >> > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > RIP: 0033:0x7f2d90da36ed >> > Code: Unable to access opcode bytes at 0x7f2d90da36c3. 
>> > RSP: 002b:00007f2d91b710d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca >> > RAX: 0000000000000000 RBX: 00007f2d90fe6088 RCX: 00007f2d90da36ed >> > RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f2d90fe6088 >> > RBP: 00007f2d90fe6080 R08: 0000000000000000 R09: 0000000000000000 >> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d90fe608c >> > R13: 0000000000000000 R14: 0000000000000002 R15: 00007ffc34a67bd0 >> > </TASK> >> > Allocated by task 12381: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> > do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> > vfs_open+0x87/0x3f0 fs/open.c:1086 >> > do_open+0x72f/0xf80 fs/namei.c:3830 >> > path_openat+0x2ec/0x770 fs/namei.c:3989 >> > do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> > do_sys_open fs/open.c:1443 [inline] >> > __do_sys_openat fs/open.c:1459 [inline] >> > __se_sys_openat fs/open.c:1454 [inline] >> > __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> > do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > Freed by task 12381: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> > poison_slab_object mm/kasan/common.c:247 [inline] >> > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> > kasan_slab_free include/linux/kasan.h:233 [inline] >> > slab_free_hook mm/slub.c:2353 [inline] >> > slab_free mm/slub.c:4609 [inline] >> > kfree+0x14f/0x4d0 mm/slub.c:4757 >> > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > The buggy address belongs to the object at ffff88816414c000 >> > The buggy address is located 1488 bytes inside of >> > The buggy address belongs to the physical page: >> > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x164148 >> > head: order:3 mapcount:0 entire_mapcount:0 
nr_pages_mapped:0 pincount:0 >> > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> > page_type: f5(slab) >> > raw: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 >> > raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 >> > head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000003 ffffea0005905201 ffffffffffffffff 0000000000000000 >> > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> > page dumped because: kasan: bad access detected >> > Memory state around the buggy address: >> > ffff88816414c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff88816414c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > > ffff88816414c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ^ >> > ffff88816414c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff88816414c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ================================================================== >> > ================================================================== >> > BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] >> > BUG: KASAN: slab-use-after-free in rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 >> > Read of size 8 at addr ffff88812ebcc5e0 by task syz.1.814/6553 >> > CPU: 0 UID: 65534 PID: 6553 Comm: syz.1.814 Not tainted 6.14.0-flowejam-+ #1 >> > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> > Call Trace: >> > <TASK> >> > __dump_stack lib/dump_stack.c:94 [inline] >> > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> > print_report+0xfc/0x1ff mm/kasan/report.c:521 >> > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> > __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] >> > rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 >> > rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] >> > drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] >> > drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] >> > drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] >> > drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] >> > drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] >> > amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] >> > amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] >> > amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] >> > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x26b/0x290 
kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > RIP: 0033:0x7fd23eba36ed >> > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 >> > RSP: 002b:00007ffc2943a358 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 >> > RAX: 0000000000000000 RBX: 00007ffc2943a428 RCX: 00007fd23eba36ed >> > RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 >> > RBP: 00007fd23ede7ba0 R08: 0000000000000001 R09: 0000000c00000000 >> > R10: 00007fd23ea00000 R11: 0000000000000246 R12: 00007fd23ede5fac >> > R13: 00007fd23ede5fa0 R14: 0000000000059ad1 R15: 0000000000059a8e >> > </TASK> >> > Allocated by task 6559: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> > do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> > vfs_open+0x87/0x3f0 fs/open.c:1086 >> > do_open+0x72f/0xf80 fs/namei.c:3830 >> > path_openat+0x2ec/0x770 fs/namei.c:3989 >> > do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> > do_sys_open fs/open.c:1443 [inline] >> > __do_sys_openat fs/open.c:1459 [inline] >> > __se_sys_openat fs/open.c:1454 [inline] >> > __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> > do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > Freed by task 6559: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> > poison_slab_object mm/kasan/common.c:247 [inline] >> > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> > kasan_slab_free include/linux/kasan.h:233 [inline] >> > slab_free_hook mm/slub.c:2353 [inline] >> > slab_free mm/slub.c:4609 [inline] >> > kfree+0x14f/0x4d0 mm/slub.c:4757 >> > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> > arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> > exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 
arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > The buggy address belongs to the object at ffff88812ebcc000 >> > The buggy address is located 1504 bytes inside of >> > The buggy address belongs to the physical page: >> > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12ebc8 >> > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> > page_type: f5(slab) >> > raw: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 >> > raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 >> > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000003 ffffea0004baf201 ffffffffffffffff 0000000000000000 >> > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> > page dumped because: kasan: bad access detected >> > Memory state around the buggy address: >> > ffff88812ebcc480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff88812ebcc500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > > ffff88812ebcc580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ^ >> > ffff88812ebcc600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff88812ebcc680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ================================================================== >> > ================================================================== >> > BUG: KASAN: slab-use-after-free in drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] >> > BUG: KASAN: slab-use-after-free in rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] >> > BUG: KASAN: slab-use-after-free in drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] >> > Read of size 8 at addr ffff8881208445c8 by task syz.1.49115/146644 >> > CPU: 7 UID: 65534 PID: 146644 Comm: syz.1.49115 Not tainted 6.14.0-flowejam-+ #1 >> > Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> > Call Trace: >> > <TASK> >> > __dump_stack lib/dump_stack.c:94 [inline] >> > dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> > print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> > print_report+0xfc/0x1ff mm/kasan/report.c:521 >> > kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> > drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] >> > rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] >> > drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] >> > drm_sched_entity_push_job+0x509/0x5d0 drivers/gpu/drm/scheduler/sched_entity.c:623 [gpu_sched] >> >> This might be a race between entity killing and the push_job. 
Let's >> look at your patch below… >> >> > amdgpu_job_submit+0x1a4/0x270 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:314 [amdgpu] >> > amdgpu_vm_sdma_commit+0x1f9/0x7d0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c:122 [amdgpu] >> > amdgpu_vm_pt_clear+0x540/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:422 [amdgpu] >> > amdgpu_vm_init+0x9c2/0x12f0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2609 [amdgpu] >> > amdgpu_driver_open_kms+0x274/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1418 [amdgpu] >> > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> > do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> > vfs_open+0x87/0x3f0 fs/open.c:1086 >> > do_open+0x72f/0xf80 fs/namei.c:3830 >> > path_openat+0x2ec/0x770 fs/namei.c:3989 >> > do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> > do_sys_open fs/open.c:1443 [inline] >> > __do_sys_openat fs/open.c:1459 [inline] >> > __se_sys_openat fs/open.c:1454 [inline] >> > __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> > do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > RIP: 0033:0x7feb303a36ed >> > Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 >> > RSP: 002b:00007feb3123c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 >> > RAX: ffffffffffffffda RBX: 00007feb305e5fa0 RCX: 00007feb303a36ed >> > RDX: 0000000000000002 RSI: 0000200000000140 RDI: ffffffffffffff9c >> > RBP: 00007feb30447722 R08: 0000000000000000 R09: 0000000000000000 >> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 >> > R13: 0000000000000001 R14: 00007feb305e5fa0 R15: 00007ffcfd0a3460 >> > </TASK> >> > Allocated by task 146638: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> > __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> > kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> > kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> > amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> > drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> > drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> > drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> > drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> > chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> > do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> > vfs_open+0x87/0x3f0 fs/open.c:1086 >> > do_open+0x72f/0xf80 fs/namei.c:3830 >> > path_openat+0x2ec/0x770 fs/namei.c:3989 >> > do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> > do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> > do_sys_open fs/open.c:1443 [inline] >> > __do_sys_openat fs/open.c:1459 [inline] >> > __se_sys_openat fs/open.c:1454 [inline] >> > __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> > do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> > do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > Freed by task 146638: >> > kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> > kasan_save_track+0x14/0x30 
mm/kasan/common.c:68 >> > kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> > poison_slab_object mm/kasan/common.c:247 [inline] >> > __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> > kasan_slab_free include/linux/kasan.h:233 [inline] >> > slab_free_hook mm/slub.c:2353 [inline] >> > slab_free mm/slub.c:4609 [inline] >> > kfree+0x14f/0x4d0 mm/slub.c:4757 >> > amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> > drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> > drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> > drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> > drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> > __fput+0x402/0xb50 fs/file_table.c:464 >> > task_work_run+0x155/0x250 kernel/task_work.c:227 >> > resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] >> > exit_to_user_mode_loop kernel/entry/common.c:114 [inline] >> > exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> > __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> > syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 >> > do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> > entry_SYSCALL_64_after_hwframe+0x76/0x7e >> > The buggy address belongs to the object at ffff888120844000 >> > The buggy address is located 1480 bytes inside of >> > The buggy address belongs to the physical page: >> > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120840 >> > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> > flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> > page_type: f5(slab) >> > raw: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 >> > raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 >> > head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> > head: 0017ffffc0000003 ffffea0004821001 ffffffffffffffff 0000000000000000 >> > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> > page dumped because: kasan: bad access detected >> > Memory state around the buggy address: >> > ffff888120844480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff888120844500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > > ffff888120844580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ^ >> > ffff888120844600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ffff888120844680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> > ================================================================== >> > >> > drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- >> > 1 file changed, 4 insertions(+), 2 deletions(-) >> > >> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> > index bfea608a7106..997a2cc1a635 100644 >> > --- a/drivers/gpu/drm/scheduler/sched_main.c >> > +++ b/drivers/gpu/drm/scheduler/sched_main.c >> > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, >> > >> > entity->oldest_job_waiting = ts; >> > >> > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> > - drm_sched_entity_compare_before); >> > + if (!entity->stopped) { >> > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> > + drm_sched_entity_compare_before); >> > + } >> >> If this is a race, then this patch here is broken, too, because you're >> checking the 
'stopped' boolean as the callers of that function do, too >> – just later. :O >> >> Could still race, just less likely. >> >> The proper way to fix it would then be to address the issue where the >> locking is supposed to happen. Let's look at, for example, >> drm_sched_entity_push_job(): >> >> >> void drm_sched_entity_push_job(struct drm_sched_job *sched_job) >> { >> (Bla bla bla) >> >> ………… >> >> /* first job wakes up scheduler */ >> if (first) { >> struct drm_gpu_scheduler *sched; >> struct drm_sched_rq *rq; >> >> /* Add the entity to the run queue */ >> spin_lock(&entity->lock); >> if (entity->stopped) { <---- Aha! >> spin_unlock(&entity->lock); >> >> DRM_ERROR("Trying to push to a killed entity\n"); >> return; >> } >> >> rq = entity->rq; >> sched = rq->sched; >> >> spin_lock(&rq->lock); >> drm_sched_rq_add_entity(rq, entity); >> >> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) >> drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! >> >> spin_unlock(&rq->lock); >> spin_unlock(&entity->lock); >> >> But the locks are still being hold. So that "shouldn't be happening"(tm). >> >> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() >> stop entities. The former holds appropriate locks, but drm_sched_fini() >> doesn't. So that looks like a hot candidate to me. Opinions? >> >> On the other hand, aren't drivers prohibited from calling >> drm_sched_entity_push_job() after calling drm_sched_fini()? If the >> fuzzer does that, then it's not the scheduler's fault. >> >> Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? > > Ah no, forget about that. > > In drm_sched_fini(), you'd have to take the locks in reverse order as > in drm_sched_entity_push/pop_job(), thereby replacing race with > deadlock. > > I suspect that this is an issue in amdgpu. But let's wait for > Christian. > > > P. > > >> >> Would be cool if Tvrtko and Christian take a look. Maybe we even have a >> fundamental design issue. >> >> >> Regards >> P. >> >> >> > } >> > >> > /** >> Thanks for taking a look at this. I did try to get a reproducer using syzkaller, without success. I can attempt it myself but I expect it will take me some time, if I'm able to at all with this bug. I did run some of the igt-gpu-tools tests (amdgpu and drm ones), and there was no difference after the changes on my system. After this change I wasn't running into the UAF errors after 100k+ executions but I see what you mean, Philipp - perhaps it's missing the root issue. FYI, as an experiment I forced the use of RR with "drm_sched_policy = DRM_SCHED_POLICY_RR", and I'm not seeing any slab-use-after-frees, so maybe the problem is with the FIFO implementation? For now, the closest thing to a reproducer I can provide is my syzkaller config, in case anyone else is able to try this with a Steam Deck OLED. I've included this below along with an example program run by syzkaller (in generated C code and a Syz language version). 
--------------------------------------------------- { "target": "linux/amd64", "http": "127.0.0.1:56741", "sshkey" : "/path", "workdir": "/path", "kernel_obj": "/path", "kernel_src": "/path", "syzkaller": "/path", "sandbox": "setuid", "type": "isolated", "enable_syscalls": ["openat$drirender128", "ioctl$DRM_*", "close"], "disable_syscalls": ["ioctl$DRM_IOCTL_SYNCOBJ_*"], "reproduce": false, "vm": { "targets" : [ "10.0.0.1" ], "pstore": false, "target_dir" : "/path", "target_reboot" : true } } --------------------------------------------------- Generated C program: // autogenerated by syzkaller (https://github.com/google/syzkaller) #define _GNU_SOURCE #include <endian.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> uint64_t r[15] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0x0, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}; int main(void) { syscall(__NR_mmap, /*addr=*/0x1ffffffff000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/(intptr_t)-1, /*offset=*/0ul); syscall(__NR_mmap, /*addr=*/0x200000000000ul, /*len=*/0x1000000ul, /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/7ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/(intptr_t)-1, /*offset=*/0ul); syscall(__NR_mmap, /*addr=*/0x200001000000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/(intptr_t)-1, /*offset=*/0ul); const char* reason; (void)reason; intptr_t res = 0; if (write(1, "executing program\n", sizeof("executing program\n") - 1)) {} memcpy((void*)0x200000000200, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000200ul, /*flags=O_SYNC*/0x101000, /*mode=*/0); if (res != -1) r[0] = res; *(uint32_t*)0x200000002440 = 0; *(uint32_t*)0x200000002444 = 0x80000; syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc00c642d, /*arg=*/0x200000002440ul); memcpy((void*)0x2000000001c0, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x2000000001c0ul, /*flags=O_NOFOLLOW|O_CREAT|O_CLOEXEC*/0xa0040, /*mode=*/0); if (res != -1) r[1] = res; syscall(__NR_ioctl, /*fd=*/r[1], /*cmd=*/0x4b47, /*arg=*/0ul); memcpy((void*)0x200000000300, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000300ul, /*flags=O_SYNC|O_NONBLOCK|O_LARGEFILE*/0x109800, /*mode=*/0); if (res != -1) r[2] = res; memcpy((void*)0x200000000100, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000100ul, /*flags=*/0, /*mode=*/0); if (res != -1) r[3] = res; syscall(__NR_ioctl, /*fd=*/r[3], /*cmd=*/0x80f86406, /*arg=*/0ul); memcpy((void*)0x200000000440, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000440ul, /*flags=O_NONBLOCK|O_CREAT*/0x840, /*mode=*/0); if (res != -1) r[4] = res; *(uint64_t*)0x200000000fc0 = 0; *(uint32_t*)0x200000000fc8 = 0; *(uint32_t*)0x200000000fcc = 0; res = syscall(__NR_ioctl, /*fd=*/r[4], /*cmd=*/0xc06864a1, /*arg=*/0x200000000fc0ul); if (res != -1) r[5] = *(uint32_t*)0x200000000fd0; syscall(__NR_close, /*fd=*/r[2]); *(uint32_t*)0x200000000000 = 0; *(uint32_t*)0x200000000004 = 0; syscall(__NR_ioctl, /*fd=*/r[4], /*cmd=*/0xc00c642d, /*arg=*/0x200000000000ul); 
*(uint32_t*)0x200000000040 = 0; *(uint32_t*)0x200000000044 = 0; res = syscall(__NR_ioctl, /*fd=*/r[3], /*cmd=*/0xc00c642d, /*arg=*/0x200000000040ul); if (res != -1) r[6] = *(uint32_t*)0x200000000048; *(uint32_t*)0x200000000080 = 0; res = syscall(__NR_ioctl, /*fd=*/r[2], /*cmd=*/0xc010640b, /*arg=*/0x200000000080ul); if (res != -1) r[7] = *(uint32_t*)0x200000000084; *(uint32_t*)0x2000000000c0 = 0; res = syscall(__NR_ioctl, /*fd=*/r[4], /*cmd=*/0xc010640b, /*arg=*/0x2000000000c0ul); if (res != -1) r[8] = *(uint32_t*)0x2000000000c4; memcpy((void*)0x2000000001c0, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x2000000001c0ul, /*flags=O_NOFOLLOW|O_CREAT|O_CLOEXEC*/0xa0040, /*mode=*/0); if (res != -1) r[9] = res; syscall(__NR_ioctl, /*fd=*/r[9], /*cmd=*/0x5421, /*arg=*/0ul); *(uint32_t*)0x200000000180 = r[5]; *(uint32_t*)0x200000000184 = 0x2534dd8; *(uint32_t*)0x200000000188 = 7; *(uint32_t*)0x20000000018c = 3; *(uint32_t*)0x200000000190 = 2; *(uint32_t*)0x2000000001a4 = 0x97; *(uint32_t*)0x2000000001a8 = 0x74c83423; *(uint32_t*)0x2000000001ac = 4; *(uint32_t*)0x2000000001b0 = 8; *(uint32_t*)0x2000000001b4 = 6; *(uint32_t*)0x2000000001b8 = 0x7f; *(uint32_t*)0x2000000001bc = 0; *(uint32_t*)0x2000000001c0 = 9; *(uint64_t*)0x2000000001c8 = 3; *(uint64_t*)0x2000000001d0 = 0; *(uint64_t*)0x2000000001d8 = 1; *(uint64_t*)0x2000000001e0 = 1; res = syscall(__NR_ioctl, /*fd=*/r[3], /*cmd=*/0xc06864ce, /*arg=*/0x200000000180ul); if (res != -1) r[10] = *(uint32_t*)0x200000000198; *(uint32_t*)0x200000000200 = r[5]; *(uint32_t*)0x200000000204 = 1; *(uint32_t*)0x200000000208 = 1; *(uint32_t*)0x20000000020c = 0; *(uint32_t*)0x200000000210 = 1; *(uint32_t*)0x200000000214 = r[7]; *(uint32_t*)0x200000000218 = r[8]; *(uint32_t*)0x20000000021c = 0; *(uint32_t*)0x200000000220 = r[10]; *(uint32_t*)0x200000000224 = 9; *(uint32_t*)0x200000000228 = 7; *(uint32_t*)0x20000000022c = 2; *(uint32_t*)0x200000000230 = 2; *(uint32_t*)0x200000000234 = 0x400; *(uint32_t*)0x200000000238 = 0x367; *(uint32_t*)0x20000000023c = 7; *(uint32_t*)0x200000000240 = 8; *(uint64_t*)0x200000000248 = 0x3e; *(uint64_t*)0x200000000250 = 3; *(uint64_t*)0x200000000258 = 9; *(uint64_t*)0x200000000260 = 6; syscall(__NR_ioctl, /*fd=*/r[6], /*cmd=*/0xc06864b8, /*arg=*/0x200000000200ul); res = syscall(__NR_ioctl, /*fd=*/r[6], /*cmd=*/0xc0086420, /*arg=*/0x200000000140ul); if (res != -1) r[11] = *(uint32_t*)0x200000000140; *(uint32_t*)0x200000000280 = r[11]; *(uint32_t*)0x200000000284 = 0x26; syscall(__NR_ioctl, /*fd=*/r[3], /*cmd=*/0x4008642a, /*arg=*/0x200000000280ul); memcpy((void*)0x200000000300, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000300ul, /*flags=O_SYNC|O_NONBLOCK|O_LARGEFILE*/0x109800, /*mode=*/0); if (res != -1) r[12] = res; memcpy((void*)0x200000000100, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000100ul, /*flags=*/0, /*mode=*/0); if (res != -1) r[13] = res; syscall(__NR_ioctl, /*fd=*/r[13], /*cmd=*/0x80f86406, /*arg=*/0ul); memcpy((void*)0x200000000440, "/dev/dri/renderD128\000", 20); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x200000000440ul, /*flags=O_NONBLOCK|O_CREAT*/0x840, /*mode=*/0); if (res != -1) r[14] = res; *(uint64_t*)0x200000000fc0 = 0; *(uint32_t*)0x200000000fc8 = 0; *(uint32_t*)0x200000000fcc = 0; syscall(__NR_ioctl, /*fd=*/r[14], /*cmd=*/0xc06864a1, /*arg=*/0x200000000fc0ul); syscall(__NR_close, /*fd=*/r[12]); 
*(uint32_t*)0x200000000000 = 0; *(uint32_t*)0x200000000004 = 0; syscall(__NR_ioctl, /*fd=*/r[14], /*cmd=*/0xc00c642d, /*arg=*/0x200000000000ul); *(uint32_t*)0x200000000040 = 0; *(uint32_t*)0x200000000044 = 0; syscall(__NR_ioctl, /*fd=*/r[13], /*cmd=*/0xc00c642d, /*arg=*/0x200000000040ul); *(uint32_t*)0x200000000080 = 0; syscall(__NR_ioctl, /*fd=*/r[12], /*cmd=*/0xc010640b, /*arg=*/0x200000000080ul); *(uint32_t*)0x2000000000c0 = 0; syscall(__NR_ioctl, /*fd=*/r[14], /*cmd=*/0xc010640b, /*arg=*/0x2000000000c0ul); return 0; } --------------------------------------------------- Syzkaller program (Syz language): r0 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000200), 0x101000, 0x0) ioctl$DRM_IOCTL_PRIME_HANDLE_TO_FD(r0, 0xc00c642d, &(0x7f0000002440)={0x0, 0x80000}) r1 = openat$drirender128(0xffffffffffffff9c, &(0x7f00000001c0), 0xa0040, 0x0) ioctl$DRM_IOCTL_RES_CTX(r1, 0x4b47, 0x0) r2 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000300), 0x109800, 0x0) r3 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000100), 0x0, 0x0) ioctl$DRM_IOCTL_GET_STATS(r3, 0x80f86406, 0x0) r4 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000440), 0x840, 0x0) ioctl$DRM_IOCTL_MODE_GETCRTC(r4, 0xc06864a1, &(0x7f0000000fc0)={0x0, 0x0, 0x0, <r5=>0x0}) close(r2) ioctl$DRM_IOCTL_PRIME_HANDLE_TO_FD(r4, 0xc00c642d, &(0x7f0000000000)) ioctl$DRM_IOCTL_PRIME_HANDLE_TO_FD(r3, 0xc00c642d, &(0x7f0000000040)={0x0, 0x0, <r6=>0xffffffffffffffff}) ioctl$DRM_IOCTL_GEM_OPEN(r2, 0xc010640b, &(0x7f0000000080)={0x0, <r7=>0x0}) ioctl$DRM_IOCTL_GEM_OPEN(r4, 0xc010640b, &(0x7f00000000c0)={0x0, <r8=>0x0}) r9 = openat$drirender128(0xffffffffffffff9c, &(0x7f00000001c0), 0xa0040, 0x0) ioctl$DRM_IOCTL_RES_CTX(r9, 0x5421, 0x0) ioctl$DRM_IOCTL_MODE_GETFB2(r3, 0xc06864ce, &(0x7f0000000180)={r5, 0x2534dd8, 0x7, 0x3, 0x2, [0x0, <r10=>0x0], [0x97, 0x74c83423, 0x4, 0x8], [0x6, 0x7f, 0x0, 0x9], [0x3, 0x0, 0x1, 0x1]}) ioctl$DRM_IOCTL_MODE_ADDFB2(r6, 0xc06864b8, &(0x7f0000000200)={r5, 0x1, 0x1, 0x0, 0x1, [r7, r8, 0x0, r10], [0x9, 0x7, 0x2, 0x2], [0x400, 0x367, 0x7, 0x8], [0x3e, 0x3, 0x9, 0x6]}) ioctl$DRM_IOCTL_ADD_CTX(r6, 0xc0086420, &(0x7f0000000140)={<r11=>0x0}) ioctl$DRM_IOCTL_LOCK(r3, 0x4008642a, &(0x7f0000000280)={r11, 0x26}) r12 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000300), 0x109800, 0x0) r13 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000100), 0x0, 0x0) ioctl$DRM_IOCTL_GET_STATS(r13, 0x80f86406, 0x0) r14 = openat$drirender128(0xffffffffffffff9c, &(0x7f0000000440), 0x840, 0x0) ioctl$DRM_IOCTL_MODE_GETCRTC(r14, 0xc06864a1, &(0x7f0000000fc0)={0x0}) close(r12) ioctl$DRM_IOCTL_PRIME_HANDLE_TO_FD(r14, 0xc00c642d, &(0x7f0000000000)) ioctl$DRM_IOCTL_PRIME_HANDLE_TO_FD(r13, 0xc00c642d, &(0x7f0000000040)) ioctl$DRM_IOCTL_GEM_OPEN(r12, 0xc010640b, &(0x7f0000000080)) ioctl$DRM_IOCTL_GEM_OPEN(r14, 0xc010640b, &(0x7f00000000c0)) ^ permalink raw reply [flat|nested] 17+ messages in thread
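For readers skimming the thread, the four KASAN reports quoted above share one shape. The summary below is an editor's sketch of that shape written as comments; it hypothesises the interleaving rather than stating a confirmed sequence, and it only uses functions and objects that already appear in the traces:

	/*
	 * Common shape of the KASAN reports in this thread (a hypothesis drawn
	 * from the traces above, not a confirmed sequence):
	 *
	 *   open()  -> amdgpu_driver_open_kms(): kzalloc() of the per-file fpriv;
	 *              the scheduler entities used for VM updates live inside it.
	 *   close() -> amdgpu_driver_postclose_kms(): kfree() of that fpriv.
	 *
	 *   Concurrently, a run queue's FIFO rb-tree still reaches an
	 *   rb_tree_node embedded in the freed fpriv, so rb_next() (run-job
	 *   worker), rb_erase() (entity kill) and rb_add_cached() (push_job)
	 *   can dereference freed memory - the three access sites flagged above.
	 */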
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-22 20:05 ` James @ 2025-07-23 14:41 ` Philipp Stanner 0 siblings, 0 replies; 17+ messages in thread From: Philipp Stanner @ 2025-07-23 14:41 UTC (permalink / raw) To: James, phasta, matthew.brost, dakr, Christian König, maarten.lankhorst, mripard, tzimmermann, airlied, simona, Shuah Khan Cc: dri-devel, linux-kernel, linux-kernel-mentees, Tvrtko Ursulin Hello, On Tue, 2025-07-22 at 13:05 -0700, James wrote: > On Mon, Jul 21, 2025, at 1:16 AM, Philipp Stanner wrote: > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote: > > > +Cc Tvrtko, who's currently reworking FIFO and RR. > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > Fixes an issue where entities are added to the run queue in > > > > drm_sched_rq_update_fifo_locked after being killed, causing a > > > > slab-use-after-free error. > > > > > > > > Signed-off-by: James Flowers <bold.zone2373@fastmail.com> > > > > --- > > > > This issue was detected by syzkaller running on a Steam Deck OLED. > > > > Unfortunately I don't have a reproducer for it. I've > > > > > > Well, now that's kind of an issue – if you don't have a reproducer, how > > > can you know that your patch is correct? How can we? > > > > > > It would certainly be good to know what the fuzz testing framework > > > does. > > > > > > > included the KASAN reports below: > > > > > > > > > Anyways, KASAN reports look interesting. But those might be many > > > different issues. Again, would be good to know what the fuzzer has been > > > testing. Can you maybe split this fuzz test into sub-tests? I suspsect > > > those might be different faults. > > > > > > > > > Anyways, taking a first look… > > > > > > > > > > > > > > ================================================================== > > > > BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 > > > > Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 [SNIP] > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > index bfea608a7106..997a2cc1a635 100644 > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, > > > > > > > > entity->oldest_job_waiting = ts; > > > > > > > > - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > - drm_sched_entity_compare_before); > > > > + if (!entity->stopped) { > > > > + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, > > > > + drm_sched_entity_compare_before); > > > > + } > > > > > > If this is a race, then this patch here is broken, too, because you're > > > checking the 'stopped' boolean as the callers of that function do, too > > > – just later. :O > > > > > > Could still race, just less likely. > > > > > > The proper way to fix it would then be to address the issue where the > > > locking is supposed to happen. 
Let's look at, for example, > > > drm_sched_entity_push_job(): > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > { > > > (Bla bla bla) > > > > > > ………… > > > > > > /* first job wakes up scheduler */ > > > if (first) { > > > struct drm_gpu_scheduler *sched; > > > struct drm_sched_rq *rq; > > > > > > /* Add the entity to the run queue */ > > > spin_lock(&entity->lock); > > > if (entity->stopped) { <---- Aha! > > > spin_unlock(&entity->lock); > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > return; > > > } > > > > > > rq = entity->rq; > > > sched = rq->sched; > > > > > > spin_lock(&rq->lock); > > > drm_sched_rq_add_entity(rq, entity); > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); <---- bumm! > > > > > > spin_unlock(&rq->lock); > > > spin_unlock(&entity->lock); > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > doesn't. So that looks like a hot candidate to me. Opinions? > > > > > > On the other hand, aren't drivers prohibited from calling > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? > > > > Ah no, forget about that. > > > > In drm_sched_fini(), you'd have to take the locks in reverse order as > > in drm_sched_entity_push/pop_job(), thereby replacing race with > > deadlock. > > > > I suspect that this is an issue in amdgpu. But let's wait for > > Christian. > > > > > > P. > > > > > > > > > > Would be cool if Tvrtko and Christian take a look. Maybe we even have a > > > fundamental design issue. > > > > > > > > > Regards > > > P. > > > > > > > > > > } > > > > > > > > /** > > > > > Thanks for taking a look at this. I did try to get a reproducer using syzkaller, without success. I can attempt it myself but I expect it will take me some time, if I'm able to at all with this bug. I did run some of the igt-gpu-tools tests (amdgpu and drm ones), and there was no difference after the changes on my system. After this change I wasn't running into the UAF errors after 100k+ executions but I see what you mean, Philipp - perhaps it's missing the root issue. > > FYI, as an experiment I forced the use of RR with "drm_sched_policy = DRM_SCHED_POLICY_RR", and I'm not seeing any slab-use-after-frees, so maybe the problem is with the FIFO implementation? I can't imagine that. The issue your encountering is most likely a race caused by the driver tearing down entities after the scheduler, so different scheduler runtime behavior might hide ("fix") the race (that's the nature of races, actually: sometimes they're there, sometimes not). RR running with different time patterns than FIFO doesn't mean that FIFO has a bug. > > For now, the closest thing to a reproducer I can provide is my syzkaller config, in case anyone else is able to try this with a Steam Deck OLED. I've included this below along with an example program run by syzkaller (in generated C code and a Syz language version). Thanks for investigating this. My recommendation for now is that you write a reproducer program, possibly inspired by the syzkaller code you showed. Reproduce it cleanly and (optionally) try a fix. 
Then another mail would be good, especially with the amdgpu maintainers on Cc, since I suspect that this is a driver issue. Don't get me wrong, a UAF definitely needs to be fixed; but since it currently only occurs under fuzzing and we can't reproduce it, there isn't much we can do until that changes. I will in the meantime provide a patch improving the memory-lifetime documentation for scheduler objects. Thx P. ^ permalink raw reply [flat|nested] 17+ messages in thread
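If the diagnosis above is right, the fix belongs in the driver's teardown ordering rather than in the scheduler. A minimal sketch of that ordering follows; the example_priv structure and example_fini() function are hypothetical, and only the two scheduler calls are real API entry points already referenced in this thread:

	static void example_fini(struct example_priv *priv)
	{
		/*
		 * Finish every entity first: drm_sched_entity_fini() marks the
		 * entity stopped and removes it from its run queue under
		 * entity->lock and rq->lock, so later push_job or run-job work
		 * cannot reach it through the FIFO rb-tree.
		 */
		drm_sched_entity_fini(&priv->entity);

		/* Only then tear down the scheduler itself. */
		drm_sched_fini(&priv->sched);
	}

Whether amdgpu actually violates this ordering is exactly the open question in the thread, so the sketch is illustrative only.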
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-07-21 7:52 ` Philipp Stanner 2025-07-21 8:16 ` Philipp Stanner @ 2025-08-14 10:42 ` Tvrtko Ursulin 2025-08-14 11:45 ` Tvrtko Ursulin 1 sibling, 1 reply; 17+ messages in thread From: Tvrtko Ursulin @ 2025-08-14 10:42 UTC (permalink / raw) To: phasta, James Flowers, matthew.brost, dakr, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees On 21/07/2025 08:52, Philipp Stanner wrote: > +Cc Tvrtko, who's currently reworking FIFO and RR. > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: >> Fixes an issue where entities are added to the run queue in >> drm_sched_rq_update_fifo_locked after being killed, causing a >> slab-use-after-free error. >> >> Signed-off-by: James Flowers <bold.zone2373@fastmail.com> >> --- >> This issue was detected by syzkaller running on a Steam Deck OLED. >> Unfortunately I don't have a reproducer for it. I've > > Well, now that's kind of an issue – if you don't have a reproducer, how > can you know that your patch is correct? How can we? > > It would certainly be good to know what the fuzz testing framework > does. > >> included the KASAN reports below: > > > Anyways, KASAN reports look interesting. But those might be many > different issues. Again, would be good to know what the fuzzer has been > testing. Can you maybe split this fuzz test into sub-tests? I suspsect > those might be different faults. > > > Anyways, taking a first look… > > >> >> ================================================================== >> BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 >> Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 >> CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0-flowejam-+ #1 >> Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] >> Call Trace: >> <TASK> >> __dump_stack lib/dump_stack.c:94 [inline] >> dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> print_report+0xfc/0x1ff mm/kasan/report.c:521 >> kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> rb_next+0xda/0x160 lib/rbtree.c:505 >> drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/sched_main.c:332 [inline] [gpu_sched] >> drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/sched_main.c:1081 [gpu_sched] >> drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/sched_main.c:1206 [gpu_sched] >> process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 >> process_scheduled_works kernel/workqueue.c:3319 [inline] >> worker_thread+0x734/0x1060 kernel/workqueue.c:3400 >> kthread+0x3fd/0x810 kernel/kthread.c:464 >> ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 >> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 >> </TASK> >> Allocated by task 73472: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> drm_open+0x1a7/0x400 
drivers/gpu/drm/drm_file.c:376 >> drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> vfs_open+0x87/0x3f0 fs/open.c:1086 >> do_open+0x72f/0xf80 fs/namei.c:3830 >> path_openat+0x2ec/0x770 fs/namei.c:3989 >> do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> do_sys_open fs/open.c:1443 [inline] >> __do_sys_openat fs/open.c:1459 [inline] >> __se_sys_openat fs/open.c:1454 [inline] >> __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> Freed by task 73472: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> poison_slab_object mm/kasan/common.c:247 [inline] >> __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> kasan_slab_free include/linux/kasan.h:233 [inline] >> slab_free_hook mm/slub.c:2353 [inline] >> slab_free mm/slub.c:4609 [inline] >> kfree+0x14f/0x4d0 mm/slub.c:4757 >> amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> The buggy address belongs to the object at ffff888180508000 >> The buggy address is located 1504 bytes inside of >> The buggy address belongs to the physical page: >> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180508 >> head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> page_type: f5(slab) >> raw: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 >> raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000040 ffff888100043180 dead000000000100 dead000000000122 >> head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000003 ffffea0006014201 ffffffffffffffff 0000000000000000 >> head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> page dumped because: kasan: bad access detected >> Memory state around the buggy address: >> ffff888180508480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff888180508500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>> ffff888180508580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ^ >> ffff888180508600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff888180508680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ================================================================== >> 
================================================================== >> BUG: KASAN: slab-use-after-free in rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] >> BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] >> BUG: KASAN: slab-use-after-free in rb_erase+0x157c/0x1b10 lib/rbtree.c:443 >> Write of size 8 at addr ffff88816414c5d0 by task syz.2.3004/12376 >> CPU: 7 UID: 65534 PID: 12376 Comm: syz.2.3004 Not tainted 6.14.0-flowejam-+ #1 >> Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> Call Trace: >> <TASK> >> __dump_stack lib/dump_stack.c:94 [inline] >> dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> print_report+0xfc/0x1ff mm/kasan/report.c:521 >> kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> rb_set_parent_color include/linux/rbtree_augmented.h:191 [inline] >> __rb_erase_augmented include/linux/rbtree_augmented.h:312 [inline] >> rb_erase+0x157c/0x1b10 lib/rbtree.c:443 >> rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] >> drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] >> drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] >> drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] >> drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] >> drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] >> amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] >> amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] >> amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> exit_task_work include/linux/task_work.h:40 [inline] >> do_exit+0x841/0xf60 kernel/exit.c:938 >> do_group_exit+0xda/0x2b0 kernel/exit.c:1087 >> get_signal+0x171f/0x19d0 kernel/signal.c:3036 >> arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> RIP: 0033:0x7f2d90da36ed >> Code: Unable to access opcode bytes at 0x7f2d90da36c3. 
>> RSP: 002b:00007f2d91b710d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca >> RAX: 0000000000000000 RBX: 00007f2d90fe6088 RCX: 00007f2d90da36ed >> RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f2d90fe6088 >> RBP: 00007f2d90fe6080 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2d90fe608c >> R13: 0000000000000000 R14: 0000000000000002 R15: 00007ffc34a67bd0 >> </TASK> >> Allocated by task 12381: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> vfs_open+0x87/0x3f0 fs/open.c:1086 >> do_open+0x72f/0xf80 fs/namei.c:3830 >> path_openat+0x2ec/0x770 fs/namei.c:3989 >> do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> do_sys_open fs/open.c:1443 [inline] >> __do_sys_openat fs/open.c:1459 [inline] >> __se_sys_openat fs/open.c:1454 [inline] >> __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> Freed by task 12381: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> poison_slab_object mm/kasan/common.c:247 [inline] >> __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> kasan_slab_free include/linux/kasan.h:233 [inline] >> slab_free_hook mm/slub.c:2353 [inline] >> slab_free mm/slub.c:4609 [inline] >> kfree+0x14f/0x4d0 mm/slub.c:4757 >> amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> The buggy address belongs to the object at ffff88816414c000 >> The buggy address is located 1488 bytes inside of >> The buggy address belongs to the physical page: >> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x164148 >> head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> page_type: f5(slab) >> raw: 
0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 >> raw: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000040 ffff88810005c8c0 dead000000000122 0000000000000000 >> head: 0000000000000000 0000000080020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000003 ffffea0005905201 ffffffffffffffff 0000000000000000 >> head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> page dumped because: kasan: bad access detected >> Memory state around the buggy address: >> ffff88816414c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88816414c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>> ffff88816414c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ^ >> ffff88816414c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88816414c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ================================================================== >> ================================================================== >> BUG: KASAN: slab-use-after-free in __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] >> BUG: KASAN: slab-use-after-free in rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 >> Read of size 8 at addr ffff88812ebcc5e0 by task syz.1.814/6553 >> CPU: 0 UID: 65534 PID: 6553 Comm: syz.1.814 Not tainted 6.14.0-flowejam-+ #1 >> Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> Call Trace: >> <TASK> >> __dump_stack lib/dump_stack.c:94 [inline] >> dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> print_report+0xfc/0x1ff mm/kasan/report.c:521 >> kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> __rb_erase_augmented include/linux/rbtree_augmented.h:259 [inline] >> rb_erase+0xf5d/0x1b10 lib/rbtree.c:443 >> rb_erase_cached include/linux/rbtree.h:126 [inline] [gpu_sched] >> drm_sched_rq_remove_fifo_locked drivers/gpu/drm/scheduler/sched_main.c:154 [inline] [gpu_sched] >> drm_sched_rq_remove_entity+0x2d3/0x480 drivers/gpu/drm/scheduler/sched_main.c:243 [gpu_sched] >> drm_sched_entity_kill.part.0+0x82/0x5e0 drivers/gpu/drm/scheduler/sched_entity.c:237 [gpu_sched] >> drm_sched_entity_kill drivers/gpu/drm/scheduler/sched_entity.c:232 [inline] [gpu_sched] >> drm_sched_entity_fini+0x4c/0x290 drivers/gpu/drm/scheduler/sched_entity.c:331 [gpu_sched] >> amdgpu_vm_fini_entities drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:529 [inline] [amdgpu] >> amdgpu_vm_fini+0x862/0x1180 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2752 [amdgpu] >> amdgpu_driver_postclose_kms+0x3db/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1526 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] >> exit_to_user_mode_loop kernel/entry/common.c:114 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> RIP: 0033:0x7fd23eba36ed >> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 
ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 >> RSP: 002b:00007ffc2943a358 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 >> RAX: 0000000000000000 RBX: 00007ffc2943a428 RCX: 00007fd23eba36ed >> RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 >> RBP: 00007fd23ede7ba0 R08: 0000000000000001 R09: 0000000c00000000 >> R10: 00007fd23ea00000 R11: 0000000000000246 R12: 00007fd23ede5fac >> R13: 00007fd23ede5fa0 R14: 0000000000059ad1 R15: 0000000000059a8e >> </TASK> >> Allocated by task 6559: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> vfs_open+0x87/0x3f0 fs/open.c:1086 >> do_open+0x72f/0xf80 fs/namei.c:3830 >> path_openat+0x2ec/0x770 fs/namei.c:3989 >> do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> do_sys_open fs/open.c:1443 [inline] >> __do_sys_openat fs/open.c:1459 [inline] >> __se_sys_openat fs/open.c:1454 [inline] >> __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> Freed by task 6559: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> poison_slab_object mm/kasan/common.c:247 [inline] >> __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> kasan_slab_free include/linux/kasan.h:233 [inline] >> slab_free_hook mm/slub.c:2353 [inline] >> slab_free mm/slub.c:4609 [inline] >> kfree+0x14f/0x4d0 mm/slub.c:4757 >> amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> get_signal+0x1be/0x19d0 kernel/signal.c:2809 >> arch_do_signal_or_restart+0x96/0x3a0 arch/x86/kernel/signal.c:337 >> exit_to_user_mode_loop kernel/entry/common.c:111 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x1fc/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> The buggy address belongs to the object at ffff88812ebcc000 >> The buggy address is located 1504 bytes inside of >> The buggy address belongs to the physical page: >> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12ebc8 >> head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 
pincount:0 >> flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> page_type: f5(slab) >> raw: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 >> raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000040 ffff888100058780 dead000000000122 0000000000000000 >> head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000003 ffffea0004baf201 ffffffffffffffff 0000000000000000 >> head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> page dumped because: kasan: bad access detected >> Memory state around the buggy address: >> ffff88812ebcc480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88812ebcc500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>> ffff88812ebcc580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ^ >> ffff88812ebcc600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff88812ebcc680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ================================================================== >> ================================================================== >> BUG: KASAN: slab-use-after-free in drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] >> BUG: KASAN: slab-use-after-free in rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] >> BUG: KASAN: slab-use-after-free in drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] >> Read of size 8 at addr ffff8881208445c8 by task syz.1.49115/146644 >> CPU: 7 UID: 65534 PID: 146644 Comm: syz.1.49115 Not tainted 6.14.0-flowejam-+ #1 >> Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >> Call Trace: >> <TASK> >> __dump_stack lib/dump_stack.c:94 [inline] >> dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >> print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >> print_report+0xfc/0x1ff mm/kasan/report.c:521 >> kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >> drm_sched_entity_compare_before drivers/gpu/drm/scheduler/sched_main.c:147 [inline] [gpu_sched] >> rb_add_cached include/linux/rbtree.h:174 [inline] [gpu_sched] >> drm_sched_rq_update_fifo_locked+0x47b/0x540 drivers/gpu/drm/scheduler/sched_main.c:175 [gpu_sched] >> drm_sched_entity_push_job+0x509/0x5d0 drivers/gpu/drm/scheduler/sched_entity.c:623 [gpu_sched] > > This might be a race between entity killing and the push_job. 
Let's > look at your patch below… > >> amdgpu_job_submit+0x1a4/0x270 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:314 [amdgpu] >> amdgpu_vm_sdma_commit+0x1f9/0x7d0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c:122 [amdgpu] >> amdgpu_vm_pt_clear+0x540/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c:422 [amdgpu] >> amdgpu_vm_init+0x9c2/0x12f0 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2609 [amdgpu] >> amdgpu_driver_open_kms+0x274/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1418 [amdgpu] >> drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> vfs_open+0x87/0x3f0 fs/open.c:1086 >> do_open+0x72f/0xf80 fs/namei.c:3830 >> path_openat+0x2ec/0x770 fs/namei.c:3989 >> do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> do_sys_open fs/open.c:1443 [inline] >> __do_sys_openat fs/open.c:1459 [inline] >> __se_sys_openat fs/open.c:1454 [inline] >> __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> RIP: 0033:0x7feb303a36ed >> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 >> RSP: 002b:00007feb3123c018 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 >> RAX: ffffffffffffffda RBX: 00007feb305e5fa0 RCX: 00007feb303a36ed >> RDX: 0000000000000002 RSI: 0000200000000140 RDI: ffffffffffffff9c >> RBP: 00007feb30447722 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 >> R13: 0000000000000001 R14: 00007feb305e5fa0 R15: 00007ffcfd0a3460 >> </TASK> >> Allocated by task 146638: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >> __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >> kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >> kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >> amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1402 [amdgpu] >> drm_file_alloc+0x5d0/0xa00 drivers/gpu/drm/drm_file.c:171 >> drm_open_helper+0x1fe/0x540 drivers/gpu/drm/drm_file.c:323 >> drm_open+0x1a7/0x400 drivers/gpu/drm/drm_file.c:376 >> drm_stub_open+0x21a/0x390 drivers/gpu/drm/drm_drv.c:1149 >> chrdev_open+0x23b/0x6b0 fs/char_dev.c:414 >> do_dentry_open+0x743/0x1bf0 fs/open.c:956 >> vfs_open+0x87/0x3f0 fs/open.c:1086 >> do_open+0x72f/0xf80 fs/namei.c:3830 >> path_openat+0x2ec/0x770 fs/namei.c:3989 >> do_filp_open+0x1ff/0x420 fs/namei.c:4016 >> do_sys_openat2+0x181/0x1e0 fs/open.c:1428 >> do_sys_open fs/open.c:1443 [inline] >> __do_sys_openat fs/open.c:1459 [inline] >> __se_sys_openat fs/open.c:1454 [inline] >> __x64_sys_openat+0x149/0x210 fs/open.c:1454 >> do_syscall_x64 arch/x86/entry/common.c:52 [inline] >> do_syscall_64+0x92/0x180 arch/x86/entry/common.c:83 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> Freed by task 146638: >> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >> kasan_save_free_info+0x3b/0x70 mm/kasan/generic.c:576 >> poison_slab_object mm/kasan/common.c:247 
[inline] >> __kasan_slab_free+0x52/0x70 mm/kasan/common.c:264 >> kasan_slab_free include/linux/kasan.h:233 [inline] >> slab_free_hook mm/slub.c:2353 [inline] >> slab_free mm/slub.c:4609 [inline] >> kfree+0x14f/0x4d0 mm/slub.c:4757 >> amdgpu_driver_postclose_kms+0x43d/0x6b0 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1538 [amdgpu] >> drm_file_free.part.0+0x72d/0xbc0 drivers/gpu/drm/drm_file.c:255 >> drm_file_free drivers/gpu/drm/drm_file.c:228 [inline] >> drm_close_helper.isra.0+0x197/0x230 drivers/gpu/drm/drm_file.c:278 >> drm_release+0x1b0/0x3d0 drivers/gpu/drm/drm_file.c:426 >> __fput+0x402/0xb50 fs/file_table.c:464 >> task_work_run+0x155/0x250 kernel/task_work.c:227 >> resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] >> exit_to_user_mode_loop kernel/entry/common.c:114 [inline] >> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] >> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] >> syscall_exit_to_user_mode+0x26b/0x290 kernel/entry/common.c:218 >> do_syscall_64+0x9f/0x180 arch/x86/entry/common.c:89 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> The buggy address belongs to the object at ffff888120844000 >> The buggy address is located 1480 bytes inside of >> The buggy address belongs to the physical page: >> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120840 >> head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 >> flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) >> page_type: f5(slab) >> raw: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 >> raw: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000040 ffff88810005c8c0 ffffea0005744c00 dead000000000002 >> head: 0000000000000000 0000000000020002 00000000f5000000 0000000000000000 >> head: 0017ffffc0000003 ffffea0004821001 ffffffffffffffff 0000000000000000 >> head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 >> page dumped because: kasan: bad access detected >> Memory state around the buggy address: >> ffff888120844480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff888120844500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >>> ffff888120844580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ^ >> ffff888120844600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ffff888120844680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >> ================================================================== >> >> drivers/gpu/drm/scheduler/sched_main.c | 6 ++++-- >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> index bfea608a7106..997a2cc1a635 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, >> >> entity->oldest_job_waiting = ts; >> >> - rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> - drm_sched_entity_compare_before); >> + if (!entity->stopped) { >> + rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root, >> + drm_sched_entity_compare_before); >> + } > > If this is a race, then this patch here is broken, too, because you're > checking the 'stopped' boolean as the callers of that function do, too > – just later. :O > > Could still race, just less likely. > > The proper way to fix it would then be to address the issue where the > locking is supposed to happen. 
Let's look at, for example,
> drm_sched_entity_push_job():
>
>
> void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> {
> 	(Bla bla bla)
>
> 	…………
>
> 	/* first job wakes up scheduler */
> 	if (first) {
> 		struct drm_gpu_scheduler *sched;
> 		struct drm_sched_rq *rq;
>
> 		/* Add the entity to the run queue */
> 		spin_lock(&entity->lock);
> 		if (entity->stopped) {			<---- Aha!
> 			spin_unlock(&entity->lock);
>
> 			DRM_ERROR("Trying to push to a killed entity\n");
> 			return;
> 		}
>
> 		rq = entity->rq;
> 		sched = rq->sched;
>
> 		spin_lock(&rq->lock);
> 		drm_sched_rq_add_entity(rq, entity);
>
> 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> 			drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);	<---- bumm!
>
> 		spin_unlock(&rq->lock);
> 		spin_unlock(&entity->lock);
>
> But the locks are still being held. So that "shouldn't be happening"(tm).
>
> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini()
> stop entities. The former holds appropriate locks, but drm_sched_fini()
> doesn't. So that looks like a hot candidate to me. Opinions?
>
> On the other hand, aren't drivers prohibited from calling
> drm_sched_entity_push_job() after calling drm_sched_fini()? If the
> fuzzer does that, then it's not the scheduler's fault.
>
> Could you test adding spin_lock(&entity->lock) to drm_sched_fini()?
>
> Would be cool if Tvrtko and Christian take a look. Maybe we even have a
> fundamental design issue.

It would be nice to have a reproducer, and from this thread I did not
manage to figure out if the syzkaller snippet James posted was it, or
not quite it.

In either case, I think one race I see relates to the early exit
!entity->rq check before setting entity->stopped in drm_sched_entity_kill().

If the entity was not submitted at all yet (at the time of process
exit / entity kill), entity->stopped will therefore not be set. A
parallel job submit can then re-add the entity to the tree, as process
exit / file close / entity kill is continuing and is about to kfree the
entity (in the case of the amdgpu report there are two entities embedded
in file_priv).

One way to make this more robust is to make the entity->rq check in
drm_sched_entity_kill() stronger. Or actually to remove it altogether.
But I think it also requires checking for entity->stopped in
drm_sched_entity_select_rq() and propagating the error code all the way
out from drm_sched_job_arm().

That way entity->stopped is properly serialized and acted upon early
enough to avoid dereferencing a freed entity and to avoid creating jobs
not attached to anything (rather than only getting a warning from
push_job).

Disclaimer: I haven't tried to experiment with this yet, so I may be
missing something. At least writing a reproducer for the race I
described sounds easy, so unless someone shouts I am talking nonsense I
can do that and also sketch out a fix. *If* the theory will hold water
after I write the test case.

Regards,

Tvrtko

>
>> 	}
>>
>>  /**
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
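[A rough sketch, for illustration only, of the serialization Tvrtko outlines above. This is not a posted patch: it assumes entity->stopped may be set unconditionally under entity->lock in drm_sched_entity_kill(), that drm_sched_entity_select_rq() can grow an error return which drm_sched_job_arm() then propagates, and it abbreviates the unchanged parts of both functions.]

	/* Stop the entity even if it never made it onto a run queue, so a
	 * racing submitter observes ->stopped before the entity is freed. */
	static void drm_sched_entity_kill(struct drm_sched_entity *entity)
	{
		spin_lock(&entity->lock);
		entity->stopped = true;
		if (entity->rq)
			drm_sched_rq_remove_entity(entity->rq, entity);
		spin_unlock(&entity->lock);

		/* ... existing pending-job cleanup continues unchanged ... */
	}

	/* Refuse stopped entities at run-queue selection time, so the error
	 * can be returned from drm_sched_job_arm() instead of only producing
	 * a warning later in drm_sched_entity_push_job(). */
	static int drm_sched_entity_select_rq(struct drm_sched_entity *entity)
	{
		int ret = 0;

		spin_lock(&entity->lock);
		if (entity->stopped)
			ret = -ENOENT;
		/* ... existing run-queue re-selection otherwise unchanged ... */
		spin_unlock(&entity->lock);

		return ret;
	}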
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-08-14 10:42 ` Tvrtko Ursulin @ 2025-08-14 11:45 ` Tvrtko Ursulin 2025-08-14 11:49 ` Philipp Stanner 0 siblings, 1 reply; 17+ messages in thread From: Tvrtko Ursulin @ 2025-08-14 11:45 UTC (permalink / raw) To: phasta, James Flowers, matthew.brost, dakr, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees On 14/08/2025 11:42, Tvrtko Ursulin wrote: > > On 21/07/2025 08:52, Philipp Stanner wrote: >> +Cc Tvrtko, who's currently reworking FIFO and RR. >> >> On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: >>> Fixes an issue where entities are added to the run queue in >>> drm_sched_rq_update_fifo_locked after being killed, causing a >>> slab-use-after-free error. >>> >>> Signed-off-by: James Flowers <bold.zone2373@fastmail.com> >>> --- >>> This issue was detected by syzkaller running on a Steam Deck OLED. >>> Unfortunately I don't have a reproducer for it. I've >> >> Well, now that's kind of an issue – if you don't have a reproducer, how >> can you know that your patch is correct? How can we? >> >> It would certainly be good to know what the fuzz testing framework >> does. >> >>> included the KASAN reports below: >> >> >> Anyways, KASAN reports look interesting. But those might be many >> different issues. Again, would be good to know what the fuzzer has been >> testing. Can you maybe split this fuzz test into sub-tests? I suspsect >> those might be different faults. >> >> >> Anyways, taking a first look… >> >> >>> >>> ================================================================== >>> BUG: KASAN: slab-use-after-free in rb_next+0xda/0x160 lib/rbtree.c:505 >>> Read of size 8 at addr ffff8881805085e0 by task kworker/u32:12/192 >>> CPU: 3 UID: 0 PID: 192 Comm: kworker/u32:12 Not tainted 6.14.0- >>> flowejam-+ #1 >>> Hardware name: Valve Galileo/Galileo, BIOS F7G0112 08/01/2024 >>> Workqueue: sdma0 drm_sched_run_job_work [gpu_sched] >>> Call Trace: >>> <TASK> >>> __dump_stack lib/dump_stack.c:94 [inline] >>> dump_stack_lvl+0xd2/0x130 lib/dump_stack.c:120 >>> print_address_description.constprop.0+0x88/0x380 mm/kasan/report.c:408 >>> print_report+0xfc/0x1ff mm/kasan/report.c:521 >>> kasan_report+0xdd/0x1b0 mm/kasan/report.c:634 >>> rb_next+0xda/0x160 lib/rbtree.c:505 >>> drm_sched_rq_select_entity_fifo drivers/gpu/drm/scheduler/ >>> sched_main.c:332 [inline] [gpu_sched] >>> drm_sched_select_entity+0x497/0x720 drivers/gpu/drm/scheduler/ >>> sched_main.c:1081 [gpu_sched] >>> drm_sched_run_job_work+0x2e/0x710 drivers/gpu/drm/scheduler/ >>> sched_main.c:1206 [gpu_sched] >>> process_one_work+0x9c0/0x17e0 kernel/workqueue.c:3238 >>> process_scheduled_works kernel/workqueue.c:3319 [inline] >>> worker_thread+0x734/0x1060 kernel/workqueue.c:3400 >>> kthread+0x3fd/0x810 kernel/kthread.c:464 >>> ret_from_fork+0x53/0x80 arch/x86/kernel/process.c:148 >>> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 >>> </TASK> >>> Allocated by task 73472: >>> kasan_save_stack+0x30/0x50 mm/kasan/common.c:47 >>> kasan_save_track+0x14/0x30 mm/kasan/common.c:68 >>> poison_kmalloc_redzone mm/kasan/common.c:377 [inline] >>> __kasan_kmalloc+0x9a/0xb0 mm/kasan/common.c:394 >>> kmalloc_noprof include/linux/slab.h:901 [inline] [amdgpu] >>> kzalloc_noprof include/linux/slab.h:1037 [inline] [amdgpu] >>> amdgpu_driver_open_kms+0x151/0x660 drivers/gpu/drm/amd/amdgpu/ >>> amdgpu_kms.c:1402 [amdgpu] >>> drm_file_alloc+0x5d0/0xa00 
>>> [SNIP]
>>
>> This might be a race between entity killing and the push_job. Let's
>> look at your patch below…
>>
>>> [SNIP]
>>>
>>>   drivers/gpu/drm/scheduler/sched_main.c | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index bfea608a7106..997a2cc1a635 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
>>>   	entity->oldest_job_waiting = ts;
>>> -	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
>>> -		      drm_sched_entity_compare_before);
>>> +	if (!entity->stopped) {
>>> +		rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
>>> +			      drm_sched_entity_compare_before);
>>> +	}
>>
>> If this is a race, then this patch here is broken, too, because you're
>> checking the 'stopped' boolean as the callers of that function do, too
>> –
just later. :O >> >> Could still race, just less likely. >> >> The proper way to fix it would then be to address the issue where the >> locking is supposed to happen. Let's look at, for example, >> drm_sched_entity_push_job(): >> >> >> void drm_sched_entity_push_job(struct drm_sched_job *sched_job) >> { >> (Bla bla bla) >> >> ………… >> >> /* first job wakes up scheduler */ >> if (first) { >> struct drm_gpu_scheduler *sched; >> struct drm_sched_rq *rq; >> >> /* Add the entity to the run queue */ >> spin_lock(&entity->lock); >> if (entity->stopped) { <---- Aha! >> spin_unlock(&entity->lock); >> >> DRM_ERROR("Trying to push to a killed entity\n"); >> return; >> } >> >> rq = entity->rq; >> sched = rq->sched; >> >> spin_lock(&rq->lock); >> drm_sched_rq_add_entity(rq, entity); >> >> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) >> drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); >> <---- bumm! >> >> spin_unlock(&rq->lock); >> spin_unlock(&entity->lock); >> >> But the locks are still being hold. So that "shouldn't be happening"(tm). >> >> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() >> stop entities. The former holds appropriate locks, but drm_sched_fini() >> doesn't. So that looks like a hot candidate to me. Opinions? >> >> On the other hand, aren't drivers prohibited from calling >> drm_sched_entity_push_job() after calling drm_sched_fini()? If the >> fuzzer does that, then it's not the scheduler's fault. >> >> Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? >> >> Would be cool if Tvrtko and Christian take a look. Maybe we even have a >> fundamental design issue. > > It would be nice to have a reproducer and from this thread I did not > manage to figure out if the syzkaller snipper James posted was it, or > not quite it. > > In either case, I think one race I see relates to the early exit ! > entity->rq check before setting entity->stopped in drm_sched_entity_kill(). > > If the entity was not submitted at all yet (at the time of process > exit / entity kill), entity->stopped will therefore not be set. A > parallel job submit can then re-add the entity to the tree, as process > exit / file close / entity kill is continuing and is about to kfree the > entity (in the case of amdgpu report there are two entities embedded in > file_priv). > > One way to make this more robust is to make the entity->rq check in > drm_sched_entity_kill() stronger. Or actually to remove it altogether. > But I think it also requires checking for entity->stopped in > drm_sched_entity_select_rq() and propagating the error code all the way > out from drm_sched_job_arm(). > > That was entity->stopped is properly serialized and acted upon early > enough to avoid dereferencing a freed entity and avoid creating jobs not > attached to anything (but only have a warning from push job). > > Disclaimer I haven't tried to experiment with this yet, so I may be > missing something. At least writing a reproducer for the race I > described sounds easy so unless someone shouts I am talking nonsense I > can do that and also sketch out a fix. *If* the theory will hold water > after I write the test case. Nah I was talking nonsense. Forgot entity->rq is assigned on entity init and jobs cannot be created unless it is set. Okay, I have no theories as to what bug syzkaller found. Regards, Tvrtko > >> >>> } >>> /** >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-08-14 11:45 ` Tvrtko Ursulin @ 2025-08-14 11:49 ` Philipp Stanner 2025-08-14 12:17 ` Tvrtko Ursulin 0 siblings, 1 reply; 17+ messages in thread From: Philipp Stanner @ 2025-08-14 11:49 UTC (permalink / raw) To: Tvrtko Ursulin, phasta, James Flowers, matthew.brost, dakr, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees On Thu, 2025-08-14 at 12:45 +0100, Tvrtko Ursulin wrote: > > On 14/08/2025 11:42, Tvrtko Ursulin wrote: > > > > On 21/07/2025 08:52, Philipp Stanner wrote: > > > +Cc Tvrtko, who's currently reworking FIFO and RR. > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: > > > > Fixes an issue where entities are added to the run queue in > > > > drm_sched_rq_update_fifo_locked after being killed, causing a > > > > slab-use-after-free error. > > > > > > > > Signed-off-by: James Flowers <bold.zone2373@fastmail.com> > > > > --- > > > > This issue was detected by syzkaller running on a Steam Deck OLED. > > > > Unfortunately I don't have a reproducer for it. I've > > > > > > Well, now that's kind of an issue – if you don't have a reproducer, how > > > can you know that your patch is correct? How can we? > > > > > > It would certainly be good to know what the fuzz testing framework > > > does. > > > > > > > included the KASAN reports below: > > > > > > > > > Anyways, KASAN reports look interesting. But those might be many > > > different issues. Again, would be good to know what the fuzzer has been > > > testing. Can you maybe split this fuzz test into sub-tests? I suspsect > > > those might be different faults. > > > > > > > > > Anyways, taking a first look… > > > > > > [SNIP] > > > > > > > > ================================================================== > > > > > > If this is a race, then this patch here is broken, too, because you're > > > checking the 'stopped' boolean as the callers of that function do, too > > > – just later. :O > > > > > > Could still race, just less likely. > > > > > > The proper way to fix it would then be to address the issue where the > > > locking is supposed to happen. Let's look at, for example, > > > drm_sched_entity_push_job(): > > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job) > > > { > > > (Bla bla bla) > > > > > > ………… > > > > > > /* first job wakes up scheduler */ > > > if (first) { > > > struct drm_gpu_scheduler *sched; > > > struct drm_sched_rq *rq; > > > > > > /* Add the entity to the run queue */ > > > spin_lock(&entity->lock); > > > if (entity->stopped) { <---- Aha! > > > spin_unlock(&entity->lock); > > > > > > DRM_ERROR("Trying to push to a killed entity\n"); > > > return; > > > } > > > > > > rq = entity->rq; > > > sched = rq->sched; > > > > > > spin_lock(&rq->lock); > > > drm_sched_rq_add_entity(rq, entity); > > > > > > if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) > > > drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); > > > <---- bumm! > > > > > > spin_unlock(&rq->lock); > > > spin_unlock(&entity->lock); > > > > > > But the locks are still being hold. So that "shouldn't be happening"(tm). > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() > > > stop entities. The former holds appropriate locks, but drm_sched_fini() > > > doesn't. So that looks like a hot candidate to me. Opinions? 
> > > > > > On the other hand, aren't drivers prohibited from calling > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the > > > fuzzer does that, then it's not the scheduler's fault. > > > > > > Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? > > > > > > Would be cool if Tvrtko and Christian take a look. Maybe we even have a > > > fundamental design issue. > > > > It would be nice to have a reproducer and from this thread I did not > > manage to figure out if the syzkaller snipper James posted was it, or > > not quite it. > > > > In either case, I think one race I see relates to the early exit ! > > entity->rq check before setting entity->stopped in drm_sched_entity_kill(). > > > > If the entity was not submitted at all yet (at the time of process > > exit / entity kill), entity->stopped will therefore not be set. A > > parallel job submit can then re-add the entity to the tree, as process > > exit / file close / entity kill is continuing and is about to kfree the > > entity (in the case of amdgpu report there are two entities embedded in > > file_priv). > > > > One way to make this more robust is to make the entity->rq check in > > drm_sched_entity_kill() stronger. Or actually to remove it altogether. > > But I think it also requires checking for entity->stopped in > > drm_sched_entity_select_rq() and propagating the error code all the way > > out from drm_sched_job_arm(). > > > > That was entity->stopped is properly serialized and acted upon early > > enough to avoid dereferencing a freed entity and avoid creating jobs not > > attached to anything (but only have a warning from push job). > > > > Disclaimer I haven't tried to experiment with this yet, so I may be > > missing something. At least writing a reproducer for the race I > > described sounds easy so unless someone shouts I am talking nonsense I > > can do that and also sketch out a fix. *If* the theory will hold water > > after I write the test case. > > Nah I was talking nonsense. Forgot entity->rq is assigned on entity init > and jobs cannot be created unless it is set. > > Okay, I have no theories as to what bug syzkaller found. I just was about to answer. I agree that the rq check should be fine. As you can see in the thread, I suspect that this is a race between drm_sched_entity_push_job() and drm_sched_fini(). See here: https://lore.kernel.org/dri-devel/20250813085654.102504-2-phasta@kernel.org/ I think as long as there's no reproducer there is not much to do for us here. A long term goal, though, is to enforce the life time rules. Entities must be torn down before their scheduler. Checking this for all drivers will be quite some work, though.. P. > > Regards, > > Tvrtko > > > > > > > > > > } > > > > /** > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
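[For context, a heavily abbreviated sketch of the entity-stopping loop in drm_sched_fini() that Philipp refers to, as it looks around this kernel version — an approximate reconstruction, not a quote; see the lore link above for his actual proposal. The relevant point is that ->stopped is written here under rq->lock only, while drm_sched_entity_push_job() reads it under entity->lock, so the store and the check are not serialized by the same lock.]

	void drm_sched_fini(struct drm_gpu_scheduler *sched)
	{
		struct drm_sched_entity *s_entity;
		int i;

		drm_sched_wqueue_stop(sched);

		for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
			struct drm_sched_rq *rq = sched->sched_rq[i];

			spin_lock(&rq->lock);
			list_for_each_entry(s_entity, &rq->entities, list)
				/* Prevents reinsertion; note: no entity->lock taken here */
				s_entity->stopped = true;
			spin_unlock(&rq->lock);

			kfree(sched->sched_rq[i]);
			sched->sched_rq[i] = NULL;
		}

		/* ... remaining teardown (workqueues, tdr, rq array) omitted ... */
	}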
* Re: [PATCH] drm/sched: Prevent stopped entities from being added to the run queue. 2025-08-14 11:49 ` Philipp Stanner @ 2025-08-14 12:17 ` Tvrtko Ursulin 0 siblings, 0 replies; 17+ messages in thread From: Tvrtko Ursulin @ 2025-08-14 12:17 UTC (permalink / raw) To: phasta, James Flowers, matthew.brost, dakr, ckoenig.leichtzumerken, maarten.lankhorst, mripard, tzimmermann, airlied, simona, skhan Cc: dri-devel, linux-kernel, linux-kernel-mentees On 14/08/2025 12:49, Philipp Stanner wrote: > On Thu, 2025-08-14 at 12:45 +0100, Tvrtko Ursulin wrote: >> >> On 14/08/2025 11:42, Tvrtko Ursulin wrote: >>> >>> On 21/07/2025 08:52, Philipp Stanner wrote: >>>> +Cc Tvrtko, who's currently reworking FIFO and RR. >>>> >>>> On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote: >>>>> Fixes an issue where entities are added to the run queue in >>>>> drm_sched_rq_update_fifo_locked after being killed, causing a >>>>> slab-use-after-free error. >>>>> >>>>> Signed-off-by: James Flowers <bold.zone2373@fastmail.com> >>>>> --- >>>>> This issue was detected by syzkaller running on a Steam Deck OLED. >>>>> Unfortunately I don't have a reproducer for it. I've >>>> >>>> Well, now that's kind of an issue – if you don't have a reproducer, how >>>> can you know that your patch is correct? How can we? >>>> >>>> It would certainly be good to know what the fuzz testing framework >>>> does. >>>> >>>>> included the KASAN reports below: >>>> >>>> >>>> Anyways, KASAN reports look interesting. But those might be many >>>> different issues. Again, would be good to know what the fuzzer has been >>>> testing. Can you maybe split this fuzz test into sub-tests? I suspsect >>>> those might be different faults. >>>> >>>> >>>> Anyways, taking a first look… >>>> >>>> > > > [SNIP] > >>>>> >>>>> ================================================================== >>>> >>>> If this is a race, then this patch here is broken, too, because you're >>>> checking the 'stopped' boolean as the callers of that function do, too >>>> – just later. :O >>>> >>>> Could still race, just less likely. >>>> >>>> The proper way to fix it would then be to address the issue where the >>>> locking is supposed to happen. Let's look at, for example, >>>> drm_sched_entity_push_job(): >>>> >>>> >>>> void drm_sched_entity_push_job(struct drm_sched_job *sched_job) >>>> { >>>> (Bla bla bla) >>>> >>>> ………… >>>> >>>> /* first job wakes up scheduler */ >>>> if (first) { >>>> struct drm_gpu_scheduler *sched; >>>> struct drm_sched_rq *rq; >>>> >>>> /* Add the entity to the run queue */ >>>> spin_lock(&entity->lock); >>>> if (entity->stopped) { <---- Aha! >>>> spin_unlock(&entity->lock); >>>> >>>> DRM_ERROR("Trying to push to a killed entity\n"); >>>> return; >>>> } >>>> >>>> rq = entity->rq; >>>> sched = rq->sched; >>>> >>>> spin_lock(&rq->lock); >>>> drm_sched_rq_add_entity(rq, entity); >>>> >>>> if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) >>>> drm_sched_rq_update_fifo_locked(entity, rq, submit_ts); >>>> <---- bumm! >>>> >>>> spin_unlock(&rq->lock); >>>> spin_unlock(&entity->lock); >>>> >>>> But the locks are still being hold. So that "shouldn't be happening"(tm). >>>> >>>> Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() >>>> stop entities. The former holds appropriate locks, but drm_sched_fini() >>>> doesn't. So that looks like a hot candidate to me. Opinions? >>>> >>>> On the other hand, aren't drivers prohibited from calling >>>> drm_sched_entity_push_job() after calling drm_sched_fini()? 
If the >>>> fuzzer does that, then it's not the scheduler's fault. >>>> >>>> Could you test adding spin_lock(&entity->lock) to drm_sched_fini()? >>>> >>>> Would be cool if Tvrtko and Christian take a look. Maybe we even have a >>>> fundamental design issue. >>> >>> It would be nice to have a reproducer and from this thread I did not >>> manage to figure out if the syzkaller snipper James posted was it, or >>> not quite it. >>> >>> In either case, I think one race I see relates to the early exit ! >>> entity->rq check before setting entity->stopped in drm_sched_entity_kill(). >>> >>> If the entity was not submitted at all yet (at the time of process >>> exit / entity kill), entity->stopped will therefore not be set. A >>> parallel job submit can then re-add the entity to the tree, as process >>> exit / file close / entity kill is continuing and is about to kfree the >>> entity (in the case of amdgpu report there are two entities embedded in >>> file_priv). >>> >>> One way to make this more robust is to make the entity->rq check in >>> drm_sched_entity_kill() stronger. Or actually to remove it altogether. >>> But I think it also requires checking for entity->stopped in >>> drm_sched_entity_select_rq() and propagating the error code all the way >>> out from drm_sched_job_arm(). >>> >>> That was entity->stopped is properly serialized and acted upon early >>> enough to avoid dereferencing a freed entity and avoid creating jobs not >>> attached to anything (but only have a warning from push job). >>> >>> Disclaimer I haven't tried to experiment with this yet, so I may be >>> missing something. At least writing a reproducer for the race I >>> described sounds easy so unless someone shouts I am talking nonsense I >>> can do that and also sketch out a fix. *If* the theory will hold water >>> after I write the test case. >> >> Nah I was talking nonsense. Forgot entity->rq is assigned on entity init >> and jobs cannot be created unless it is set. >> >> Okay, I have no theories as to what bug syzkaller found. > > I just was about to answer. > > I agree that the rq check should be fine. > > As you can see in the thread, I suspect that this is a race between > drm_sched_entity_push_job() and drm_sched_fini(). > > See here: > https://lore.kernel.org/dri-devel/20250813085654.102504-2-phasta@kernel.org/ Yeah I read it. Problem with the amdgpu angle and this KASAN report is that to me it looked the UAF is about the two VM update entities embedded in struct file priv. And the schedulers used to initialize those are not torn down until driver unload. So I didn't think syzkaller would have hit that and was looking for alternative ideas. Regards, Tvrtko > I think as long as there's no reproducer there is not much to do for us > here. A long term goal, though, is to enforce the life time rules. > Entities must be torn down before their scheduler. Checking this for > all drivers will be quite some work, though.. > > > P. > > >> >> Regards, >> >> Tvrtko >> >>> >>>> >>>>> } >>>>> /** >>>> >>> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
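[For reference, a rough sketch of how the two VM-update entities Tvrtko mentions are embedded in amdgpu — abbreviated, field layout approximate: the DRM file private data owns the amdgpu_vm by value, and the VM embeds its scheduler entities, so freeing file_priv frees the entities with it.]

	struct amdgpu_fpriv {
		struct amdgpu_vm	vm;	/* freed together with file_priv */
		/* ... */
	};

	struct amdgpu_vm {
		/* ... */
		/* scheduler entities for page-table updates */
		struct drm_sched_entity	immediate;
		struct drm_sched_entity	delayed;
		/* ... */
	};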
end of thread, other threads:[~2025-08-14 12:17 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-20 23:56 [PATCH] drm/sched: Prevent stopped entities from being added to the run queue James Flowers 2025-07-21 7:52 ` Philipp Stanner 2025-07-21 8:16 ` Philipp Stanner 2025-07-21 10:14 ` Danilo Krummrich 2025-07-21 18:07 ` Matthew Brost 2025-07-22 7:37 ` Philipp Stanner 2025-07-22 8:07 ` Matthew Brost 2025-07-22 8:45 ` Matthew Brost 2025-07-23 6:56 ` Philipp Stanner 2025-07-24 4:13 ` Matthew Brost 2025-07-24 4:17 ` Matthew Brost 2025-07-22 20:05 ` James 2025-07-23 14:41 ` Philipp Stanner 2025-08-14 10:42 ` Tvrtko Ursulin 2025-08-14 11:45 ` Tvrtko Ursulin 2025-08-14 11:49 ` Philipp Stanner 2025-08-14 12:17 ` Tvrtko Ursulin