* [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages @ 2025-06-03 15:31 syzbot 2025-06-03 16:22 ` David Hildenbrand 2025-06-21 21:52 ` syzbot 0 siblings, 2 replies; 17+ messages in thread From: syzbot @ 2025-06-03 15:31 UTC (permalink / raw) To: akpm, david, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs Hello, syzbot found the following issue on: HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000 kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 userspace arch: arm64 Unfortunately, I don't have any reproducer for this issue yet. Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) ------------[ cut here ]------------ kernel BUG at mm/gup.c:70! 
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: iou_exit io_ring_exit_work pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 sp : ffff800097f17640 x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000 x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000 x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276 x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001 x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00 x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68 x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 Call trace: sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P) unpin_user_page+0x80/0x10c mm/gup.c:191 io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113 io_buffer_unmap io_uring/rsrc.c:140 [inline] io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513 io_put_rsrc_node io_uring/rsrc.h:103 [inline] io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197 io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607 io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723 io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962 process_one_work+0x7e8/0x156c kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x958/0xed8 kernel/workqueue.c:3400 kthread+0x5fc/0x75c kernel/kthread.c:464 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847 Code: 900523a1 910e0021 aa1703e0 97fff8a9 (d4210000) ---[ end trace 
0000000000000000 ]--- --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. If the report is already addressed, let syzbot know by replying with: #syz fix: exact-commit-title If you want to overwrite report's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the report is a duplicate of another one, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-03 15:31 [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages syzbot @ 2025-06-03 16:22 ` David Hildenbrand 2025-06-03 17:20 ` Jens Axboe 2025-06-21 21:52 ` syzbot 1 sibling, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-03 16:22 UTC (permalink / raw) To: syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs, Jens Axboe On 03.06.25 17:31, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci > git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci > console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000 > kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a > dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 > compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 > userspace arch: arm64 > > Unfortunately, I don't have any reproducer for this issue yet. > > Downloadable assets: > disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz > vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz > kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com > > head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 > page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) > ------------[ cut here ]------------ > kernel BUG at mm/gup.c:70! 
> Internal error: Oops - BUG: 00000000f2000800 [#1] SMP > Modules linked in: > > CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 > Workqueue: iou_exit io_ring_exit_work > pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 > lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 > sp : ffff800097f17640 > x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000 > x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000 > x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c > x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276 > x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001 > x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000 > x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00 > x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001 > x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68 > x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 > Call trace: > sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P) > unpin_user_page+0x80/0x10c mm/gup.c:191 > io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113 > io_buffer_unmap io_uring/rsrc.c:140 [inline] > io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513 > io_put_rsrc_node io_uring/rsrc.h:103 [inline] > io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197 > io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607 > io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723 > io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962 > process_one_work+0x7e8/0x156c kernel/workqueue.c:3238 > process_scheduled_works kernel/workqueue.c:3319 [inline] > worker_thread+0x958/0xed8 kernel/workqueue.c:3400 > kthread+0x5fc/0x75c kernel/kthread.c:464 > ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847 > Code: 
900523a1 910e0021 aa1703e0 97fff8a9 (d4210000) > ---[ end trace 0000000000000000 ]--- So we lost a PAE bit for a pinned folio. [ 97.640225][ T115] page: refcount:512 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x13b400 [ 97.640378][ T115] head: order:9 mapcount:511 entire_mapcount:0 nr_pages_mapped:511 pincount:1 The folio is indeed pinned, and it is PTE-mapped (511 PTEs are mapped). The page we are using for unpinning is not mapped (mapcount:0). pfn:0x13b400 indicates that the page we are provided is actually the head page (folio->page). [ 97.640414][ T115] memcg:ffff0000f36b6000 [ 97.640435][ T115] anon flags: 0x5ffc0000002107c(referenced|uptodate|dirty|lru|arch_1|head|swapbacked|node=0|zone=2|lastcpupid=0x7ff) [ 97.640468][ T115] raw: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 [ 97.640490][ T115] raw: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 [ 97.640514][ T115] head: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 [ 97.640536][ T115] head: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 [ 97.640559][ T115] head: 05ffc00000010a09 fffffdffc3ed0001 000001ff000001fe 00000001ffffffff [ 97.640581][ T115] head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 [ 97.640600][ T115] page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) So we effectively only test the head page. Here we don't have the bit set for that page. In gup_fast() we perform a similar sanity check, which didn't trigger at the time we pinned the folio. io_uring ends up calling io_pin_pages() where we call pin_user_pages_fast(), so GUP-fast might indeed trigger. What could trigger this (in weird scenarios, though) is if we used pin_user_page() to obtain a page, then did folio = page_folio(page) and called unpin_user_page(&folio->page) instead of using unpin_folio(). Or using any other page that we didn't pin. 
It would be a corner case, though. Staring at io_release_ubuf(), that's also not immediately what's happening. There is this coalescing code in io_sqe_buffer_register()->io_check_coalesce_buffer(), maybe ... something is going wrong there? Otherwise, I could only envision (a) some random memory overwrite clearing the bit or (b) some weird race between GUP-fast and PAE clearing that we didn't run into so far. But these sanity checks have been around for a loooong time at this point. Unfortunately, no reproducer :( -- Cheers, David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-03 16:22 ` David Hildenbrand @ 2025-06-03 17:20 ` Jens Axboe 2025-06-03 17:25 ` David Hildenbrand 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2025-06-03 17:20 UTC (permalink / raw) To: David Hildenbrand, syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 6/3/25 10:22 AM, David Hildenbrand wrote: > On 03.06.25 17:31, syzbot wrote: >> Hello, >> >> syzbot found the following issue on: >> >> HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci >> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >> console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000 >> kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a >> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >> userspace arch: arm64 >> >> Unfortunately, I don't have any reproducer for this issue yet. >> >> Downloadable assets: >> disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz >> vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz >> kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz >> >> IMPORTANT: if you fix the issue, please add the following tag to the commit: >> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com >> >> head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 >> page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) >> ------------[ cut here ]------------ >> kernel BUG at mm/gup.c:70! 
>> Internal error: Oops - BUG: 00000000f2000800 [#1] SMP >> Modules linked in: >> >> CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 >> Workqueue: iou_exit io_ring_exit_work >> pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >> lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >> sp : ffff800097f17640 >> x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000 >> x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000 >> x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c >> x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276 >> x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001 >> x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000 >> x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00 >> x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001 >> x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68 >> x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 >> Call trace: >> sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P) >> unpin_user_page+0x80/0x10c mm/gup.c:191 >> io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113 >> io_buffer_unmap io_uring/rsrc.c:140 [inline] >> io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513 >> io_put_rsrc_node io_uring/rsrc.h:103 [inline] >> io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197 >> io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607 >> io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723 >> io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962 >> process_one_work+0x7e8/0x156c kernel/workqueue.c:3238 >> process_scheduled_works kernel/workqueue.c:3319 [inline] >> worker_thread+0x958/0xed8 kernel/workqueue.c:3400 >> kthread+0x5fc/0x75c kernel/kthread.c:464 >> ret_from_fork+0x10/0x20 
arch/arm64/kernel/entry.S:847 >> Code: 900523a1 910e0021 aa1703e0 97fff8a9 (d4210000) >> ---[ end trace 0000000000000000 ]--- > > So we lost a PAE bit for a pinned folio. > > [ 97.640225][ T115] page: refcount:512 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x13b400 > [ 97.640378][ T115] head: order:9 mapcount:511 entire_mapcount:0 nr_pages_mapped:511 pincount:1 > > The folio is indeed pinned, and it is PTE-mapped (511 PTEs are mapped). > > The page we are using for unpinning is not mapped (mapcount:0). > > pfn:0x13b400 indicates that the page we are provided is actually the head page (folio->page). > > > [ 97.640414][ T115] memcg:ffff0000f36b6000 > [ 97.640435][ T115] anon flags: 0x5ffc0000002107c(referenced|uptodate|dirty|lru|arch_1|head|swapbacked|node=0|zone=2|lastcpupid=0x7ff) > [ 97.640468][ T115] raw: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 > [ 97.640490][ T115] raw: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 > [ 97.640514][ T115] head: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 > [ 97.640536][ T115] head: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 > [ 97.640559][ T115] head: 05ffc00000010a09 fffffdffc3ed0001 000001ff000001fe 00000001ffffffff > [ 97.640581][ T115] head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 > [ 97.640600][ T115] page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) > > So we effectively only test the head page. Here we don't have the bit > set for that page. > > > In gup_fast() we perform a similar sanity check, which didn't trigger > at the time we pinned the folio. io_uring ends up calling > io_pin_pages() where we call pin_user_pages_fast(), so GUP-fast might > indeed trigger. 
> > What could trigger this (in weird scenarios, though) is if we used > pin_user_page() to obtain a page, then did folio = page_folio(page) > and called unpin_user_page(&folio->page) instead of using > unpin_folio(). Or using any other page that we didn't pin. It would be > a corner case, though. > > Staring at io_release_ubuf(), that's also not immediately what's > happening. > > There is this coalescing code in > io_sqe_buffer_register()->io_check_coalesce_buffer(), maybe ... > something is going wrong there? > > > > Otherwise, I could only envision (a) some random memory overwrite > clearing the bit or (b) some weird race between GUP-fast and PAE > clearing that we didn't run into so far. But these sanity checks have > been around for a loooong time at this point. > > Unfortunately, no reproducer :( Too bad there's no reproducer... Since this looks recent, I'd suspect the recent changes there. Most notably: commit f446c6311e86618a1f81eb576b56a6266307238f Author: Jens Axboe <axboe@kernel.dk> Date: Mon May 12 09:06:06 2025 -0600 io_uring/memmap: don't use page_address() on a highmem page which seems a bit odd, as this is arm64 and there'd be no highmem. This went into the 6.15 kernel release. Let's hope a reproducer is forthcoming. -- Jens Axboe
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-03 17:20 ` Jens Axboe @ 2025-06-03 17:25 ` David Hildenbrand 2025-06-03 17:36 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-03 17:25 UTC (permalink / raw) To: Jens Axboe, syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs, Catalin Marinas On 03.06.25 19:20, Jens Axboe wrote: > On 6/3/25 10:22 AM, David Hildenbrand wrote: >> On 03.06.25 17:31, syzbot wrote: >>> Hello, >>> >>> syzbot found the following issue on: >>> >>> HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci >>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>> console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000 >>> kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a >>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>> userspace arch: arm64 >>> >>> Unfortunately, I don't have any reproducer for this issue yet. >>> >>> Downloadable assets: >>> disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz >>> vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz >>> kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz >>> >>> IMPORTANT: if you fix the issue, please add the following tag to the commit: >>> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com >>> >>> head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 >>> page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) >>> ------------[ cut here ]------------ >>> kernel BUG at mm/gup.c:70! 
>>> Internal error: Oops - BUG: 00000000f2000800 [#1] SMP >>> Modules linked in: >>> >>> CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT >>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 >>> Workqueue: iou_exit io_ring_exit_work >>> pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>> pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >>> lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >>> sp : ffff800097f17640 >>> x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000 >>> x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000 >>> x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c >>> x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276 >>> x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001 >>> x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000 >>> x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00 >>> x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001 >>> x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68 >>> x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 >>> Call trace: >>> sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P) >>> unpin_user_page+0x80/0x10c mm/gup.c:191 >>> io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113 >>> io_buffer_unmap io_uring/rsrc.c:140 [inline] >>> io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513 >>> io_put_rsrc_node io_uring/rsrc.h:103 [inline] >>> io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197 >>> io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607 >>> io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723 >>> io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962 >>> process_one_work+0x7e8/0x156c kernel/workqueue.c:3238 >>> process_scheduled_works kernel/workqueue.c:3319 [inline] >>> worker_thread+0x958/0xed8 kernel/workqueue.c:3400 >>> kthread+0x5fc/0x75c 
kernel/kthread.c:464 >>> ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847 >>> Code: 900523a1 910e0021 aa1703e0 97fff8a9 (d4210000) >>> ---[ end trace 0000000000000000 ]--- >> >> So we lost a PAE bit for a pinned folio. >> >> [ 97.640225][ T115] page: refcount:512 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x13b400 >> [ 97.640378][ T115] head: order:9 mapcount:511 entire_mapcount:0 nr_pages_mapped:511 pincount:1 >> >> The folio is indeed pinned, and it is PTE-mapped (511 PTEs are mapped). >> >> The page we are using for unpinning is not mapped (mapcount:0). >> >> pfn:0x13b400 indicates that the page we are provided is actually the head page (folio->page). >> >> >> [ 97.640414][ T115] memcg:ffff0000f36b6000 >> [ 97.640435][ T115] anon flags: 0x5ffc0000002107c(referenced|uptodate|dirty|lru|arch_1|head|swapbacked|node=0|zone=2|lastcpupid=0x7ff) >> [ 97.640468][ T115] raw: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 >> [ 97.640490][ T115] raw: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 >> [ 97.640514][ T115] head: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 >> [ 97.640536][ T115] head: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 >> [ 97.640559][ T115] head: 05ffc00000010a09 fffffdffc3ed0001 000001ff000001fe 00000001ffffffff >> [ 97.640581][ T115] head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 >> [ 97.640600][ T115] page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) >> >> So we effectively only test the head page. Here we don't have the bit >> set for that page. >> >> >> In gup_fast() we perform a similar sanity check, which didn't trigger >> at the time we pinned the folio. io_uring ends up calling >> io_pin_pages() where we call pin_user_pages_fast(), so GUP-fast might >> indeed trigger. 
>> >> What could trigger this (in weird scenarios, though) is if we used >> pin_user_page() to obtain a page, then did folio = page_folio(page) >> and called unpin_user_page(&folio->page) instead of using >> unpin_folio(). Or using any other page that we didn't pin. It would be >> a corner case, though. >> >> Staring at io_release_ubuf(), that's also not immediately what's >> happening. >> >> There is this coalescing code in >> io_sqe_buffer_register()->io_check_coalesce_buffer(), maybe ... >> something is going wrong there? >> >> >> >> Otherwise, I could only envision (a) some random memory overwrite >> clearing the bit or (b) some weird race between GUP-fast and PAE >> clearing that we didn't run into so far. But these sanity checks have >> been around for a loooong time at this point. >> >> Unfortunately, no reproducer :( > > Too bad there's no reproducer... Since this looks recent, I'd suspect > the recent changes there. Most notably: > > commit f446c6311e86618a1f81eb576b56a6266307238f > Author: Jens Axboe <axboe@kernel.dk> > Date: Mon May 12 09:06:06 2025 -0600 > > io_uring/memmap: don't use page_address() on a highmem page > > which seems a bit odd, as this is arm64 and there'd be no highmem. This > went into the 6.15 kernel release. Let's hope a reproducer is > forthcoming. Yeah, that does not really look problematic. Interestingly, this was found in git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci Hm. Let me dig a bit, but if it's some corner case race, it's weird that we didn't find it earlier. -- Cheers, David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-03 17:25 ` David Hildenbrand @ 2025-06-03 17:36 ` Jens Axboe 0 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2025-06-03 17:36 UTC (permalink / raw) To: David Hildenbrand, syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs, Catalin Marinas On 6/3/25 11:25 AM, David Hildenbrand wrote: > On 03.06.25 19:20, Jens Axboe wrote: >> On 6/3/25 10:22 AM, David Hildenbrand wrote: >>> On 03.06.25 17:31, syzbot wrote: >>>> Hello, >>>> >>>> syzbot found the following issue on: >>>> >>>> HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci >>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000 >>>> kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a >>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>>> userspace arch: arm64 >>>> >>>> Unfortunately, I don't have any reproducer for this issue yet. >>>> >>>> Downloadable assets: >>>> disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz >>>> vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz >>>> kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz >>>> >>>> IMPORTANT: if you fix the issue, please add the following tag to the commit: >>>> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com >>>> >>>> head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 >>>> page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) >>>> ------------[ cut here ]------------ >>>> kernel BUG at mm/gup.c:70! 
>>>> Internal error: Oops - BUG: 00000000f2000800 [#1] SMP >>>> Modules linked in: >>>> >>>> CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT >>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 >>>> Workqueue: iou_exit io_ring_exit_work >>>> pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) >>>> pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >>>> lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 >>>> sp : ffff800097f17640 >>>> x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000 >>>> x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000 >>>> x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c >>>> x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276 >>>> x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001 >>>> x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000 >>>> x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00 >>>> x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001 >>>> x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68 >>>> x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 >>>> Call trace: >>>> sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P) >>>> unpin_user_page+0x80/0x10c mm/gup.c:191 >>>> io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113 >>>> io_buffer_unmap io_uring/rsrc.c:140 [inline] >>>> io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513 >>>> io_put_rsrc_node io_uring/rsrc.h:103 [inline] >>>> io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197 >>>> io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607 >>>> io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723 >>>> io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962 >>>> process_one_work+0x7e8/0x156c kernel/workqueue.c:3238 >>>> process_scheduled_works kernel/workqueue.c:3319 [inline] >>>> worker_thread+0x958/0xed8 kernel/workqueue.c:3400 >>>> 
kthread+0x5fc/0x75c kernel/kthread.c:464 >>>> ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847 >>>> Code: 900523a1 910e0021 aa1703e0 97fff8a9 (d4210000) >>>> ---[ end trace 0000000000000000 ]--- >>> >>> So we lost a PAE bit for a pinned folio. >>> >>> [ 97.640225][ T115] page: refcount:512 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x13b400 >>> [ 97.640378][ T115] head: order:9 mapcount:511 entire_mapcount:0 nr_pages_mapped:511 pincount:1 >>> >>> The folio is indeed pinned, and it is PTE-mapped (511 PTEs are mapped). >>> >>> The page we are using for unpinning is not mapped (mapcount:0). >>> >>> pfn:0x13b400 indicates that the page we are provided is actually the head page (folio->page). >>> >>> >>> [ 97.640414][ T115] memcg:ffff0000f36b6000 >>> [ 97.640435][ T115] anon flags: 0x5ffc0000002107c(referenced|uptodate|dirty|lru|arch_1|head|swapbacked|node=0|zone=2|lastcpupid=0x7ff) >>> [ 97.640468][ T115] raw: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 >>> [ 97.640490][ T115] raw: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 >>> [ 97.640514][ T115] head: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1 >>> [ 97.640536][ T115] head: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000 >>> [ 97.640559][ T115] head: 05ffc00000010a09 fffffdffc3ed0001 000001ff000001fe 00000001ffffffff >>> [ 97.640581][ T115] head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200 >>> [ 97.640600][ T115] page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) >>> >>> So we effectively only test the head page. Here we don't have the bit >>> set for that page. >>> >>> >>> In gup_fast() we perform a similar sanity check, which didn't trigger >>> at the time we pinned the folio. io_uring ends up calling >>> io_pin_pages() where we call pin_user_pages_fast(), so GUP-fast might >>> indeed trigger. 
>>> >>> What could trigger this (in weird scenarios, though) is if we used >>> pin_user_page() to obtain a page, then did folio = page_folio(page) >>> and called unpin_user_page(&folio->page) instead of using >>> unpin_folio(). Or using any other page that we didn't pin. It would be >>> a corner case, though. >>> >>> Staring at io_release_ubuf(), that's also not immediately what's >>> happening. >>> >>> There is this coalescing code in >>> io_sqe_buffer_register()->io_check_coalesce_buffer(), maybe ... >>> something is going wrong there? >>> >>> >>> >>> Otherwise, I could only envision (a) some random memory overwrite >>> clearing the bit or (b) some weird race between GUP-fast and PAE >>> clearing that we didn't run into so far. But these sanity checks have >>> been around for a loooong time at this point. >>> >>> Unfortunately, no reproducer :( >> >> Too bad there's no reproducer... Since this looks recent, I'd suspect >> the recent changes there. Most notably: >> >> commit f446c6311e86618a1f81eb576b56a6266307238f >> Author: Jens Axboe <axboe@kernel.dk> >> Date: Mon May 12 09:06:06 2025 -0600 >> >> io_uring/memmap: don't use page_address() on a highmem page >> >> which seems a bit odd, as this is arm64 and there'd be no highmem. This >> went into the 6.15 kernel release. Let's hope a reproducer is >> forthcoming. > > Yeah, that does not really look problematic. > > Interestingly, this was found in > > git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci > > Hm. Yep, pulled that into 6.15 as released, and got a few mm/ changes in there. So perhaps related? -- Jens Axboe
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-03 15:31 [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages syzbot 2025-06-03 16:22 ` David Hildenbrand @ 2025-06-21 21:52 ` syzbot 2025-06-23 9:29 ` David Hildenbrand 1 sibling, 1 reply; 17+ messages in thread From: syzbot @ 2025-06-21 21:52 UTC (permalink / raw) To: akpm, axboe, catalin.marinas, david, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs syzbot has found a reproducer for the following issue on: HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 userspace arch: arm64 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/974f3ac1c6a5/disk-9aa9b43d.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/a5b5075d317f/vmlinux-9aa9b43d.xz kernel image: https://storage.googleapis.com/syzbot-assets/2f0ba7fec19b/Image-9aa9b43d.gz.xz mounted in repro: https://storage.googleapis.com/syzbot-assets/76067befefec/mount_4.gz fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=1549f6bc580000) IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000200 page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page)) ------------[ cut here ]------------ kernel BUG at 
mm/gup.c:71! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 2171 Comm: kworker/u8:9 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: iou_exit io_ring_exit_work pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70 lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70 sp : ffff8000a03a7640 x29: ffff8000a03a7660 x28: dfff800000000000 x27: 1fffffbff8723000 x26: 05ffc00000020178 x25: 05ffc00000020178 x24: fffffdffc3918000 x23: fffffdffc3918000 x22: ffff8000a03a76e0 x21: 05ffc00000020178 x20: 0000000000000000 x19: ffff8000a03a76e0 x18: 00000000ffffffff x17: 703e2d6f696c6f66 x16: ffff80008aecb65c x15: 0000000000000001 x14: 1fffe000337e14e2 x13: 0000000000000000 x12: 0000000000000000 x11: ffff6000337e14e3 x10: 0000000000ff0100 x9 : cc07ffb5a919f400 x8 : cc07ffb5a919f400 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a03a6d58 x4 : ffff80008f727060 x3 : ffff8000807bef2c x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061 Call trace: sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70 (P) unpin_user_page+0x80/0x10c mm/gup.c:192 io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:116 io_buffer_unmap io_uring/rsrc.c:143 [inline] io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:516 io_put_rsrc_node io_uring/rsrc.h:103 [inline] io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:200 io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:610 io_ring_ctx_free+0x48/0x480 io_uring/io_uring.c:2729 io_ring_exit_work+0x764/0x7d8 io_uring/io_uring.c:2971 process_one_work+0x7e8/0x155c kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3321 [inline] worker_thread+0x958/0xed8 kernel/workqueue.c:3402 kthread+0x5fc/0x75c kernel/kthread.c:464 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847 Code: b0052bc1 91008021 aa1703e0 97fff8ab (d4210000) ---[ end 
trace 0000000000000000 ]--- --- If you want syzbot to run the reproducer, reply with: #syz test: git://repo/address.git branch-or-commit-hash If you attach or paste a git patch, syzbot will apply it before testing. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-21 21:52 ` syzbot @ 2025-06-23 9:29 ` David Hildenbrand 2025-06-23 9:53 ` Alexander Potapenko 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 9:29 UTC (permalink / raw) To: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 21.06.25 23:52, syzbot wrote: > syzbot has found a reproducer for the following issue on: > > HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci > git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci > console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 > kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd > dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 > compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 > userspace arch: arm64 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 There is not that much magic in there, I'm afraid. fork() is only used to spin up guests, but before the memory region of interest is actually allocated, IIUC. No threading code that races. IIUC, it triggers fairly fast on aarch64. I've left it running for a while on x86_64 without any luck. So maybe this is really some aarch64-special stuff (pointer tagging?). In particular, there is something very weird in the reproducer: syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); advice is supposed to be a 32bit int. What does the magical "0x800000000" do? Let me try my luck reproducing it on arm. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 9:29 ` David Hildenbrand @ 2025-06-23 9:53 ` Alexander Potapenko 2025-06-23 10:10 ` David Hildenbrand 0 siblings, 1 reply; 17+ messages in thread From: Alexander Potapenko @ 2025-06-23 9:53 UTC (permalink / raw) To: David Hildenbrand Cc: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: > > On 21.06.25 23:52, syzbot wrote: > > syzbot has found a reproducer for the following issue on: > > > > HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci > > git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci > > console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd > > dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 > > compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 > > userspace arch: arm64 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 > > There is not that much magic in there, I'm afraid. > > fork() is only used to spin up guests, but before the memory region of > interest is actually allocated, IIUC. No threading code that races. > > IIUC, it triggers fairly fast on aarch64. I've left it running for a > while on x86_64 without any luck. > > So maybe this is really some aarch64-special stuff (pointer tagging?). > > In particular, there is something very weird in the reproducer: > > syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, > /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); > > advise is supposed to be a 32bit int. What does the magical > "0x800000000" do? 
I am pretty sure this is a red herring. Syzkaller sometimes mutates integer flags, even if the result makes no sense - because sometimes it can trigger interesting bugs. This `advice` argument will be discarded by is_valid_madvise(), resulting in -EINVAL. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 9:53 ` Alexander Potapenko @ 2025-06-23 10:10 ` David Hildenbrand 2025-06-23 12:22 ` David Hildenbrand 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 10:10 UTC (permalink / raw) To: Alexander Potapenko Cc: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 23.06.25 11:53, Alexander Potapenko wrote: > On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via > syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >> >> On 21.06.25 23:52, syzbot wrote: >>> syzbot has found a reproducer for the following issue on: >>> >>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci >>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 >>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd >>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>> userspace arch: arm64 >>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 >>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 >> >> There is not that much magic in there, I'm afraid. >> >> fork() is only used to spin up guests, but before the memory region of >> interest is actually allocated, IIUC. No threading code that races. >> >> IIUC, it triggers fairly fast on aarch64. I've left it running for a >> while on x86_64 without any luck. >> >> So maybe this is really some aarch64-special stuff (pointer tagging?). 
>> >> In particular, there is something very weird in the reproducer: >> >> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, >> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); >> >> advise is supposed to be a 32bit int. What does the magical >> "0x800000000" do? > > I am pretty sure this is a red herring. > Syzkaller sometimes mutates integer flags, even if the result makes no > sense - because sometimes it can trigger interesting bugs. > This `advice` argument will be discarded by is_valid_madvise(), > resulting in -EINVAL. I thought the same, but likely the upper bits are discarded, and we end up with __NR_madvise succeeding. The kernel config has CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y So without MADV_HUGEPAGE, we wouldn't get a THP in the first place. So likely this is really just like dropping the "0x800000000" Anyhow, I managed to reproduce in the VM using the provided rootfs on aarch64. It triggers immediately, so no races involved. Running the reproducer on a Fedora 42 debug-kernel in the hypervisor does not trigger. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 10:10 ` David Hildenbrand @ 2025-06-23 12:22 ` David Hildenbrand 2025-06-23 12:47 ` David Hildenbrand 2025-06-23 14:58 ` Jens Axboe 0 siblings, 2 replies; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 12:22 UTC (permalink / raw) To: Alexander Potapenko, axboe Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 23.06.25 12:10, David Hildenbrand wrote: > On 23.06.25 11:53, Alexander Potapenko wrote: >> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via >> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>> >>> On 21.06.25 23:52, syzbot wrote: >>>> syzbot has found a reproducer for the following issue on: >>>> >>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci >>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 >>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd >>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>>> userspace arch: arm64 >>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 >>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 >>> >>> There is not that much magic in there, I'm afraid. >>> >>> fork() is only used to spin up guests, but before the memory region of >>> interest is actually allocated, IIUC. No threading code that races. >>> >>> IIUC, it triggers fairly fast on aarch64. I've left it running for a >>> while on x86_64 without any luck. >>> >>> So maybe this is really some aarch64-special stuff (pointer tagging?). 
>>> >>> In particular, there is something very weird in the reproducer: >>> >>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, >>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); >>> >>> advise is supposed to be a 32bit int. What does the magical >>> "0x800000000" do? >> >> I am pretty sure this is a red herring. >> Syzkaller sometimes mutates integer flags, even if the result makes no >> sense - because sometimes it can trigger interesting bugs. >> This `advice` argument will be discarded by is_valid_madvise(), >> resulting in -EINVAL. > > I thought the same, but likely the upper bits are discarded, and we end > up with __NR_madvise succeeding. > > The kernel config has > > CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y > > So without MADV_HUGEPAGE, we wouldn't get a THP in the first place. > > So likely this is really just like dropping the "0x800000000" > > Anyhow, I managed to reproduce in the VM using the provided rootfs on > aarch64. It triggers immediately, so no races involved. > > Running the reproducer on a Fedora 42 debug-kernel in the hypervisor > does not trigger. Simplified reproducer that does not depend on a race with the child process. As expected previously, we have PAE cleared on the head page, because it is/was COW-shared with a child process. We are registering more than one consecutive tail pages of that THP through iouring, GUP-pinning them. These pages are not COW-shared and, therefore, do not have PAE set. #define _GNU_SOURCE #include <stdio.h> #include <string.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/types.h> #include <liburing.h> int main(void) { struct io_uring_params params = { .wq_fd = -1, }; struct iovec iovec; const size_t pagesize = getpagesize(); size_t size = 2048 * pagesize; char *addr; int fd; /* We need a THP-aligned area. 
*/ addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("MAP_FIXED failed\n"); return 1; } if (madvise(addr, size, MADV_HUGEPAGE)) { perror("MADV_HUGEPAGE failed\n"); return 1; } /* Populate a THP. */ memset(addr, 0, size); /* COW-share only the first page ... */ if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) { perror("MADV_DONTFORK failed\n"); return 1; } /* ... using fork(). This will clear PAE on the head page. */ if (fork() == 0) exit(0); /* Setup iouring */ fd = syscall(__NR_io_uring_setup, 1024, &params); if (fd < 0) { perror("__NR_io_uring_setup failed\n"); return 1; } /* Register (GUP-pin) two consecutive tail pages. */ iovec.iov_base = addr + pagesize; iovec.iov_len = 2 * pagesize; syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1); return 0; } [ 108.070381][ T14] kernel BUG at mm/gup.c:71! [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 108.117202][ T14] Modules linked in: [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0 [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0 [ 108.138025][ T14] sp : ffff800097ac7640 [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000 [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000 [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
[ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700 [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001 [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348 [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061 [ 108.174205][ T14] Call trace: [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) [ 108.178138][ T14] unpin_user_page+0x80/0x10c [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 [ 108.193207][ T14] process_one_work+0x7e8/0x155c [ 108.195431][ T14] worker_thread+0x958/0xed8 [ 108.197561][ T14] kthread+0x5fc/0x75c [ 108.199362][ T14] ret_from_fork+0x10/0x20 When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. So, we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page (IOW, one we never pinned). So it's related to the io_coalesce_buffer() machinery. And in fact, in there, we have this weird logic: /* Store head pages only*/ new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); ... Essentially discarding the subpage information when coalescing tail pages. I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be flawed (we can -- in theory -- coalesce different folio page ranges in a GUP result?). @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first place. 
Can you look into that, as you are more familiar with the logic? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 12:22 ` David Hildenbrand @ 2025-06-23 12:47 ` David Hildenbrand 2025-06-23 14:58 ` Jens Axboe 1 sibling, 0 replies; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 12:47 UTC (permalink / raw) To: Alexander Potapenko, axboe Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 23.06.25 14:22, David Hildenbrand wrote: > On 23.06.25 12:10, David Hildenbrand wrote: >> On 23.06.25 11:53, Alexander Potapenko wrote: >>> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via >>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>> >>>> On 21.06.25 23:52, syzbot wrote: >>>>> syzbot has found a reproducer for the following issue on: >>>>> >>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci >>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 >>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd >>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>>>> userspace arch: arm64 >>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 >>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 >>>> >>>> There is not that much magic in there, I'm afraid. >>>> >>>> fork() is only used to spin up guests, but before the memory region of >>>> interest is actually allocated, IIUC. No threading code that races. >>>> >>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a >>>> while on x86_64 without any luck. >>>> >>>> So maybe this is really some aarch64-special stuff (pointer tagging?). 
>>>> >>>> In particular, there is something very weird in the reproducer: >>>> >>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, >>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); >>>> >>>> advise is supposed to be a 32bit int. What does the magical >>>> "0x800000000" do? >>> >>> I am pretty sure this is a red herring. >>> Syzkaller sometimes mutates integer flags, even if the result makes no >>> sense - because sometimes it can trigger interesting bugs. >>> This `advice` argument will be discarded by is_valid_madvise(), >>> resulting in -EINVAL. >> >> I thought the same, but likely the upper bits are discarded, and we end >> up with __NR_madvise succeeding. >> >> The kernel config has >> >> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y >> >> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place. >> >> So likely this is really just like dropping the "0x800000000" >> >> Anyhow, I managed to reproduce in the VM using the provided rootfs on >> aarch64. It triggers immediately, so no races involved. >> >> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor >> does not trigger. > > Simplified reproducer that does not depend on a race with the > child process. > > As expected previously, we have PAE cleared on the head page, > because it is/was COW-shared with a child process. > > We are registering more than one consecutive tail pages of that > THP through iouring, GUP-pinning them. These pages are not > COW-shared and, therefore, do not have PAE set. > > #define _GNU_SOURCE > #include <stdio.h> > #include <string.h> > #include <stdlib.h> > #include <sys/ioctl.h> > #include <sys/mman.h> > #include <sys/syscall.h> > #include <sys/types.h> > #include <liburing.h> > > int main(void) > { > struct io_uring_params params = { > .wq_fd = -1, > }; > struct iovec iovec; > const size_t pagesize = getpagesize(); > size_t size = 2048 * pagesize; > char *addr; > int fd; > > /* We need a THP-aligned area. 
*/ > addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ, > MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); > if (addr == MAP_FAILED) { > perror("MAP_FIXED failed\n"); > return 1; > } > > if (madvise(addr, size, MADV_HUGEPAGE)) { > perror("MADV_HUGEPAGE failed\n"); > return 1; > } > > /* Populate a THP. */ > memset(addr, 0, size); > > /* COW-share only the first page ... */ > if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) { > perror("MADV_DONTFORK failed\n"); > return 1; > } > > /* ... using fork(). This will clear PAE on the head page. */ > if (fork() == 0) > exit(0); > > /* Setup iouring */ > fd = syscall(__NR_io_uring_setup, 1024, &params); > if (fd < 0) { > perror("__NR_io_uring_setup failed\n"); > return 1; > } > > /* Register (GUP-pin) two consecutive tail pages. */ > iovec.iov_base = addr + pagesize; > iovec.iov_len = 2 * pagesize; > syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1); > return 0; > } > > [ 108.070381][ T14] kernel BUG at mm/gup.c:71! 
> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP > [ 108.117202][ T14] Modules linked in: > [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT > [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 > [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work > [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0 > [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0 > [ 108.138025][ T14] sp : ffff800097ac7640 > [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000 > [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000 > [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c > [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff > [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4 > [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff > [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700 > [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001 > [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348 > [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061 > [ 108.174205][ T14] Call trace: > [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) > [ 108.178138][ T14] unpin_user_page+0x80/0x10c > [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 > [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c > [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 > [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 > [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 > [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 > 
[ 108.193207][ T14] process_one_work+0x7e8/0x155c > [ 108.195431][ T14] worker_thread+0x958/0xed8 > [ 108.197561][ T14] kthread+0x5fc/0x75c > [ 108.199362][ T14] ret_from_fork+0x10/0x20 FWIW, a slight cow.c selftest modification can trigger the same: diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 4214070d03ce..50c538b47bb4 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -991,6 +991,8 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize) log_test_result(KSFT_FAIL); goto munmap; } + mem += pagesize; + size -= pagesize; break; default: assert(false); -- Cheers, David / dhildenb ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 12:22 ` David Hildenbrand 2025-06-23 12:47 ` David Hildenbrand @ 2025-06-23 14:58 ` Jens Axboe 2025-06-23 15:11 ` David Hildenbrand 1 sibling, 1 reply; 17+ messages in thread From: Jens Axboe @ 2025-06-23 14:58 UTC (permalink / raw) To: David Hildenbrand, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs, Pavel Begunkov On 6/23/25 6:22 AM, David Hildenbrand wrote: > On 23.06.25 12:10, David Hildenbrand wrote: >> On 23.06.25 11:53, Alexander Potapenko wrote: >>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via >>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>> >>>> On 21.06.25 23:52, syzbot wrote: >>>>> syzbot has found a reproducer for the following issue on: >>>>> >>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci >>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 >>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd >>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>>>> userspace arch: arm64 >>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 >>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 >>>> >>>> There is not that much magic in there, I'm afraid. >>>> >>>> fork() is only used to spin up guests, but before the memory region of >>>> interest is actually allocated, IIUC. No threading code that races. >>>> >>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a >>>> while on x86_64 without any luck. >>>> >>>> So maybe this is really some aarch64-special stuff (pointer tagging?). 
>>>> >>>> In particular, there is something very weird in the reproducer: >>>> >>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, >>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); >>>> >>>> advise is supposed to be a 32bit int. What does the magical >>>> "0x800000000" do? >>> >>> I am pretty sure this is a red herring. >>> Syzkaller sometimes mutates integer flags, even if the result makes no >>> sense - because sometimes it can trigger interesting bugs. >>> This `advice` argument will be discarded by is_valid_madvise(), >>> resulting in -EINVAL. >> >> I thought the same, but likely the upper bits are discarded, and we end >> up with __NR_madvise succeeding. >> >> The kernel config has >> >> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y >> >> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place. >> >> So likely this is really just like dropping the "0x800000000" >> >> Anyhow, I managed to reproduce in the VM using the provided rootfs on >> aarch64. It triggers immediately, so no races involved. >> >> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor >> does not trigger. > > Simplified reproducer that does not depend on a race with the > child process. > > As expected previously, we have PAE cleared on the head page, > because it is/was COW-shared with a child process. > > We are registering more than one consecutive tail pages of that > THP through iouring, GUP-pinning them. These pages are not > COW-shared and, therefore, do not have PAE set. > > #define _GNU_SOURCE > #include <stdio.h> > #include <string.h> > #include <stdlib.h> > #include <sys/ioctl.h> > #include <sys/mman.h> > #include <sys/syscall.h> > #include <sys/types.h> > #include <liburing.h> > > int main(void) > { > struct io_uring_params params = { > .wq_fd = -1, > }; > struct iovec iovec; > const size_t pagesize = getpagesize(); > size_t size = 2048 * pagesize; > char *addr; > int fd; > > /* We need a THP-aligned area. 
*/ > addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ, > MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); > if (addr == MAP_FAILED) { > perror("MAP_FIXED failed\n"); > return 1; > } > > if (madvise(addr, size, MADV_HUGEPAGE)) { > perror("MADV_HUGEPAGE failed\n"); > return 1; > } > > /* Populate a THP. */ > memset(addr, 0, size); > > /* COW-share only the first page ... */ > if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) { > perror("MADV_DONTFORK failed\n"); > return 1; > } > > /* ... using fork(). This will clear PAE on the head page. */ > if (fork() == 0) > exit(0); > > /* Setup iouring */ > fd = syscall(__NR_io_uring_setup, 1024, &params); > if (fd < 0) { > perror("__NR_io_uring_setup failed\n"); > return 1; > } > > /* Register (GUP-pin) two consecutive tail pages. */ > iovec.iov_base = addr + pagesize; > iovec.iov_len = 2 * pagesize; > syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1); > return 0; > } > > [ 108.070381][ T14] kernel BUG at mm/gup.c:71! 
> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP > [ 108.117202][ T14] Modules linked in: > [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT > [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 > [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work > [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0 > [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0 > [ 108.138025][ T14] sp : ffff800097ac7640 > [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000 > [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000 > [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c > [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff > [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4 > [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff > [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700 > [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001 > [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348 > [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061 > [ 108.174205][ T14] Call trace: > [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) > [ 108.178138][ T14] unpin_user_page+0x80/0x10c > [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 > [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c > [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 > [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 > [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 > [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 > 
[ 108.193207][ T14] process_one_work+0x7e8/0x155c > [ 108.195431][ T14] worker_thread+0x958/0xed8 > [ 108.197561][ T14] kthread+0x5fc/0x75c > [ 108.199362][ T14] ret_from_fork+0x10/0x20 > > > When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. > > So, if we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() > on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page > (IOW, one we never pinned). > > So it's related to the io_coalesce_buffer() machinery. > > And in fact, in there, we have this weird logic: > > /* Store head pages only*/ > new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); > ... > > > Essentially discarding the subpage information when coalescing tail pages. > > > I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be > flawed (we can -- in theory -- coalesc different folio page ranges in > a GUP result?). > > @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up > imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first > place. > > Can you look into that, as you are more familiar with the logic? Leaving this all quoted and adding Pavel, who wrote that code. I'm currently away, so can't look into this right now. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 14:58 ` Jens Axboe @ 2025-06-23 15:11 ` David Hildenbrand 2025-06-23 16:48 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 15:11 UTC (permalink / raw) To: Jens Axboe, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs, Pavel Begunkov On 23.06.25 16:58, Jens Axboe wrote: > On 6/23/25 6:22 AM, David Hildenbrand wrote: >> On 23.06.25 12:10, David Hildenbrand wrote: >>> On 23.06.25 11:53, Alexander Potapenko wrote: >>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via >>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>>> >>>>> On 21.06.25 23:52, syzbot wrote: >>>>>> syzbot has found a reproducer for the following issue on: >>>>>> >>>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci >>>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci >>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000 >>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd >>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6 >>>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6 >>>>>> userspace arch: arm64 >>>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000 >>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000 >>>>> >>>>> There is not that much magic in there, I'm afraid. >>>>> >>>>> fork() is only used to spin up guests, but before the memory region of >>>>> interest is actually allocated, IIUC. No threading code that races. >>>>> >>>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a >>>>> while on x86_64 without any luck. >>>>> >>>>> So maybe this is really some aarch64-special stuff (pointer tagging?). 
>>>>> >>>>> In particular, there is something very weird in the reproducer: >>>>> >>>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul, >>>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul); >>>>> >>>>> advise is supposed to be a 32bit int. What does the magical >>>>> "0x800000000" do? >>>> >>>> I am pretty sure this is a red herring. >>>> Syzkaller sometimes mutates integer flags, even if the result makes no >>>> sense - because sometimes it can trigger interesting bugs. >>>> This `advice` argument will be discarded by is_valid_madvise(), >>>> resulting in -EINVAL. >>> >>> I thought the same, but likely the upper bits are discarded, and we end >>> up with __NR_madvise succeeding. >>> >>> The kernel config has >>> >>> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y >>> >>> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place. >>> >>> So likely this is really just like dropping the "0x800000000" >>> >>> Anyhow, I managed to reproduce in the VM using the provided rootfs on >>> aarch64. It triggers immediately, so no races involved. >>> >>> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor >>> does not trigger. >> >> Simplified reproducer that does not depend on a race with the >> child process. >> >> As expected previously, we have PAE cleared on the head page, >> because it is/was COW-shared with a child process. >> >> We are registering more than one consecutive tail pages of that >> THP through iouring, GUP-pinning them. These pages are not >> COW-shared and, therefore, do not have PAE set. 
>> >> #define _GNU_SOURCE >> #include <stdio.h> >> #include <string.h> >> #include <stdlib.h> >> #include <sys/ioctl.h> >> #include <sys/mman.h> >> #include <sys/syscall.h> >> #include <sys/types.h> >> #include <liburing.h> >> >> int main(void) >> { >> struct io_uring_params params = { >> .wq_fd = -1, >> }; >> struct iovec iovec; >> const size_t pagesize = getpagesize(); >> size_t size = 2048 * pagesize; >> char *addr; >> int fd; >> >> /* We need a THP-aligned area. */ >> addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ, >> MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); >> if (addr == MAP_FAILED) { >> perror("MAP_FIXED failed\n"); >> return 1; >> } >> >> if (madvise(addr, size, MADV_HUGEPAGE)) { >> perror("MADV_HUGEPAGE failed\n"); >> return 1; >> } >> >> /* Populate a THP. */ >> memset(addr, 0, size); >> >> /* COW-share only the first page ... */ >> if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) { >> perror("MADV_DONTFORK failed\n"); >> return 1; >> } >> >> /* ... using fork(). This will clear PAE on the head page. */ >> if (fork() == 0) >> exit(0); >> >> /* Setup iouring */ >> fd = syscall(__NR_io_uring_setup, 1024, ¶ms); >> if (fd < 0) { >> perror("__NR_io_uring_setup failed\n"); >> return 1; >> } >> >> /* Register (GUP-pin) two consecutive tail pages. */ >> iovec.iov_base = addr + pagesize; >> iovec.iov_len = 2 * pagesize; >> syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1); >> return 0; >> } >> >> [ 108.070381][ T14] kernel BUG at mm/gup.c:71! 
>> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP >> [ 108.117202][ T14] Modules linked in: >> [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT >> [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 >> [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work >> [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) >> [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0 >> [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0 >> [ 108.138025][ T14] sp : ffff800097ac7640 >> [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000 >> [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000 >> [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c >> [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff >> [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4 >> [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff >> [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700 >> [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001 >> [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348 >> [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061 >> [ 108.174205][ T14] Call trace: >> [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) >> [ 108.178138][ T14] unpin_user_page+0x80/0x10c >> [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 >> [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c >> [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 >> [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 >> [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 >> [ 108.191057][ T14] 
io_ring_exit_work+0x764/0x7d8 >> [ 108.193207][ T14] process_one_work+0x7e8/0x155c >> [ 108.195431][ T14] worker_thread+0x958/0xed8 >> [ 108.197561][ T14] kthread+0x5fc/0x75c >> [ 108.199362][ T14] ret_from_fork+0x10/0x20 >> >> >> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. >> >> So, if we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() >> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page >> (IOW, one we never pinned). >> >> So it's related to the io_coalesce_buffer() machinery. >> >> And in fact, in there, we have this weird logic: >> >> /* Store head pages only*/ >> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); >> ... >> >> >> Essentially discarding the subpage information when coalescing tail pages. >> >> >> I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be >> flawed (we can -- in theory -- coalesc different folio page ranges in >> a GUP result?). >> >> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up >> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first >> place. >> >> Can you look into that, as you are more familiar with the logic? > > Leaving this all quoted and adding Pavel, who wrote that code. I'm > currently away, so can't look into this right now. I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data(). Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always run into the if (page_folio(page_array[i]) == folio && page_array[i] == page_array[i-1] + 1) { count++; continue; } case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong. 
In general, storing head pages when they are not the first page to be coalesced seems wrong. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 15:11 ` David Hildenbrand @ 2025-06-23 16:48 ` Pavel Begunkov 2025-06-23 16:59 ` David Hildenbrand 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2025-06-23 16:48 UTC (permalink / raw) To: David Hildenbrand, Jens Axboe, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 6/23/25 16:11, David Hildenbrand wrote: > On 23.06.25 16:58, Jens Axboe wrote: >> On 6/23/25 6:22 AM, David Hildenbrand wrote: >>> On 23.06.25 12:10, David Hildenbrand wrote: >>>> On 23.06.25 11:53, Alexander Potapenko wrote: >>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via >>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>>>> ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. >>> >>> So, if we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() >>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page >>> (IOW, one we never pinned). >>> >>> So it's related to the io_coalesce_buffer() machinery. >>> >>> And in fact, in there, we have this weird logic: >>> >>> /* Store head pages only*/ >>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); >>> ... >>> >>> >>> Essentially discarding the subpage information when coalescing tail pages. >>> >>> >>> I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be >>> flawed (we can -- in theory -- coalesc different folio page ranges in >>> a GUP result?). >>> >>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up >>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first >>> place. >>> >>> Can you look into that, as you are more familiar with the logic? >> >> Leaving this all quoted and adding Pavel, who wrote that code. 
I'm >> currently away, so can't look into this right now. Chenliang Li did, but not like it matters > I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data(). > > Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always > run into the > > if (page_folio(page_array[i]) == folio && > page_array[i] == page_array[i-1] + 1) { > count++; > continue; > } > > case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong. > > In general, storing head pages when they are not the first page to be coalesced seems wrong. Yes, it stores the head page even if the range passed to pin_user_pages() doesn't cover the head page. It should be converted to unpin_user_folio(), which doesn't seem to do sanity_check_pinned_pages(). Do you think that'll be enough (conceptually)? Nobody is actually touching the head page in those cases apart from the final unpin, and storing the head page is more convenient than keeping folios. I'll take a look if it can be fully converted to folios w/o extra overhead. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 16:48 ` Pavel Begunkov @ 2025-06-23 16:59 ` David Hildenbrand 2025-06-23 17:36 ` David Hildenbrand 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 16:59 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 23.06.25 18:48, Pavel Begunkov wrote: > On 6/23/25 16:11, David Hildenbrand wrote: >> On 23.06.25 16:58, Jens Axboe wrote: >>> On 6/23/25 6:22 AM, David Hildenbrand wrote: >>>> On 23.06.25 12:10, David Hildenbrand wrote: >>>>> On 23.06.25 11:53, Alexander Potapenko wrote: >>>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via >>>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>>>>> > ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. >>>> >>>> So, if we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() >>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page >>>> (IOW, one we never pinned). >>>> >>>> So it's related to the io_coalesce_buffer() machinery. >>>> >>>> And in fact, in there, we have this weird logic: >>>> >>>> /* Store head pages only*/ >>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); >>>> ... >>>> >>>> >>>> Essentially discarding the subpage information when coalescing tail pages. >>>> >>>> >>>> I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be >>>> flawed (we can -- in theory -- coalesc different folio page ranges in >>>> a GUP result?). >>>> >>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up >>>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first >>>> place. >>>> >>>> Can you look into that, as you are more familiar with the logic? 
>>> >>> Leaving this all quoted and adding Pavel, who wrote that code. I'm >>> currently away, so can't look into this right now. > > Chenliang Li did, but not like it matters > >> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data(). >> >> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always >> run into the >> >> if (page_folio(page_array[i]) == folio && >> page_array[i] == page_array[i-1] + 1) { >> count++; >> continue; >> } >> >> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong. >> >> In general, storing head pages when they are not the first page to be coalesced seems wrong. > > Yes, it stores the head page even if the range passed to > pin_user_pages() doesn't cover the head page. > > It should be converted to unpin_user_folio(), which doesn't seem > to do sanity_check_pinned_pages(). Do you think that'll be enough > (conceptually)? Nobody is actually touching the head page in those > cases apart from the final unpin, and storing the head page is > more convenient than keeping folios. I'll take a look if it can > be fully converted to folios w/o extra overhead. Assuming we had from GUP nr_pages = 2 pages[0] = folio_page(folio, 1) pages[1] = folio_page(folio, 2) After io_coalesce_buffer() we have nr_pages = 1 pages[0] = folio_page(folio, 0) Using unpin_user_folio() in all places where we could see something like that would be the right thing to do. The sanity checks are not in unpin_user_folio() for exactly that reason: we don't know which folio pages we pinned. But now I wonder where you make sure that "Nobody is actually touching the head page"? How do you get back the "which folio range" information after io_coalesce_buffer() ? 
If you rely on alignment in virtual address space for you, combined with imu->folio_shift, that might not work reliably ... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 16:59 ` David Hildenbrand @ 2025-06-23 17:36 ` David Hildenbrand 2025-06-23 18:02 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand @ 2025-06-23 17:36 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 23.06.25 18:59, David Hildenbrand wrote: > On 23.06.25 18:48, Pavel Begunkov wrote: >> On 6/23/25 16:11, David Hildenbrand wrote: >>> On 23.06.25 16:58, Jens Axboe wrote: >>>> On 6/23/25 6:22 AM, David Hildenbrand wrote: >>>>> On 23.06.25 12:10, David Hildenbrand wrote: >>>>>> On 23.06.25 11:53, Alexander Potapenko wrote: >>>>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via >>>>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote: >>>>>>>> >> ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected. >>>>> >>>>> So, if we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page() >>>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page >>>>> (IOW, one we never pinned). >>>>> >>>>> So it's related to the io_coalesce_buffer() machinery. >>>>> >>>>> And in fact, in there, we have this weird logic: >>>>> >>>>> /* Store head pages only*/ >>>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL); >>>>> ... >>>>> >>>>> >>>>> Essentially discarding the subpage information when coalescing tail pages. >>>>> >>>>> >>>>> I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be >>>>> flawed (we can -- in theory -- coalesc different folio page ranges in >>>>> a GUP result?). >>>>> >>>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up >>>>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first >>>>> place. 
>>>>> >>>>> Can you look into that, as you are more familiar with the logic? >>>> >>>> Leaving this all quoted and adding Pavel, who wrote that code. I'm >>>> currently away, so can't look into this right now. >> >> Chenliang Li did, but not like it matters >> >>> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data(). >>> >>> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always >>> run into the >>> >>> if (page_folio(page_array[i]) == folio && >>> page_array[i] == page_array[i-1] + 1) { >>> count++; >>> continue; >>> } >>> >>> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong. >>> >>> In general, storing head pages when they are not the first page to be coalesced seems wrong. >> >> Yes, it stores the head page even if the range passed to >> pin_user_pages() doesn't cover the head page. > > > It should be converted to unpin_user_folio(), which doesn't seem >> to do sanity_check_pinned_pages(). Do you think that'll be enough >> (conceptually)? Nobody is actually touching the head page in those >> cases apart from the final unpin, and storing the head page is >> more convenient than keeping folios. I'll take a look if it can >> be fully converted to folios w/o extra overhead. > > Assuming we had from GUP > > nr_pages = 2 > pages[0] = folio_page(folio, 1) > pages[1] = folio_page(folio, 2) > > After io_coalesce_buffer() we have > > nr_pages = 1 > pages[0] = folio_page(folio, 0) > > > Using unpin_user_folio() in all places where we could see something like > that would be the right thing to do. The sanity checks are not in > unpin_user_folio() for exactly that reason: we don't know which folio > pages we pinned. > > But now I wonder where you make sure that "Nobody is actually touching > the head page"? 
> > How do you get back the "which folio range" information after > io_coalesce_buffer() ? > > > If you rely on alignment in virtual address space for you, combined with > imu->folio_shift, that might not work reliably ... FWIW, applying the following on top of origin/master: diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index dbbcc5eb3dce5..e62a284dcf906 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize) log_test_result(KSFT_FAIL); goto munmap; } + mem = mremap_mem; size = mremap_size; break; case THP_RUN_PARTIAL_SHARED: and then running the selftest, something is not happy: ... # [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB) [34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI [34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1 [34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023 [34272.021984] RIP: 0010:memcpy+0xc/0x20 [34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 [34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206 [34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff [34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000 [34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000 [34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff [34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000 [34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000 [34272.021999] CS: 
0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0 [34272.022001] PKRU: 55555554 [34272.022002] Call Trace: [34272.022004] <TASK> [34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0 [34272.022009] ? simple_xattr_get+0x59/0xa0 [34272.022012] generic_perform_write+0x86/0x2e0 [34272.022016] shmem_file_write_iter+0x86/0x90 [34272.022019] io_write+0xe4/0x390 [34272.022023] io_issue_sqe+0x65/0x4f0 [34272.022024] ? lock_timer_base+0x7d/0xc0 [34272.022027] io_wq_submit_work+0xb8/0x320 [34272.022029] io_worker_handle_work+0xd5/0x300 [34272.022032] io_wq_worker+0xda/0x300 [34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0 [34272.022037] ? __pfx_io_wq_worker+0x10/0x10 [34272.022039] ret_from_fork+0x34/0x50 [34272.022042] ? __pfx_io_wq_worker+0x10/0x10 [34272.022044] ret_from_fork_asm+0x1a/0x30 [34272.022047] </TASK> There, we essentially mremap a THP to not be aligned in VA space, and then register half the THP as a fixed buffer. So ... my suspicion that this is all rather broken grows :) -- Cheers, David / dhildenb ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages 2025-06-23 17:36 ` David Hildenbrand @ 2025-06-23 18:02 ` Pavel Begunkov 0 siblings, 0 replies; 17+ messages in thread From: Pavel Begunkov @ 2025-06-23 18:02 UTC (permalink / raw) To: David Hildenbrand, Jens Axboe, Alexander Potapenko Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel, linux-mm, peterx, syzkaller-bugs On 6/23/25 18:36, David Hildenbrand wrote: > On 23.06.25 18:59, David Hildenbrand wrote: >> On 23.06.25 18:48, Pavel Begunkov wrote: >>> On 6/23/25 16:11, David Hildenbrand wrote: ...>>> Yes, it stores the head page even if the range passed to >>> pin_user_pages() doesn't cover the head page. >> > > It should be converted to unpin_user_folio(), which doesn't seem >>> to do sanity_check_pinned_pages(). Do you think that'll be enough >>> (conceptually)? Nobody is actually touching the head page in those >>> cases apart from the final unpin, and storing the head page is >>> more convenient than keeping folios. I'll take a look if it can >>> be fully converted to folios w/o extra overhead. >> >> Assuming we had from GUP >> >> nr_pages = 2 >> pages[0] = folio_page(folio, 1) >> pages[1] = folio_page(folio, 2) >> >> After io_coalesce_buffer() we have >> >> nr_pages = 1 >> pages[0] = folio_page(folio, 0) >> >> >> Using unpin_user_folio() in all places where we could see something like >> that would be the right thing to do. The sanity checks are not in >> unpin_user_folio() for exactly that reason: we don't know which folio >> pages we pinned. Let's do that for starters >> But now I wonder where you make sure that "Nobody is actually touching >> the head page"? >> >> How do you get back the "which folio range" information after >> io_coalesce_buffer() ? >> >> >> If you rely on alignment in virtual address space for you, combined with >> imu->folio_shift, that might not work reliably ... 
> > FWIW, applying the following on top of origin/master: > > diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c > index dbbcc5eb3dce5..e62a284dcf906 100644 > --- a/tools/testing/selftests/mm/cow.c > +++ b/tools/testing/selftests/mm/cow.c > @@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize) > log_test_result(KSFT_FAIL); > goto munmap; > } > + mem = mremap_mem; > size = mremap_size; > break; > case THP_RUN_PARTIAL_SHARED: > > > and then running the selftest, something is not happy: > > ... > # [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB) > [34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI > [34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1 > [34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023 > [34272.021984] RIP: 0010:memcpy+0xc/0x20 > [34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 > [34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206 > [34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff > [34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000 > [34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000 > [34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff > [34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000 > [34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000 > [34272.021999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0 > [34272.022001] PKRU: 
55555554 > [34272.022002] Call Trace: > [34272.022004] <TASK> > [34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0 > [34272.022009] ? simple_xattr_get+0x59/0xa0 > [34272.022012] generic_perform_write+0x86/0x2e0 > [34272.022016] shmem_file_write_iter+0x86/0x90 > [34272.022019] io_write+0xe4/0x390 > [34272.022023] io_issue_sqe+0x65/0x4f0 > [34272.022024] ? lock_timer_base+0x7d/0xc0 > [34272.022027] io_wq_submit_work+0xb8/0x320 > [34272.022029] io_worker_handle_work+0xd5/0x300 > [34272.022032] io_wq_worker+0xda/0x300 > [34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0 > [34272.022037] ? __pfx_io_wq_worker+0x10/0x10 > [34272.022039] ret_from_fork+0x34/0x50 > [34272.022042] ? __pfx_io_wq_worker+0x10/0x10 > [34272.022044] ret_from_fork_asm+0x1a/0x30 > [34272.022047] </TASK> > > > There, we essentially mremap a THP to not be aligned in VA space, and then register half the > THP as a fixed buffer. > > So ... my suspicion that this is all rather broken grows :) It's supposed to calculate the offset from a user pointer and then work with that, but I guess there are masking that violate it, I'll check. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread