* [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
@ 2025-06-03 15:31 syzbot
2025-06-03 16:22 ` David Hildenbrand
2025-06-21 21:52 ` syzbot
0 siblings, 2 replies; 17+ messages in thread
From: syzbot @ 2025-06-03 15:31 UTC (permalink / raw)
To: akpm, david, jgg, jhubbard, linux-kernel, linux-mm, peterx,
syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: d7fa1af5b33e Merge branch 'for-next/core' into for-kernelci
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
console output: https://syzkaller.appspot.com/x/log.txt?x=1457d80c580000
kernel config: https://syzkaller.appspot.com/x/.config?x=89c13de706fbf07a
dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
userspace arch: arm64
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/da97ad659b2c/disk-d7fa1af5.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/659e123552a8/vmlinux-d7fa1af5.xz
kernel image: https://storage.googleapis.com/syzbot-assets/6ec5dbf4643e/Image-d7fa1af5.gz.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com
head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200
page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page))
------------[ cut here ]------------
kernel BUG at mm/gup.c:70!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 1 UID: 0 PID: 115 Comm: kworker/u8:4 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Workqueue: iou_exit io_ring_exit_work
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69
lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69
sp : ffff800097f17640
x29: ffff800097f17660 x28: dfff800000000000 x27: 1fffffbff87da000
x26: 05ffc0000002107c x25: 05ffc0000002107c x24: fffffdffc3ed0000
x23: fffffdffc3ed0000 x22: ffff800097f176e0 x21: 05ffc0000002107c
x20: 0000000000000000 x19: ffff800097f176e0 x18: 1fffe0003386f276
x17: 703e2d6f696c6f66 x16: ffff80008adbe9e4 x15: 0000000000000001
x14: 1fffe0003386f2e2 x13: 0000000000000000 x12: 0000000000000000
x11: ffff60003386f2e3 x10: 0000000000ff0100 x9 : c8ccd30be98f3f00
x8 : c8ccd30be98f3f00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff800097f16d58 x4 : ffff80008f415ba0 x3 : ffff8000807b4b68
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061
Call trace:
sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P)
unpin_user_page+0x80/0x10c mm/gup.c:191
io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113
io_buffer_unmap io_uring/rsrc.c:140 [inline]
io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:513
io_put_rsrc_node io_uring/rsrc.h:103 [inline]
io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:197
io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:607
io_ring_ctx_free+0x48/0x430 io_uring/io_uring.c:2723
io_ring_exit_work+0x6c4/0x73c io_uring/io_uring.c:2962
process_one_work+0x7e8/0x156c kernel/workqueue.c:3238
process_scheduled_works kernel/workqueue.c:3319 [inline]
worker_thread+0x958/0xed8 kernel/workqueue.c:3400
kthread+0x5fc/0x75c kernel/kthread.c:464
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847
Code: 900523a1 910e0021 aa1703e0 97fff8a9 (d4210000)
---[ end trace 0000000000000000 ]---
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite the report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-03 15:31 [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages syzbot
@ 2025-06-03 16:22 ` David Hildenbrand
2025-06-03 17:20 ` Jens Axboe
2025-06-21 21:52 ` syzbot
1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-03 16:22 UTC (permalink / raw)
To: syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm, peterx,
syzkaller-bugs, Jens Axboe
On 03.06.25 17:31, syzbot wrote:
> [...]
> page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page))
> kernel BUG at mm/gup.c:70!
> [...]
> Call trace:
> sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:69 (P)
> unpin_user_page+0x80/0x10c mm/gup.c:191
> io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:113
> [...]
So we lost a PageAnonExclusive (PAE) bit for a pinned folio.
[ 97.640225][ T115] page: refcount:512 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x13b400
[ 97.640378][ T115] head: order:9 mapcount:511 entire_mapcount:0 nr_pages_mapped:511 pincount:1
The folio is indeed pinned, and it is PTE-mapped (511 PTEs are mapped).
The page we are using for unpinning is not mapped (mapcount:0).
pfn:0x13b400 indicates that the page we are provided is actually the head page (folio->page).
[ 97.640414][ T115] memcg:ffff0000f36b6000
[ 97.640435][ T115] anon flags: 0x5ffc0000002107c(referenced|uptodate|dirty|lru|arch_1|head|swapbacked|node=0|zone=2|lastcpupid=0x7ff)
[ 97.640468][ T115] raw: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1
[ 97.640490][ T115] raw: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000
[ 97.640514][ T115] head: 05ffc0000002107c fffffdffc37be1c8 fffffdffc3d75f08 ffff0000d50c0ee1
[ 97.640536][ T115] head: 0000000000020000 0000000000000000 00000200ffffffff ffff0000f36b6000
[ 97.640559][ T115] head: 05ffc00000010a09 fffffdffc3ed0001 000001ff000001fe 00000001ffffffff
[ 97.640581][ T115] head: ffffffff000001fe 0000000000000028 0000000000000000 0000000000000200
[ 97.640600][ T115] page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page))
So we effectively only test the head page. Here we don't have the bit set for that page.
In gup_fast() we perform a similar sanity check, which didn't trigger at the time we pinned the folio.
io_uring ends up calling io_pin_pages(), which calls pin_user_pages_fast(), so the GUP-fast path
might indeed have been taken here.
What could trigger this (in weird scenarios, though) is if we used pin_user_page() to obtain a
page, then did folio = page_folio(page) and called unpin_user_page(&folio->page) instead of
using unpin_folio(). Or using any other page that we didn't pin. It would be a corner case, though.
Staring at io_release_ubuf(), that's also not immediately what's happening.
There is this coalescing code in io_sqe_buffer_register()->io_check_coalesce_buffer(),
maybe ... something is going wrong there?
Otherwise, I could only envision (a) some random memory overwrite clearing the bit or (b) some
weird race between GUP-fast and PAE clearing that we didn't run into so far. But these sanity
checks have been around for a loooong time at this point.
Unfortunately, no reproducer :(
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-03 16:22 ` David Hildenbrand
@ 2025-06-03 17:20 ` Jens Axboe
2025-06-03 17:25 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-03 17:20 UTC (permalink / raw)
To: David Hildenbrand, syzbot, akpm, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 6/3/25 10:22 AM, David Hildenbrand wrote:
> [...]
> Otherwise, I could only envision (a) some random memory overwrite
> clearing the bit or (b) some weird race between GUP-fast and PAE
> clearing that we didn't run into so far. But these sanity checks have
> been around for a loooong time at this point.
>
> Unfortunately, no reproducer :(
Too bad there's no reproducer... Since this looks recent, I'd suspect
the recent changes there. Most notably:
commit f446c6311e86618a1f81eb576b56a6266307238f
Author: Jens Axboe <axboe@kernel.dk>
Date: Mon May 12 09:06:06 2025 -0600
io_uring/memmap: don't use page_address() on a highmem page
which seems a bit odd, as this is arm64 and there'd be no highmem. This
went into the 6.15 kernel release. Let's hope a reproducer is
forthcoming.
--
Jens Axboe
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-03 17:20 ` Jens Axboe
@ 2025-06-03 17:25 ` David Hildenbrand
2025-06-03 17:36 ` Jens Axboe
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-03 17:25 UTC (permalink / raw)
To: Jens Axboe, syzbot, akpm, jgg, jhubbard, linux-kernel, linux-mm,
peterx, syzkaller-bugs, Catalin Marinas
On 03.06.25 19:20, Jens Axboe wrote:
> [...]
>
> Too bad there's no reproducer... Since this looks recent, I'd suspect
> the recent changes there. Most notably:
>
> commit f446c6311e86618a1f81eb576b56a6266307238f
> Author: Jens Axboe <axboe@kernel.dk>
> Date: Mon May 12 09:06:06 2025 -0600
>
> io_uring/memmap: don't use page_address() on a highmem page
>
> which seems a bit odd, as this is arm64 and there'd be no highmem. This
> went into the 6.15 kernel release. Let's hope a reproducer is
> forthcoming.
Yeah, that does not really look problematic.
Interestingly, this was found in
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
Hm.
Let me dig a bit, but if it's some corner-case race, it's weird that we didn't find it earlier.
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-03 17:25 ` David Hildenbrand
@ 2025-06-03 17:36 ` Jens Axboe
0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2025-06-03 17:36 UTC (permalink / raw)
To: David Hildenbrand, syzbot, akpm, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs, Catalin Marinas
On 6/3/25 11:25 AM, David Hildenbrand wrote:
> [...]
>
> Yeah, that does not really look problematic.
>
> Interestingly, this was found in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>
> Hm.
Yep, pulled that into 6.15 as released, and got a few mm/ changes in there.
So perhaps related?
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-03 15:31 [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages syzbot
2025-06-03 16:22 ` David Hildenbrand
@ 2025-06-21 21:52 ` syzbot
2025-06-23 9:29 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: syzbot @ 2025-06-21 21:52 UTC (permalink / raw)
To: akpm, axboe, catalin.marinas, david, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
syzbot has found a reproducer for the following issue on:
HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
userspace arch: arm64
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/974f3ac1c6a5/disk-9aa9b43d.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/a5b5075d317f/vmlinux-9aa9b43d.xz
kernel image: https://storage.googleapis.com/syzbot-assets/2f0ba7fec19b/Image-9aa9b43d.gz.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/76067befefec/mount_4.gz
fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=1549f6bc580000)
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000200
page dumped because: VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) && !PageAnonExclusive(page))
------------[ cut here ]------------
kernel BUG at mm/gup.c:71!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 1 UID: 0 PID: 2171 Comm: kworker/u8:9 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Workqueue: iou_exit io_ring_exit_work
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70
lr : sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70
sp : ffff8000a03a7640
x29: ffff8000a03a7660 x28: dfff800000000000 x27: 1fffffbff8723000
x26: 05ffc00000020178 x25: 05ffc00000020178 x24: fffffdffc3918000
x23: fffffdffc3918000 x22: ffff8000a03a76e0 x21: 05ffc00000020178
x20: 0000000000000000 x19: ffff8000a03a76e0 x18: 00000000ffffffff
x17: 703e2d6f696c6f66 x16: ffff80008aecb65c x15: 0000000000000001
x14: 1fffe000337e14e2 x13: 0000000000000000 x12: 0000000000000000
x11: ffff6000337e14e3 x10: 0000000000ff0100 x9 : cc07ffb5a919f400
x8 : cc07ffb5a919f400 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000a03a6d58 x4 : ffff80008f727060 x3 : ffff8000807bef2c
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000061
Call trace:
sanity_check_pinned_pages+0x7cc/0x7d0 mm/gup.c:70 (P)
unpin_user_page+0x80/0x10c mm/gup.c:192
io_release_ubuf+0x84/0xf8 io_uring/rsrc.c:116
io_buffer_unmap io_uring/rsrc.c:143 [inline]
io_free_rsrc_node+0x250/0x57c io_uring/rsrc.c:516
io_put_rsrc_node io_uring/rsrc.h:103 [inline]
io_rsrc_data_free+0x148/0x298 io_uring/rsrc.c:200
io_sqe_buffers_unregister+0x84/0xa0 io_uring/rsrc.c:610
io_ring_ctx_free+0x48/0x480 io_uring/io_uring.c:2729
io_ring_exit_work+0x764/0x7d8 io_uring/io_uring.c:2971
process_one_work+0x7e8/0x155c kernel/workqueue.c:3238
process_scheduled_works kernel/workqueue.c:3321 [inline]
worker_thread+0x958/0xed8 kernel/workqueue.c:3402
kthread+0x5fc/0x75c kernel/kthread.c:464
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:847
Code: b0052bc1 91008021 aa1703e0 97fff8ab (d4210000)
---[ end trace 0000000000000000 ]---
---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-21 21:52 ` syzbot
@ 2025-06-23 9:29 ` David Hildenbrand
2025-06-23 9:53 ` Alexander Potapenko
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 9:29 UTC (permalink / raw)
To: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 21.06.25 23:52, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
> userspace arch: arm64
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
There is not that much magic in there, I'm afraid.
fork() is only used to spin up guests, but before the memory region of
interest is actually allocated, IIUC. No threading code that races.
IIUC, it triggers fairly fast on aarch64. I've left it running for a
while on x86_64 without any luck.
So maybe this is really some aarch64-special stuff (pointer tagging?).
In particular, there is something very weird in the reproducer:
syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
/*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
advice is supposed to be a 32-bit int. What does the magical
"0x800000000" do?
Let me try my luck reproducing it on arm.
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 9:29 ` David Hildenbrand
@ 2025-06-23 9:53 ` Alexander Potapenko
2025-06-23 10:10 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Alexander Potapenko @ 2025-06-23 9:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>
> On 21.06.25 23:52, syzbot wrote:
> > syzbot has found a reproducer for the following issue on:
> >
> > HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
> > git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
> > dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
> > compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
> > userspace arch: arm64
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>
> There is not that much magic in there, I'm afraid.
>
> fork() is only used to spin up guests, but before the memory region of
> interest is actually allocated, IIUC. No threading code that races.
>
> IIUC, it triggers fairly fast on aarch64. I've left it running for a
> while on x86_64 without any luck.
>
> So maybe this is really some aarch64-special stuff (pointer tagging?).
>
> In particular, there is something very weird in the reproducer:
>
> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>
> advise is supposed to be a 32bit int. What does the magical
> "0x800000000" do?
I am pretty sure this is a red herring.
Syzkaller sometimes mutates integer flags, even if the result makes no
sense - because sometimes it can trigger interesting bugs.
This `advice` argument will be discarded by is_valid_madvise(),
resulting in -EINVAL.
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 9:53 ` Alexander Potapenko
@ 2025-06-23 10:10 ` David Hildenbrand
2025-06-23 12:22 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 10:10 UTC (permalink / raw)
To: Alexander Potapenko
Cc: syzbot, akpm, axboe, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 23.06.25 11:53, Alexander Potapenko wrote:
> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>
>> On 21.06.25 23:52, syzbot wrote:
>>> syzbot has found a reproducer for the following issue on:
>>>
>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>> userspace arch: arm64
>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>
>> There is not that much magic in there, I'm afraid.
>>
>> fork() is only used to spin up guests, but before the memory region of
>> interest is actually allocated, IIUC. No threading code that races.
>>
>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>> while on x86_64 without any luck.
>>
>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>
>> In particular, there is something very weird in the reproducer:
>>
>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>
>> advise is supposed to be a 32bit int. What does the magical
>> "0x800000000" do?
>
> I am pretty sure this is a red herring.
> Syzkaller sometimes mutates integer flags, even if the result makes no
> sense - because sometimes it can trigger interesting bugs.
> This `advice` argument will be discarded by is_valid_madvise(),
> resulting in -EINVAL.
I thought the same, but likely the upper bits are discarded, and we end
up with __NR_madvise succeeding.
The kernel config has
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
So likely this is really just like dropping the "0x800000000".
Anyhow, I managed to reproduce in the VM using the provided rootfs on
aarch64. It triggers immediately, so no races involved.
Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
does not trigger.
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 10:10 ` David Hildenbrand
@ 2025-06-23 12:22 ` David Hildenbrand
2025-06-23 12:47 ` David Hildenbrand
2025-06-23 14:58 ` Jens Axboe
0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 12:22 UTC (permalink / raw)
To: Alexander Potapenko, axboe
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 23.06.25 12:10, David Hildenbrand wrote:
> On 23.06.25 11:53, Alexander Potapenko wrote:
>> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>
>>> On 21.06.25 23:52, syzbot wrote:
>>>> syzbot has found a reproducer for the following issue on:
>>>>
>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>>> userspace arch: arm64
>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>>
>>> There is not that much magic in there, I'm afraid.
>>>
>>> fork() is only used to spin up guests, but before the memory region of
>>> interest is actually allocated, IIUC. No threading code that races.
>>>
>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>>> while on x86_64 without any luck.
>>>
>>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>>
>>> In particular, there is something very weird in the reproducer:
>>>
>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>>
>>> advise is supposed to be a 32bit int. What does the magical
>>> "0x800000000" do?
>>
>> I am pretty sure this is a red herring.
>> Syzkaller sometimes mutates integer flags, even if the result makes no
>> sense - because sometimes it can trigger interesting bugs.
>> This `advice` argument will be discarded by is_valid_madvise(),
>> resulting in -EINVAL.
>
> I thought the same, but likely the upper bits are discarded, and we end
> up with __NR_madvise succeeding.
>
> The kernel config has
>
> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
>
> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
>
> So likely this is really just like dropping the "0x800000000"
>
> Anyhow, I managed to reproduce in the VM using the provided rootfs on
> aarch64. It triggers immediately, so no races involved.
>
> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
> does not trigger.
Simplified reproducer that does not depend on a race with the
child process.
As expected previously, we have PAE cleared on the head page,
because it is/was COW-shared with a child process.
We are registering more than one consecutive tail page of that
THP through io_uring, GUP-pinning them. These pages are not
COW-shared and, therefore, still have PAE set.
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <liburing.h>

int main(void)
{
        struct io_uring_params params = {
                .wq_fd = -1,
        };
        struct iovec iovec;
        const size_t pagesize = getpagesize();
        size_t size = 2048 * pagesize;
        char *addr;
        int fd;

        /* We need a THP-aligned area. */
        addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
                    MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        if (addr == MAP_FAILED) {
                perror("MAP_FIXED failed");
                return 1;
        }

        if (madvise(addr, size, MADV_HUGEPAGE)) {
                perror("MADV_HUGEPAGE failed");
                return 1;
        }

        /* Populate a THP. */
        memset(addr, 0, size);

        /* COW-share only the first page ... */
        if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
                perror("MADV_DONTFORK failed");
                return 1;
        }

        /* ... using fork(). This will clear PAE on the head page. */
        if (fork() == 0)
                exit(0);

        /* Setup io_uring */
        fd = syscall(__NR_io_uring_setup, 1024, &params);
        if (fd < 0) {
                perror("__NR_io_uring_setup failed");
                return 1;
        }

        /* Register (GUP-pin) two consecutive tail pages. */
        iovec.iov_base = addr + pagesize;
        iovec.iov_len = 2 * pagesize;
        syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
        return 0;
}
[ 108.070381][ T14] kernel BUG at mm/gup.c:71!
[ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 108.117202][ T14] Modules linked in:
[ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
[ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
[ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
[ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
[ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
[ 108.138025][ T14] sp : ffff800097ac7640
[ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
[ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
[ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
[ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
[ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
[ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
[ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
[ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
[ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
[ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
[ 108.174205][ T14] Call trace:
[ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
[ 108.178138][ T14] unpin_user_page+0x80/0x10c
[ 108.180189][ T14] io_release_ubuf+0x84/0xf8
[ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
[ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
[ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
[ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
[ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
[ 108.193207][ T14] process_one_work+0x7e8/0x155c
[ 108.195431][ T14] worker_thread+0x958/0xed8
[ 108.197561][ T14] kthread+0x5fc/0x75c
[ 108.199362][ T14] ret_from_fork+0x10/0x20
When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
So, we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
(IOW, one we never pinned).
So it's related to the io_coalesce_buffer() machinery.
And in fact, in there, we have this weird logic:
/* Store head pages only*/
new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
...
Essentially discarding the subpage information when coalescing tail pages.
I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
flawed (we can -- in theory -- coalesce different folio page ranges in
a GUP result?).
@Jens, not sure whether this only triggers a warning when unpinning, or whether we actually
mess up imu->bvec[i].bv_page so that it ends up pointing at (reading/writing) pages we didn't
even pin in the first place.
Can you look into that, as you are more familiar with the logic?
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 12:22 ` David Hildenbrand
@ 2025-06-23 12:47 ` David Hildenbrand
2025-06-23 14:58 ` Jens Axboe
1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 12:47 UTC (permalink / raw)
To: Alexander Potapenko, axboe
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 23.06.25 14:22, David Hildenbrand wrote:
> On 23.06.25 12:10, David Hildenbrand wrote:
>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>
>>>> On 21.06.25 23:52, syzbot wrote:
>>>>> syzbot has found a reproducer for the following issue on:
>>>>>
>>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>>>> userspace arch: arm64
>>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>>>
>>>> There is not that much magic in there, I'm afraid.
>>>>
>>>> fork() is only used to spin up guests, but before the memory region of
>>>> interest is actually allocated, IIUC. No threading code that races.
>>>>
>>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>>>> while on x86_64 without any luck.
>>>>
>>>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>>>
>>>> In particular, there is something very weird in the reproducer:
>>>>
>>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>>>
>>>> advise is supposed to be a 32bit int. What does the magical
>>>> "0x800000000" do?
>>>
>>> I am pretty sure this is a red herring.
>>> Syzkaller sometimes mutates integer flags, even if the result makes no
>>> sense - because sometimes it can trigger interesting bugs.
>>> This `advice` argument will be discarded by is_valid_madvise(),
>>> resulting in -EINVAL.
>>
>> I thought the same, but likely the upper bits are discarded, and we end
>> up with __NR_madvise succeeding.
>>
>> The kernel config has
>>
>> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
>>
>> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
>>
>> So likely this is really just like dropping the "0x800000000"
>>
>> Anyhow, I managed to reproduce in the VM using the provided rootfs on
>> aarch64. It triggers immediately, so no races involved.
>>
>> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
>> does not trigger.
>
> Simplified reproducer that does not depend on a race with the
> child process.
>
> As expected previously, we have PAE cleared on the head page,
> because it is/was COW-shared with a child process.
>
> We are registering more than one consecutive tail pages of that
> THP through iouring, GUP-pinning them. These pages are not
> COW-shared and, therefore, do not have PAE set.
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <liburing.h>
>
> int main(void)
> {
> struct io_uring_params params = {
> .wq_fd = -1,
> };
> struct iovec iovec;
> const size_t pagesize = getpagesize();
> size_t size = 2048 * pagesize;
> char *addr;
> int fd;
>
> /* We need a THP-aligned area. */
> addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
> MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> if (addr == MAP_FAILED) {
> perror("MAP_FIXED failed\n");
> return 1;
> }
>
> if (madvise(addr, size, MADV_HUGEPAGE)) {
> perror("MADV_HUGEPAGE failed\n");
> return 1;
> }
>
> /* Populate a THP. */
> memset(addr, 0, size);
>
> /* COW-share only the first page ... */
> if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
> perror("MADV_DONTFORK failed\n");
> return 1;
> }
>
> /* ... using fork(). This will clear PAE on the head page. */
> if (fork() == 0)
> exit(0);
>
> /* Setup iouring */
> fd = syscall(__NR_io_uring_setup, 1024, &params);
> if (fd < 0) {
> perror("__NR_io_uring_setup failed\n");
> return 1;
> }
>
> /* Register (GUP-pin) two consecutive tail pages. */
> iovec.iov_base = addr + pagesize;
> iovec.iov_len = 2 * pagesize;
> syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
> return 0;
> }
>
> [ 108.070381][ T14] kernel BUG at mm/gup.c:71!
> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [ 108.117202][ T14] Modules linked in:
> [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
> [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
> [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
> [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
> [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
> [ 108.138025][ T14] sp : ffff800097ac7640
> [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
> [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
> [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
> [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
> [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
> [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
> [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
> [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
> [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
> [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
> [ 108.174205][ T14] Call trace:
> [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
> [ 108.178138][ T14] unpin_user_page+0x80/0x10c
> [ 108.180189][ T14] io_release_ubuf+0x84/0xf8
> [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
> [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
> [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
> [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
> [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
> [ 108.193207][ T14] process_one_work+0x7e8/0x155c
> [ 108.195431][ T14] worker_thread+0x958/0xed8
> [ 108.197561][ T14] kthread+0x5fc/0x75c
> [ 108.199362][ T14] ret_from_fork+0x10/0x20
FWIW, a slight cow.c selftest modification can trigger the same:
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 4214070d03ce..50c538b47bb4 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -991,6 +991,8 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 			log_test_result(KSFT_FAIL);
 			goto munmap;
 		}
+		mem += pagesize;
+		size -= pagesize;
 		break;
 	default:
 		assert(false);
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 12:22 ` David Hildenbrand
2025-06-23 12:47 ` David Hildenbrand
@ 2025-06-23 14:58 ` Jens Axboe
2025-06-23 15:11 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-23 14:58 UTC (permalink / raw)
To: David Hildenbrand, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs, Pavel Begunkov
On 6/23/25 6:22 AM, David Hildenbrand wrote:
> On 23.06.25 12:10, David Hildenbrand wrote:
>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via
>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>
>>>> On 21.06.25 23:52, syzbot wrote:
>>>>> syzbot has found a reproducer for the following issue on:
>>>>>
>>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>>>> userspace arch: arm64
>>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>>>
>>>> There is not that much magic in there, I'm afraid.
>>>>
>>>> fork() is only used to spin up guests, but before the memory region of
>>>> interest is actually allocated, IIUC. No threading code that races.
>>>>
>>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>>>> while on x86_64 without any luck.
>>>>
>>>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>>>
>>>> In particular, there is something very weird in the reproducer:
>>>>
>>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>>>
>>>> advise is supposed to be a 32bit int. What does the magical
>>>> "0x800000000" do?
>>>
>>> I am pretty sure this is a red herring.
>>> Syzkaller sometimes mutates integer flags, even if the result makes no
>>> sense - because sometimes it can trigger interesting bugs.
>>> This `advice` argument will be discarded by is_valid_madvise(),
>>> resulting in -EINVAL.
>>
>> I thought the same, but likely the upper bits are discarded, and we end
>> up with __NR_madvise succeeding.
>>
>> The kernel config has
>>
>> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
>>
>> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
>>
>> So likely this is really just like dropping the "0x800000000"
>>
>> Anyhow, I managed to reproduce in the VM using the provided rootfs on
>> aarch64. It triggers immediately, so no races involved.
>>
>> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
>> does not trigger.
>
> Simplified reproducer that does not depend on a race with the
> child process.
>
> As expected previously, we have PAE cleared on the head page,
> because it is/was COW-shared with a child process.
>
> We are registering more than one consecutive tail pages of that
> THP through iouring, GUP-pinning them. These pages are not
> COW-shared and, therefore, do not have PAE set.
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <liburing.h>
>
> int main(void)
> {
> struct io_uring_params params = {
> .wq_fd = -1,
> };
> struct iovec iovec;
> const size_t pagesize = getpagesize();
> size_t size = 2048 * pagesize;
> char *addr;
> int fd;
>
> /* We need a THP-aligned area. */
> addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
> MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> if (addr == MAP_FAILED) {
> perror("MAP_FIXED failed\n");
> return 1;
> }
>
> if (madvise(addr, size, MADV_HUGEPAGE)) {
> perror("MADV_HUGEPAGE failed\n");
> return 1;
> }
>
> /* Populate a THP. */
> memset(addr, 0, size);
>
> /* COW-share only the first page ... */
> if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
> perror("MADV_DONTFORK failed");
> return 1;
> }
>
> /* ... using fork(). This will clear PAE on the head page. */
> if (fork() == 0)
> exit(0);
>
> /* Setup iouring */
> fd = syscall(__NR_io_uring_setup, 1024, &params);
> if (fd < 0) {
> perror("__NR_io_uring_setup failed");
> return 1;
> }
>
> /* Register (GUP-pin) two consecutive tail pages. */
> iovec.iov_base = addr + pagesize;
> iovec.iov_len = 2 * pagesize;
> syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
> return 0;
> }
>
> [ 108.070381][ T14] kernel BUG at mm/gup.c:71!
> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [ 108.117202][ T14] Modules linked in:
> [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
> [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
> [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
> [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
> [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
> [ 108.138025][ T14] sp : ffff800097ac7640
> [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
> [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
> [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
> [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
> [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
> [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
> [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
> [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
> [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
> [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
> [ 108.174205][ T14] Call trace:
> [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
> [ 108.178138][ T14] unpin_user_page+0x80/0x10c
> [ 108.180189][ T14] io_release_ubuf+0x84/0xf8
> [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
> [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
> [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
> [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
> [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
> [ 108.193207][ T14] process_one_work+0x7e8/0x155c
> [ 108.195431][ T14] worker_thread+0x958/0xed8
> [ 108.197561][ T14] kthread+0x5fc/0x75c
> [ 108.199362][ T14] ret_from_fork+0x10/0x20
>
>
> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>
> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
> (IOW, one we never pinned).
>
> So it's related to the io_coalesce_buffer() machinery.
>
> And in fact, in there, we have this weird logic:
>
> /* Store head pages only*/
> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
> ...
>
>
> Essentially discarding the subpage information when coalescing tail pages.
>
>
> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
> flawed (we can -- in theory -- coalesce different folio page ranges in
> a GUP result?).
>
> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first
> place.
>
> Can you look into that, as you are more familiar with the logic?
Leaving this all quoted and adding Pavel, who wrote that code. I'm
currently away, so can't look into this right now.
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 14:58 ` Jens Axboe
@ 2025-06-23 15:11 ` David Hildenbrand
2025-06-23 16:48 ` Pavel Begunkov
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 15:11 UTC (permalink / raw)
To: Jens Axboe, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs, Pavel Begunkov
On 23.06.25 16:58, Jens Axboe wrote:
> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>> On 23.06.25 12:10, David Hildenbrand wrote:
>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via
>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>>
>>>>> On 21.06.25 23:52, syzbot wrote:
>>>>>> syzbot has found a reproducer for the following issue on:
>>>>>>
>>>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>>>>> userspace arch: arm64
>>>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>>>>
>>>>> There is not that much magic in there, I'm afraid.
>>>>>
>>>>> fork() is only used to spin up guests, but before the memory region of
>>>>> interest is actually allocated, IIUC. No threading code that races.
>>>>>
>>>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>>>>> while on x86_64 without any luck.
>>>>>
>>>>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>>>>
>>>>> In particular, there is something very weird in the reproducer:
>>>>>
>>>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>>>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>>>>
>>>>> advice is supposed to be a 32-bit int. What does the magical
>>>>> "0x800000000" do?
>>>>
>>>> I am pretty sure this is a red herring.
>>>> Syzkaller sometimes mutates integer flags, even if the result makes no
>>>> sense - because sometimes it can trigger interesting bugs.
>>>> This `advice` argument will be rejected by is_valid_madvise(),
>>>> resulting in -EINVAL.
>>>
>>> I thought the same, but likely the upper bits are discarded, and we end
>>> up with __NR_madvise succeeding.
>>>
>>> The kernel config has
>>>
>>> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
>>>
>>> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
>>>
>>> So likely this really behaves just as if the "0x800000000" had been dropped.
>>>
>>> Anyhow, I managed to reproduce in the VM using the provided rootfs on
>>> aarch64. It triggers immediately, so no races involved.
>>>
>>> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
>>> does not trigger.
>>
>> Simplified reproducer that does not depend on a race with the
>> child process.
>>
>> As expected previously, we have PAE cleared on the head page,
>> because it is/was COW-shared with a child process.
>>
>> We are registering multiple consecutive tail pages of that
>> THP through io_uring, GUP-pinning them. These pages are not
>> COW-shared and, therefore, do not have PAE set.
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <sys/ioctl.h>
>> #include <sys/mman.h>
>> #include <sys/syscall.h>
>> #include <sys/types.h>
>> #include <liburing.h>
>>
>> int main(void)
>> {
>> struct io_uring_params params = {
>> .wq_fd = -1,
>> };
>> struct iovec iovec;
>> const size_t pagesize = getpagesize();
>> size_t size = 2048 * pagesize;
>> char *addr;
>> int fd;
>>
>> /* We need a THP-aligned area. */
>> addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
>> MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>> if (addr == MAP_FAILED) {
>> perror("MAP_FIXED failed");
>> return 1;
>> }
>>
>> if (madvise(addr, size, MADV_HUGEPAGE)) {
>> perror("MADV_HUGEPAGE failed");
>> return 1;
>> }
>>
>> /* Populate a THP. */
>> memset(addr, 0, size);
>>
>> /* COW-share only the first page ... */
>> if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
>> perror("MADV_DONTFORK failed");
>> return 1;
>> }
>>
>> /* ... using fork(). This will clear PAE on the head page. */
>> if (fork() == 0)
>> exit(0);
>>
>> /* Setup iouring */
>> fd = syscall(__NR_io_uring_setup, 1024, &params);
>> if (fd < 0) {
>> perror("__NR_io_uring_setup failed");
>> return 1;
>> }
>>
>> /* Register (GUP-pin) two consecutive tail pages. */
>> iovec.iov_base = addr + pagesize;
>> iovec.iov_len = 2 * pagesize;
>> syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
>> return 0;
>> }
>>
>> [ 108.070381][ T14] kernel BUG at mm/gup.c:71!
>> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>> [ 108.117202][ T14] Modules linked in:
>> [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
>> [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
>> [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
>> [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
>> [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
>> [ 108.138025][ T14] sp : ffff800097ac7640
>> [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
>> [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
>> [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
>> [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
>> [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
>> [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
>> [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
>> [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
>> [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
>> [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
>> [ 108.174205][ T14] Call trace:
>> [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
>> [ 108.178138][ T14] unpin_user_page+0x80/0x10c
>> [ 108.180189][ T14] io_release_ubuf+0x84/0xf8
>> [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
>> [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
>> [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
>> [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
>> [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
>> [ 108.193207][ T14] process_one_work+0x7e8/0x155c
>> [ 108.195431][ T14] worker_thread+0x958/0xed8
>> [ 108.197561][ T14] kthread+0x5fc/0x75c
>> [ 108.199362][ T14] ret_from_fork+0x10/0x20
>>
>>
>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>
>> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>> (IOW, one we never pinned).
>>
>> So it's related to the io_coalesce_buffer() machinery.
>>
>> And in fact, in there, we have this weird logic:
>>
>> /* Store head pages only*/
>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>> ...
>>
>>
>> Essentially discarding the subpage information when coalescing tail pages.
>>
>>
>> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
>> flawed (we can -- in theory -- coalesce different folio page ranges in
>> a GUP result?).
>>
>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first
>> place.
>>
>> Can you look into that, as you are more familiar with the logic?
>
> Leaving this all quoted and adding Pavel, who wrote that code. I'm
> currently away, so can't look into this right now.
I did some more digging, but ended up being all confused about
io_check_coalesce_buffer() and io_imu_folio_data().
Assuming we pass a bunch of consecutive tail pages that all belong to
the same folio, then the loop in io_check_coalesce_buffer() will always
run into the
if (page_folio(page_array[i]) == folio &&
page_array[i] == page_array[i-1] + 1) {
count++;
continue;
}
case, making the function return "true" ... in io_coalesce_buffer(), we
then store the head page ... which seems very wrong.
In general, storing head pages when they are not the first page to be
coalesced seems wrong.
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 15:11 ` David Hildenbrand
@ 2025-06-23 16:48 ` Pavel Begunkov
2025-06-23 16:59 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2025-06-23 16:48 UTC (permalink / raw)
To: David Hildenbrand, Jens Axboe, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 6/23/25 16:11, David Hildenbrand wrote:
> On 23.06.25 16:58, Jens Axboe wrote:
>> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>>> On 23.06.25 12:10, David Hildenbrand wrote:
>>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via
>>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>>>
...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>>
>>> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>>> (IOW, one we never pinned).
>>>
>>> So it's related to the io_coalesce_buffer() machinery.
>>>
>>> And in fact, in there, we have this weird logic:
>>>
>>> /* Store head pages only*/
>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>>> ...
>>>
>>>
>>> Essentially discarding the subpage information when coalescing tail pages.
>>>
>>>
>>> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
>>> flawed (we can -- in theory -- coalesce different folio page ranges in
>>> a GUP result?).
>>>
>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first
>>> place.
>>>
>>> Can you look into that, as you are more familiar with the logic?
>>
>> Leaving this all quoted and adding Pavel, who wrote that code. I'm
>> currently away, so can't look into this right now.
Chenliang Li did, but not like it matters
> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data().
>
> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always
> run into the
>
> if (page_folio(page_array[i]) == folio &&
> page_array[i] == page_array[i-1] + 1) {
> count++;
> continue;
> }
>
> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong.
>
> In general, storing head pages when they are not the first page to be coalesced seems wrong.
Yes, it stores the head page even if the range passed to
pin_user_pages() doesn't cover the head page.
It should be converted to unpin_user_folio(), which doesn't seem
to do sanity_check_pinned_pages(). Do you think that'll be enough
(conceptually)? Nobody is actually touching the head page in those
cases apart from the final unpin, and storing the head page is
more convenient than keeping folios. I'll take a look if it can
be fully converted to folios w/o extra overhead.
--
Pavel Begunkov
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 16:48 ` Pavel Begunkov
@ 2025-06-23 16:59 ` David Hildenbrand
2025-06-23 17:36 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 16:59 UTC (permalink / raw)
To: Pavel Begunkov, Jens Axboe, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 23.06.25 18:48, Pavel Begunkov wrote:
> On 6/23/25 16:11, David Hildenbrand wrote:
>> On 23.06.25 16:58, Jens Axboe wrote:
>>> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>>>> On 23.06.25 12:10, David Hildenbrand wrote:
>>>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via
>>>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>>>>
> ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>>>
>>>> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>>>> (IOW, one we never pinned).
>>>>
>>>> So it's related to the io_coalesce_buffer() machinery.
>>>>
>>>> And in fact, in there, we have this weird logic:
>>>>
>>>> /* Store head pages only*/
>>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>>>> ...
>>>>
>>>>
>>>> Essentially discarding the subpage information when coalescing tail pages.
>>>>
>>>>
>>>> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
>>>> flawed (we can -- in theory -- coalesce different folio page ranges in
>>>> a GUP result?).
>>>>
>>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>>>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first
>>>> place.
>>>>
>>>> Can you look into that, as you are more familiar with the logic?
>>>
>>> Leaving this all quoted and adding Pavel, who wrote that code. I'm
>>> currently away, so can't look into this right now.
>
> Chenliang Li did, but not like it matters
>
>> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data().
>>
>> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always
>> run into the
>>
>> if (page_folio(page_array[i]) == folio &&
>> page_array[i] == page_array[i-1] + 1) {
>> count++;
>> continue;
>> }
>>
>> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong.
>>
>> In general, storing head pages when they are not the first page to be coalesced seems wrong.
>
> Yes, it stores the head page even if the range passed to
> pin_user_pages() doesn't cover the head page.
>
> It should be converted to unpin_user_folio(), which doesn't seem
> to do sanity_check_pinned_pages(). Do you think that'll be enough
> (conceptually)? Nobody is actually touching the head page in those
> cases apart from the final unpin, and storing the head page is
> more convenient than keeping folios. I'll take a look if it can
> be fully converted to folios w/o extra overhead.
Assuming we had from GUP
nr_pages = 2
pages[0] = folio_page(folio, 1)
pages[1] = folio_page(folio, 2)
After io_coalesce_buffer() we have
nr_pages = 1
pages[0] = folio_page(folio, 0)
Using unpin_user_folio() in all places where we could see something like
that would be the right thing to do. The sanity checks are not in
unpin_user_folio() for exactly that reason: we don't know which folio
pages we pinned.
But now I wonder where you make sure that "Nobody is actually touching
the head page"?
How do you get back the "which folio range" information after
io_coalesce_buffer() ?
If you rely on alignment in virtual address space for you, combined with
imu->folio_shift, that might not work reliably ...
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 16:59 ` David Hildenbrand
@ 2025-06-23 17:36 ` David Hildenbrand
2025-06-23 18:02 ` Pavel Begunkov
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-23 17:36 UTC (permalink / raw)
To: Pavel Begunkov, Jens Axboe, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 23.06.25 18:59, David Hildenbrand wrote:
> On 23.06.25 18:48, Pavel Begunkov wrote:
>> On 6/23/25 16:11, David Hildenbrand wrote:
>>> On 23.06.25 16:58, Jens Axboe wrote:
>>>> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>>>>> On 23.06.25 12:10, David Hildenbrand wrote:
>>>>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>>>>> On Mon, Jun 23, 2025 at 11:29?AM 'David Hildenbrand' via
>>>>>>> syzkaller-bugs <syzkaller-bugs@googlegroups.com> wrote:
>>>>>>>>
>> ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>>>>
>>>>> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>>>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>>>>> (IOW, one we never pinned).
>>>>>
>>>>> So it's related to the io_coalesce_buffer() machinery.
>>>>>
>>>>> And in fact, in there, we have this weird logic:
>>>>>
>>>>> /* Store head pages only*/
>>>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>>>>> ...
>>>>>
>>>>>
>>>>> Essentially discarding the subpage information when coalescing tail pages.
>>>>>
>>>>>
>>>>> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
>>>>> flawed (we can -- in theory -- coalesce different folio page ranges in
>>>>> a GUP result?).
>>>>>
>>>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>>>>> imu->bvec[i].bv_page, to end up pointing at (reading/writing) pages we didn't even pin in the first
>>>>> place.
>>>>>
>>>>> Can you look into that, as you are more familiar with the logic?
>>>>
>>>> Leaving this all quoted and adding Pavel, who wrote that code. I'm
>>>> currently away, so can't look into this right now.
>>
>> Chenliang Li did, but not like it matters
>>
>>> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data().
>>>
>>> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always
>>> run into the
>>>
>>> if (page_folio(page_array[i]) == folio &&
>>> page_array[i] == page_array[i-1] + 1) {
>>> count++;
>>> continue;
>>> }
>>>
>>> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong.
>>>
>>> In general, storing head pages when they are not the first page to be coalesced seems wrong.
>>
>> Yes, it stores the head page even if the range passed to
>> pin_user_pages() doesn't cover the head page.
>>
>> It should be converted to unpin_user_folio(), which doesn't seem
>> to do sanity_check_pinned_pages(). Do you think that'll be enough
>> (conceptually)? Nobody is actually touching the head page in those
>> cases apart from the final unpin, and storing the head page is
>> more convenient than keeping folios. I'll take a look if it can
>> be fully converted to folios w/o extra overhead.
>
> Assuming we had from GUP
>
> nr_pages = 2
> pages[0] = folio_page(folio, 1)
> pages[1] = folio_page(folio, 2)
>
> After io_coalesce_buffer() we have
>
> nr_pages = 1
> pages[0] = folio_page(folio, 0)
>
>
> Using unpin_user_folio() in all places where we could see something like
> that would be the right thing to do. The sanity checks are not in
> unpin_user_folio() for exactly that reason: we don't know which folio
> pages we pinned.
>
> But now I wonder where you make sure that "Nobody is actually touching
> the head page"?
>
> How do you get back the "which folio range" information after
> io_coalesce_buffer() ?
>
>
> If you rely on alignment in virtual address space for you, combined with
> imu->folio_shift, that might not work reliably ...
FWIW, applying the following on top of origin/master:
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index dbbcc5eb3dce5..e62a284dcf906 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
log_test_result(KSFT_FAIL);
goto munmap;
}
+ mem = mremap_mem;
size = mremap_size;
break;
case THP_RUN_PARTIAL_SHARED:
and then running the selftest, something is not happy:
...
# [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB)
[34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI
[34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1
[34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023
[34272.021984] RIP: 0010:memcpy+0xc/0x20
[34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206
[34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff
[34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000
[34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000
[34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff
[34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000
[34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000
[34272.021999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0
[34272.022001] PKRU: 55555554
[34272.022002] Call Trace:
[34272.022004] <TASK>
[34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0
[34272.022009] ? simple_xattr_get+0x59/0xa0
[34272.022012] generic_perform_write+0x86/0x2e0
[34272.022016] shmem_file_write_iter+0x86/0x90
[34272.022019] io_write+0xe4/0x390
[34272.022023] io_issue_sqe+0x65/0x4f0
[34272.022024] ? lock_timer_base+0x7d/0xc0
[34272.022027] io_wq_submit_work+0xb8/0x320
[34272.022029] io_worker_handle_work+0xd5/0x300
[34272.022032] io_wq_worker+0xda/0x300
[34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0
[34272.022037] ? __pfx_io_wq_worker+0x10/0x10
[34272.022039] ret_from_fork+0x34/0x50
[34272.022042] ? __pfx_io_wq_worker+0x10/0x10
[34272.022044] ret_from_fork_asm+0x1a/0x30
[34272.022047] </TASK>
There, we essentially mremap a THP to not be aligned in VA space, and then register half the
THP as a fixed buffer.
So ... my suspicion that this is all rather broken grows :)
--
Cheers,
David / dhildenb
* Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
2025-06-23 17:36 ` David Hildenbrand
@ 2025-06-23 18:02 ` Pavel Begunkov
0 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2025-06-23 18:02 UTC (permalink / raw)
To: David Hildenbrand, Jens Axboe, Alexander Potapenko
Cc: syzbot, akpm, catalin.marinas, jgg, jhubbard, linux-kernel,
linux-mm, peterx, syzkaller-bugs
On 6/23/25 18:36, David Hildenbrand wrote:
> On 23.06.25 18:59, David Hildenbrand wrote:
>> On 23.06.25 18:48, Pavel Begunkov wrote:
>>> On 6/23/25 16:11, David Hildenbrand wrote:
...>>> Yes, it stores the head page even if the range passed to
>>> pin_user_pages() doesn't cover the head page.
>>>
>>> It should be converted to unpin_user_folio(), which doesn't seem
>>> to do sanity_check_pinned_pages(). Do you think that'll be enough
>>> (conceptually)? Nobody is actually touching the head page in those
>>> cases apart from the final unpin, and storing the head page is
>>> more convenient than keeping folios. I'll take a look if it can
>>> be fully converted to folios w/o extra overhead.
>>
>> Assuming we had from GUP
>>
>> nr_pages = 2
>> pages[0] = folio_page(folio, 1)
>> pages[1] = folio_page(folio, 2)
>>
>> After io_coalesce_buffer() we have
>>
>> nr_pages = 1
>> pages[0] = folio_page(folio, 0)
>>
>>
>> Using unpin_user_folio() in all places where we could see something like
>> that would be the right thing to do. The sanity checks are not in
>> unpin_user_folio() for exactly that reason: we don't know which folio
>> pages we pinned.
Let's do that for starters
>> But now I wonder where you make sure that "Nobody is actually touching
>> the head page"?
>>
>> How do you get back the "which folio range" information after
>> io_coalesce_buffer() ?
>>
>>
>> If you rely on alignment in virtual address space for you, combined with
>> imu->folio_shift, that might not work reliably ...
>
> FWIW, applying the following on top of origin/master:
>
> diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
> index dbbcc5eb3dce5..e62a284dcf906 100644
> --- a/tools/testing/selftests/mm/cow.c
> +++ b/tools/testing/selftests/mm/cow.c
> @@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
> log_test_result(KSFT_FAIL);
> goto munmap;
> }
> + mem = mremap_mem;
> size = mremap_size;
> break;
> case THP_RUN_PARTIAL_SHARED:
>
>
> and then running the selftest, something is not happy:
>
> ...
> # [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB)
> [34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI
> [34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1
> [34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023
> [34272.021984] RIP: 0010:memcpy+0xc/0x20
> [34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
> [34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206
> [34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff
> [34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000
> [34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000
> [34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff
> [34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000
> [34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000
> [34272.021999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0
> [34272.022001] PKRU: 55555554
> [34272.022002] Call Trace:
> [34272.022004] <TASK>
> [34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0
> [34272.022009] ? simple_xattr_get+0x59/0xa0
> [34272.022012] generic_perform_write+0x86/0x2e0
> [34272.022016] shmem_file_write_iter+0x86/0x90
> [34272.022019] io_write+0xe4/0x390
> [34272.022023] io_issue_sqe+0x65/0x4f0
> [34272.022024] ? lock_timer_base+0x7d/0xc0
> [34272.022027] io_wq_submit_work+0xb8/0x320
> [34272.022029] io_worker_handle_work+0xd5/0x300
> [34272.022032] io_wq_worker+0xda/0x300
> [34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0
> [34272.022037] ? __pfx_io_wq_worker+0x10/0x10
> [34272.022039] ret_from_fork+0x34/0x50
> [34272.022042] ? __pfx_io_wq_worker+0x10/0x10
> [34272.022044] ret_from_fork_asm+0x1a/0x30
> [34272.022047] </TASK>
>
>
> There, we essentially mremap a THP to not be aligned in VA space, and then register half the
> THP as a fixed buffer.
>
> So ... my suspicion that this is all rather broken grows :)
It's supposed to calculate the offset from a user pointer and
then work with that, but I guess there is masking somewhere that
violates it; I'll check.
--
Pavel Begunkov
end of thread, other threads:[~2025-06-23 18:01 UTC | newest]
Thread overview: 17+ messages
2025-06-03 15:31 [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages syzbot
2025-06-03 16:22 ` David Hildenbrand
2025-06-03 17:20 ` Jens Axboe
2025-06-03 17:25 ` David Hildenbrand
2025-06-03 17:36 ` Jens Axboe
2025-06-21 21:52 ` syzbot
2025-06-23 9:29 ` David Hildenbrand
2025-06-23 9:53 ` Alexander Potapenko
2025-06-23 10:10 ` David Hildenbrand
2025-06-23 12:22 ` David Hildenbrand
2025-06-23 12:47 ` David Hildenbrand
2025-06-23 14:58 ` Jens Axboe
2025-06-23 15:11 ` David Hildenbrand
2025-06-23 16:48 ` Pavel Begunkov
2025-06-23 16:59 ` David Hildenbrand
2025-06-23 17:36 ` David Hildenbrand
2025-06-23 18:02 ` Pavel Begunkov