* Weird blockdev crash in 6.15-rc1?
@ 2025-04-08 17:51 Darrick J. Wong
2025-04-09 17:30 ` Darrick J. Wong
0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-08 17:51 UTC (permalink / raw)
To: Luis Chamberlain; +Cc: linux-block, linux-fsdevel, xfs
Hi everyone,
I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
for-next. I don't see it in 6.14. I'll try to bisect, but in the
meantime does this look familiar to anyone? The XFS configuration is
pretty boring:
MKFS_OPTIONS="-m autofsck=1 -n size=8192"
MOUNT_OPTIONS="-o uquota,gquota,pquota"
(4k fsblocks, x64 host, directory blocks are 8k)
From the stack trace, it looks like the null pointer dereference is in
this call to bdev_nr_sectors:
void guard_bio_eod(struct bio *bio)
{
	sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
because bio->bi_bdev is NULL for some reason. The crash itself seems to
be from do_mpage_readpage around line 304:
alloc_new:
	if (args->bio == NULL) {
		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
				gfp);

bdev is NULL here ^^^^

		if (args->bio == NULL)
			goto confused;
		args->bio->bi_iter.bi_sector = first_block << (blkbits - 9);
	}

	length = first_hole << blkbits;
	if (!bio_add_folio(args->bio, folio, length, 0)) {
		args->bio = mpage_bio_submit_read(args->bio);
		goto alloc_new;
	}

	relative_block = block_in_file - args->first_logical_block;
	nblocks = map_bh->b_size >> blkbits;
	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
	    (first_hole != blocks_per_folio))
		args->bio = mpage_bio_submit_read(args->bio);
My guess is that there was no previous call to ->get_block and that
blocks_per_folio == 0, so nobody ever actually set the local @bdev
variable to a non-NULL value. blocks_per_folio is perhaps zero because
xfs/032 tried formatting with a sector size of 64k, which causes the
bdev inode->i_blkbits to be set to 16, but for some reason we got a
folio that wasn't 64k in size:
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
<shrug> That's just my conjecture for now.
--D
[87005.669555] run fstests xfs/032 at 2025-04-07 17:24:41
[87006.359661] XFS (sda3): EXPERIMENTAL exchange range feature enabled. Use at your own risk!
[87006.362419] XFS (sda3): EXPERIMENTAL parent pointer feature enabled. Use at your own risk!
[87006.366059] XFS (sda3): Mounting V5 Filesystem ec1e349e-c0e7-4cb2-a8ac-b41da447e314
[87006.417753] XFS (sda3): Ending clean mount
<repeats a bunch of times>
[87272.286501] XFS (sda4): EXPERIMENTAL large block size feature enabled. Use at your own risk!
[87272.289810] XFS (sda4): EXPERIMENTAL exchange range feature enabled. Use at your own risk!
[87272.292854] XFS (sda4): EXPERIMENTAL parent pointer feature enabled. Use at your own risk!
[87272.296468] XFS (sda4): Mounting V5 Filesystem ab5d65e3-52b5-47dc-8ace-15d0abdddbb8
[87272.339664] XFS (sda4): Ending clean mount
[87272.345326] XFS (sda4): Quotacheck needed: Please wait.
[87272.354286] XFS (sda4): Quotacheck: Done.
[87272.478858] XFS (sda4): Unmounting Filesystem ab5d65e3-52b5-47dc-8ace-15d0abdddbb8
[87281.127350] XFS (sda4): EXPERIMENTAL large block size feature enabled. Use at your own risk!
[87281.132043] XFS (sda4): Mounting V5 Filesystem 30e523c4-47a4-44ac-9cd2-2287dc04737e
[87281.185758] XFS (sda4): Ending clean mount
[87281.190101] XFS (sda4): Quotacheck needed: Please wait.
[87281.198888] XFS (sda4): Quotacheck: Done.
[87281.293127] XFS (sda4): Unmounting Filesystem 30e523c4-47a4-44ac-9cd2-2287dc04737e
[87290.299787] BUG: kernel NULL pointer dereference, address: 0000000000000008
[87290.302137] #PF: supervisor read access in kernel mode
[87290.303833] #PF: error_code(0x0000) - not-present page
[87290.305547] PGD 0 P4D 0
[87290.306362] Oops: Oops: 0000 [#1] SMP
[87290.307687] CPU: 0 UID: 0 PID: 932780 Comm: (udev-worker) Tainted: G W 6.15.0-rc1-djwx #rc1 PREEMPT(lazy) 19ee1dc3e4e157eae36f07f1b9cd9c98a1775e33
[87290.312198] Tainted: [W]=WARN
[87290.313093] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
[87290.316499] RIP: 0010:guard_bio_eod+0x17/0x210
[87290.317911] Code: f0 ff 46 1c e8 da 5b 00 00 48 89 d8 5b c3 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 30 48 8b 47 08 <48> 8b 68 08 48 85 ed 74 1e 48 8b 47 20 48 89 fb 48 39 e8 73 12 44
[87290.323459] RSP: 0018:ffffc9000274f8f8 EFLAGS: 00010282
[87290.325253] RAX: 0000000000000000 RBX: ffff888105f06e00 RCX: 0000000000000000
[87290.327451] RDX: 0000000000000000 RSI: ffffea0004096840 RDI: ffff888105f06e00
[87290.329720] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[87290.332110] R10: ffff8881007df110 R11: ffffc9000274fa18 R12: ffffc9000274f9f8
[87290.334433] R13: 000000000000000d R14: 0000000000000000 R15: ffffea0004096840
[87290.336591] FS: 00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.338904] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.340452] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.342705] Call Trace:
[87290.343474] <TASK>
[87290.344197] ? bio_alloc_bioset+0xcd/0x520
[87290.345511] ? bio_add_page+0x62/0xb0
[87290.346582] do_mpage_readpage+0x3da/0x730
[87290.347948] mpage_readahead+0x95/0x110
[87290.349230] ? blkdev_iomap_begin+0x70/0x70
[87290.350578] read_pages+0x84/0x220
[87290.351636] ? filemap_add_folio+0xaf/0xd0
[87290.353004] page_cache_ra_unbounded+0x1a7/0x240
[87290.354602] force_page_cache_ra+0x92/0xb0
[87290.355922] filemap_get_pages+0x13b/0x760
[87290.357347] ? current_time+0x3b/0x110
[87290.358674] filemap_read+0x114/0x480
[87290.359919] blkdev_read_iter+0x64/0x120
[87290.361268] vfs_read+0x290/0x390
[87290.362422] ksys_read+0x6f/0xe0
[87290.363422] do_syscall_64+0x47/0x100
[87290.364668] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[87290.366097] RIP: 0033:0x7f84f1c5a25d
[87290.367031] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[87290.373149] RSP: 002b:00007ffc88a090e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[87290.375431] RAX: ffffffffffffffda RBX: 000055b4264c76b0 RCX: 00007f84f1c5a25d
[87290.377757] RDX: 0000000000000400 RSI: 000055b4264e84a8 RDI: 0000000000000010
[87290.379746] RBP: 0000000000000c00 R08: 00007f84f1d35380 R09: 00007f84f1d35380
[87290.381570] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b4264e8480
[87290.383535] R13: 0000000000000400 R14: 000055b4264c7708 R15: 000055b4264e8498
[87290.385827] </TASK>
[87290.386578] Modules linked in: dm_delay ext4 mbcache jbd2 btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio dm_zero dm_flakey xfs rpcsec_gss_krb5 auth_rpcgss nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnetlink bfq sha512_ssse3 sha512_generic pvpanic_mmio pvpanic sha256_ssse3 sch_fq_codel fuse configfs ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
[87290.404596] Dumping ftrace buffer:
[87290.405554] (ftrace buffer empty)
[87290.406677] CR2: 0000000000000008
[87290.407769] ---[ end trace 0000000000000000 ]---
[87290.409182] RIP: 0010:guard_bio_eod+0x17/0x210
[87290.410696] Code: f0 ff 46 1c e8 da 5b 00 00 48 89 d8 5b c3 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 30 48 8b 47 08 <48> 8b 68 08 48 85 ed 74 1e 48 8b 47 20 48 89 fb 48 39 e8 73 12 44
[87290.416951] RSP: 0018:ffffc9000274f8f8 EFLAGS: 00010282
[87290.418659] RAX: 0000000000000000 RBX: ffff888105f06e00 RCX: 0000000000000000
[87290.420948] RDX: 0000000000000000 RSI: ffffea0004096840 RDI: ffff888105f06e00
[87290.422926] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[87290.425178] R10: ffff8881007df110 R11: ffffc9000274fa18 R12: ffffc9000274f9f8
[87290.427631] R13: 000000000000000d R14: 0000000000000000 R15: ffffea0004096840
[87290.430009] FS: 00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.432636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.434574] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.436932] note: (udev-worker)[932780] exited with irqs disabled
[87290.439147] ------------[ cut here ]------------
[87290.440772] WARNING: CPU: 0 PID: 932780 at kernel/exit.c:900 do_exit+0x95a/0xbb0
[87290.443010] Modules linked in: dm_delay ext4 mbcache jbd2 btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio dm_zero dm_flakey xfs rpcsec_gss_krb5 auth_rpcgss nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnetlink bfq sha512_ssse3 sha512_generic pvpanic_mmio pvpanic sha256_ssse3 sch_fq_codel fuse configfs ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
[87290.459803] CPU: 0 UID: 0 PID: 932780 Comm: (udev-worker) Tainted: G D W 6.15.0-rc1-djwx #rc1 PREEMPT(lazy) 19ee1dc3e4e157eae36f07f1b9cd9c98a1775e33
[87290.464613] Tainted: [D]=DIE, [W]=WARN
[87290.466017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
[87290.469408] RIP: 0010:do_exit+0x95a/0xbb0
[87290.470885] Code: 83 b8 0b 00 00 65 01 05 40 a0 4f 01 e9 14 ff ff ff 4c 89 e6 bf 05 06 00 00 e8 b2 0f 01 00 e9 ca f7 ff ff 0f 0b e9 de f6 ff ff <0f> 0b e9 16 f7 ff ff 4c 89 e6 48 89 df e8 04 94 00 00 e9 f7 f9 ff
[87290.476385] RSP: 0018:ffffc9000274fed8 EFLAGS: 00010282
[87290.478117] RAX: 0000000080000000 RBX: ffff8881afe0c180 RCX: 0000000000000000
[87290.480231] RDX: 0000000000000001 RSI: 0000000000002710 RDI: 00000000ffffffff
[87290.482972] RBP: ffff88812a74df00 R08: 0000000000000000 R09: 205d323339363334
[87290.485443] R10: 6b726f772d766564 R11: 7528203a65746f6e R12: 0000000000000009
[87290.487900] R13: ffff88811a661100 R14: ffff8881afe0c180 R15: 0000000000000000
[87290.489893] FS: 00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.492491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.494307] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.496540] Call Trace:
[87290.497265] <TASK>
[87290.497958] make_task_dead+0x79/0x160
[87290.499214] rewind_stack_and_make_dead+0x16/0x20
[87290.500781] RIP: 0033:0x7f84f1c5a25d
[87290.501947] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[87290.507872] RSP: 002b:00007ffc88a090e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[87290.510463] RAX: ffffffffffffffda RBX: 000055b4264c76b0 RCX: 00007f84f1c5a25d
[87290.512701] RDX: 0000000000000400 RSI: 000055b4264e84a8 RDI: 0000000000000010
[87290.514952] RBP: 0000000000000c00 R08: 00007f84f1d35380 R09: 00007f84f1d35380
[87290.517277] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b4264e8480
[87290.519406] R13: 0000000000000400 R14: 000055b4264c7708 R15: 000055b4264e8498
[87290.521500] </TASK>
[87290.522388] ---[ end trace 0000000000000000 ]---
* Re: Weird blockdev crash in 6.15-rc1?
@ 2025-04-09 17:30 ` Darrick J. Wong
0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-09 17:30 UTC (permalink / raw)
To: Luis Chamberlain, Matthew Wilcox; +Cc: linux-block, linux-fsdevel, xfs
On Tue, Apr 08, 2025 at 10:51:25AM -0700, Darrick J. Wong wrote:
> Hi everyone,
>
> I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
> for-next. I don't see it in 6.14. I'll try to bisect, but in the
> meantime does this look familiar to anyone?
[...]
> My guess is that there was no previous call to ->get_block and that
> blocks_per_folio == 0, so nobody ever actually set the local @bdev
> variable to a non-NULL value. blocks_per_folio is perhaps zero because
> xfs/032 tried formatting with a sector size of 64k, which causes the
> bdev inode->i_blkbits to be set to 16, but for some reason we got a
> folio that wasn't 64k in size:
>
> 	const unsigned blkbits = inode->i_blkbits;
> 	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>
> <shrug> That's just my conjecture for now.

Ok so overnight my debugging patch confirmed this hypothesis:

XFS (sda4): Mounting V5 Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): Unmounting Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
FARK bio_alloc with NULL bdev?! blkbits 13 fsize 4096 blocks_per_folio 0

willy told me to set CONFIG_DEBUG_VM=y and rerun xfs/032. That didn't
turn anything up, so I decided to race it with:

while sleep 0.1; do blkid -c /dev/null; done

to simulate udev calling libblkid. That produced a debugging assertion
within 40 seconds:

page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x4f3bc4 pfn:0x43da4
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff8880446b4800
flags: 0x4fff80000000041(locked|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
raw: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
head: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000201 ffffea00010f6901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:871!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 3 UID: 0 PID: 26689 Comm: (udev-worker) Not tainted 6.15.0-rc1-djwx #rc1 PREEMPT(lazy) 8c302df0300eabbbd3cdc47fd812690b8d635c39
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS:  00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0
Call Trace:
 <TASK>
 ? memcg_list_lru_alloc+0x2d0/0x2d0
 filemap_add_folio+0x7f/0xd0
 page_cache_ra_unbounded+0x147/0x260
 force_page_cache_ra+0x92/0xb0
 filemap_get_pages+0x13b/0x7b0
 ? current_time+0x3b/0x110
 filemap_read+0x106/0x4c0
 ? _raw_spin_unlock+0x14/0x30
 blkdev_read_iter+0x64/0x120
 vfs_read+0x290/0x390
 ksys_read+0x6f/0xe0
 do_syscall_64+0x47/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f886428025d
Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f c
RSP: 002b:00007fff5ce76228 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055cf45839640 RCX: 00007f886428025d
RDX: 0000000000040000 RSI: 000055cf45996908 RDI: 000000000000000f
RBP: 00000004f3b80000 R08: 00007f886435add0 R09: 00007f886435add0
R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf459968e0
R13: 0000000000040000 R14: 000055cf45839698 R15: 000055cf459968f8
 </TASK>
Modules linked in: xfs ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnet]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS:  00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0

Digging into the VM, I noticed that mount is stuck in D state:

/proc/44312/task/44312/stack:
[<0>] folio_wait_bit_common+0x144/0x350
[<0>] truncate_inode_pages_range+0x4df/0x5b0
[<0>] set_blocksize+0x10b/0x130
[<0>] xfs_setsize_buftarg+0x1f/0x50 [xfs]
[<0>] xfs_setup_devices+0x1a/0xc0 [xfs]
[<0>] xfs_fs_fill_super+0x423/0xb20 [xfs]
[<0>] get_tree_bdev_flags+0x132/0x1d0
[<0>] vfs_get_tree+0x17/0xa0
[<0>] path_mount+0x721/0xa90
[<0>] __x64_sys_mount+0x10c/0x140
[<0>] do_syscall_64+0x47/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Regrettably the udev worker is gone, but my guess is that the process
exited with the folio locked, so now truncate_inode_pages_range can't
lock it to get rid of it.

Then it occurred to me to look at set_blocksize again:

	/* Don't change the size if it is same as current */
	if (inode->i_blkbits != blksize_bits(size)) {
		sync_blockdev(bdev);
		inode->i_blkbits = blksize_bits(size);
		mapping_set_folio_order_range(inode->i_mapping,
				get_order(size), get_order(size));
		kill_bdev(bdev);
	}

(Note that I changed mapping_set_folio_min_order here to
mapping_set_folio_order_range to shut up a folio migration bug that I
reported elsewhere on fsdevel yesterday, and willy suggested forcing the
max order as a temporary workaround.)

The update of i_blkbits and the order bits of mapping->flags are
performed before kill_bdev truncates the pagecache, which means there's
a window where there can be a !uptodate order-0 folio in the pagecache
but i_blkbits > PAGE_SHIFT (in this case, 13). The debugging assertion
above is from someone trying to install a too-small folio into the
pagecache. I think the "FARK" message I captured overnight is from
readahead trying to bring in contents from disk for this too-small folio
and failing.

So I think the answer is that set_blocksize needs to lock out folio_add,
flush the dirty folios, invalidate the entire bdev pagecache, set
i_blkbits and the folio order, and only then allow new additions to the
pagecache.

But then, which lock(s)? Were this a file on XFS I'd say that one has
to take i_rwsem and mmap_invalidate_lock before truncating the pagecache
but by my recollection bdev devices don't take either lock in their IO
paths.

--D
* Re: Weird blockdev crash in 6.15-rc1?
@ 2025-04-09 19:09 ` Darrick J. Wong
0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-09 19:09 UTC (permalink / raw)
To: Luis Chamberlain, Matthew Wilcox; +Cc: linux-block, linux-fsdevel, xfs
On Wed, Apr 09, 2025 at 10:30:15AM -0700, Darrick J. Wong wrote:
> Then it occurred to me to look at set_blocksize again:
>
> 	/* Don't change the size if it is same as current */
> 	if (inode->i_blkbits != blksize_bits(size)) {
> 		sync_blockdev(bdev);
> 		inode->i_blkbits = blksize_bits(size);
> 		mapping_set_folio_order_range(inode->i_mapping,
> 				get_order(size), get_order(size));
> 		kill_bdev(bdev);
> 	}
>
[...]
> So I think the answer is that set_blocksize needs to lock out folio_add,
> flush the dirty folios, invalidate the entire bdev pagecache, set
> i_blkbits and the folio order, and only then allow new additions to the
> pagecache.
>
> But then, which lock(s)? Were this a file on XFS I'd say that one has
> to take i_rwsem and mmap_invalidate_lock before truncating the pagecache
> but by my recollection bdev devices don't take either lock in their IO
> paths.

Here's my shabby attempt to lock my way out of this mess. My reproducer
no longer trips, but I don't think that means much.

--D

From: Darrick J. Wong <djwong@kernel.org>
Subject: [PATCH] block: fix race between set_blocksize and IO paths

With the new large sector size support, it's now the case that
set_blocksize needs to change i_blksize and the folio order with no
folios in the pagecache because the geometry changes cause problems with
the bufferhead code. Therefore, truncate the page cache after flushing
but before updating i_blksize.

However, that's not enough -- we also need to lock out file IO and page
faults during the update. Take both the i_rwsem and the invalidate_lock
in exclusive mode for invalidations, and in shared mode for read/write
operations.

I don't know if this is the correct fix.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 block/bdev.c      | 12 ++++++++++++
 block/blk-zoned.c |  5 ++++-
 block/fops.c      |  7 +++++++
 block/ioctl.c     |  6 ++++++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 7b4e35a661b0c9..0cbdac46d98d86 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -169,11 +169,23 @@ int set_blocksize(struct file *file, int size)
 
 	/* Don't change the size if it is same as current */
 	if (inode->i_blkbits != blksize_bits(size)) {
+		/* Prevent concurrent IO operations */
+		inode_lock(inode);
+		filemap_invalidate_lock(inode->i_mapping);
+
+		/*
+		 * Flush and truncate the pagecache before we reconfigure the
+		 * mapping geometry because folio sizes are variable now.
+		 */
 		sync_blockdev(bdev);
+		kill_bdev(bdev);
+
 		inode->i_blkbits = blksize_bits(size);
 		mapping_set_folio_order_range(inode->i_mapping,
 				get_order(size), get_order(size));
 		kill_bdev(bdev);
+		filemap_invalidate_unlock(inode->i_mapping);
+		inode_unlock(inode);
 	}
 	return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0c77244a35c92e..8f15d1aa6eb89a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		op = REQ_OP_ZONE_RESET;
 
 		/* Invalidate the page cache, including dirty pages. */
+		inode_lock(bdev->bd_mapping->host);
 		filemap_invalidate_lock(bdev->bd_mapping);
 		ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
 		if (ret)
@@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
 
 fail:
-	if (cmd == BLKRESETZONE)
+	if (cmd == BLKRESETZONE) {
 		filemap_invalidate_unlock(bdev->bd_mapping);
+		inode_unlock(bdev->bd_mapping->host);
+	}
 
 	return ret;
 }
diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce0a..f46ae08fac33dd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -746,7 +746,9 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = direct_write_fallback(iocb, from, ret,
 				blkdev_buffered_write(iocb, from));
 	} else {
+		inode_lock_shared(bd_inode);
 		ret = blkdev_buffered_write(iocb, from);
+		inode_unlock_shared(bd_inode);
 	}
 
 	if (ret > 0)
@@ -757,6 +759,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+	struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	loff_t size = bdev_nr_bytes(bdev);
 	loff_t pos = iocb->ki_pos;
@@ -793,7 +796,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		goto reexpand;
 	}
 
+	inode_lock_shared(bd_inode);
 	ret = filemap_read(iocb, to, ret);
+	inode_unlock_shared(bd_inode);
 
 reexpand:
 	if (unlikely(shorted))
@@ -836,6 +841,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
+	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
 	/*
@@ -868,6 +874,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
+	inode_unlock(inode);
 	return error;
 }
diff --git a/block/ioctl.c b/block/ioctl.c
index faa40f383e2736..e472cc1030c60c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	if (err)
 		return err;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
 	if (err)
@@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	blk_finish_plug(&plug);
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 
@@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
 	    end > bdev_nr_bytes(bdev))
 		return -EINVAL;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end - 1);
 	if (!err)
 		err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
 						GFP_KERNEL);
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 
 	return err;
 }
 
@@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages */
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end);
 	if (err)
@@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
* Re: Weird blockdev crash in 6.15-rc1?
@ 2025-04-10  7:40 ` Christoph Hellwig
0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2025-04-10 7:40 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Luis Chamberlain, Matthew Wilcox, linux-block, linux-fsdevel, xfs
On Wed, Apr 09, 2025 at 12:09:07PM -0700, Darrick J. Wong wrote:
> Subject: [PATCH] block: fix race between set_blocksize and IO paths
>
> With the new large sector size support, it's now the case that
> set_blocksize needs to change i_blksize and the folio order with no
> folios in the pagecache because the geometry changes cause problems with
> the bufferhead code.

Urrg. I wish we could just get out of the game of messing with
block device inode settings from file systems. I guess doing it when
using buffer_heads is hard, but file systems without buffer heads
should have a way out of even propagating their block size to the
block device inode. And file systems with buffer heads should probably
not support large folios like this :P
* Re: Weird blockdev crash in 6.15-rc1?
@ 2025-04-10 15:25 ` Darrick J. Wong
0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-10 15:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Luis Chamberlain, Matthew Wilcox, linux-block, linux-fsdevel, xfs
On Thu, Apr 10, 2025 at 12:40:15AM -0700, Christoph Hellwig wrote:
> Urrg. I wish we could just get out of the game of messing with
> block device inode settings from file systems. I guess doing it when
> using buffer_heads is hard, but file systems without buffer heads
> should have a way out of even propagating their block size to the
> block device inode. And file systems with buffer heads should probably
> not support large folios like this :P

Heh. Why does xfs still call set_blocksize, anyway? I can understand
why we want to validate that the fs sector size is a power of 2,
greater than 512, and not smaller than the LBA size; and why we flush
the dirty bdev pagecache. But do we really need to fiddle with
i_blksize or dump the pagecache?

--D
* Re: Weird blockdev crash in 6.15-rc1?
@ 2025-04-11 20:39 ` Luis Chamberlain
0 siblings, 0 replies; 6+ messages in thread
From: Luis Chamberlain @ 2025-04-11 20:39 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Matthew Wilcox, linux-block, linux-fsdevel, xfs
On Thu, Apr 10, 2025 at 08:25:54AM -0700, Darrick J. Wong wrote:
> Heh. Why does xfs still call set_blocksize, anyway?

Hrm. That's called from xfs_setsize_buftarg() so the buffer target, and
it uses a different min order. We have logdev and rtdev. I just tested
and indeed, a 32k logical sector size drive can be used for data with
for example a 512k logical sector size logdev.

  Luis