public inbox for linux-xfs@vger.kernel.org
* Weird blockdev crash in 6.15-rc1?
@ 2025-04-08 17:51 Darrick J. Wong
  2025-04-09 17:30 ` Darrick J. Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-08 17:51 UTC (permalink / raw)
  To: Luis Chamberlain; +Cc: linux-block, linux-fsdevel, xfs

Hi everyone,

I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
for-next.  I don't see it in 6.14.  I'll try to bisect, but in the
meantime does this look familiar to anyone?  The XFS configuration is
pretty boring:

MKFS_OPTIONS="-m autofsck=1, -n size=8192"
MOUNT_OPTIONS="-o uquota,gquota,pquota"

(4k fsblocks, x64 host, directory blocks are 8k)

From the stack trace, it looks like the null pointer dereference is in
this call to bdev_nr_sectors:

void guard_bio_eod(struct bio *bio)
{
	sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);

because bio->bi_bdev is NULL for some reason.  The crash itself seems to
be from do_mpage_readpage around line 304:

alloc_new:
	if (args->bio == NULL) {
		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
				      gfp);

bdev is NULL here                     ^^^^

		if (args->bio == NULL)
			goto confused;
		args->bio->bi_iter.bi_sector = first_block << (blkbits - 9);
	}

	length = first_hole << blkbits;
	if (!bio_add_folio(args->bio, folio, length, 0)) {
		args->bio = mpage_bio_submit_read(args->bio);
		goto alloc_new;
	}

	relative_block = block_in_file - args->first_logical_block;
	nblocks = map_bh->b_size >> blkbits;
	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
	    (first_hole != blocks_per_folio))
		args->bio = mpage_bio_submit_read(args->bio);

My guess is that there was no previous call to ->get_block and that
blocks_per_folio == 0, so nobody ever actually set the local @bdev
variable to a non-NULL value.  blocks_per_folio is perhaps zero because
xfs/032 tried formatting with a sector size of 64k, which causes the
bdev inode->i_blkbits to be set to 16, but for some reason we got a
folio that wasn't 64k in size:

	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;

<shrug> That's just my conjecture for now.

--D

[87005.669555] run fstests xfs/032 at 2025-04-07 17:24:41
[87006.359661] XFS (sda3): EXPERIMENTAL exchange range feature enabled.  Use at your own risk!
[87006.362419] XFS (sda3): EXPERIMENTAL parent pointer feature enabled.  Use at your own risk!
[87006.366059] XFS (sda3): Mounting V5 Filesystem ec1e349e-c0e7-4cb2-a8ac-b41da447e314
[87006.417753] XFS (sda3): Ending clean mount

<repeats a bunch of times>

[87272.286501] XFS (sda4): EXPERIMENTAL large block size feature enabled.  Use at your own risk!
[87272.289810] XFS (sda4): EXPERIMENTAL exchange range feature enabled.  Use at your own risk!
[87272.292854] XFS (sda4): EXPERIMENTAL parent pointer feature enabled.  Use at your own risk!
[87272.296468] XFS (sda4): Mounting V5 Filesystem ab5d65e3-52b5-47dc-8ace-15d0abdddbb8
[87272.339664] XFS (sda4): Ending clean mount
[87272.345326] XFS (sda4): Quotacheck needed: Please wait.
[87272.354286] XFS (sda4): Quotacheck: Done.
[87272.478858] XFS (sda4): Unmounting Filesystem ab5d65e3-52b5-47dc-8ace-15d0abdddbb8
[87281.127350] XFS (sda4): EXPERIMENTAL large block size feature enabled.  Use at your own risk!
[87281.132043] XFS (sda4): Mounting V5 Filesystem 30e523c4-47a4-44ac-9cd2-2287dc04737e
[87281.185758] XFS (sda4): Ending clean mount
[87281.190101] XFS (sda4): Quotacheck needed: Please wait.
[87281.198888] XFS (sda4): Quotacheck: Done.
[87281.293127] XFS (sda4): Unmounting Filesystem 30e523c4-47a4-44ac-9cd2-2287dc04737e
[87290.299787] BUG: kernel NULL pointer dereference, address: 0000000000000008
[87290.302137] #PF: supervisor read access in kernel mode
[87290.303833] #PF: error_code(0x0000) - not-present page
[87290.305547] PGD 0 P4D 0 
[87290.306362] Oops: Oops: 0000 [#1] SMP
[87290.307687] CPU: 0 UID: 0 PID: 932780 Comm: (udev-worker) Tainted: G        W           6.15.0-rc1-djwx #rc1 PREEMPT(lazy)  19ee1dc3e4e157eae36f07f1b9cd9c98a1775e33
[87290.312198] Tainted: [W]=WARN
[87290.313093] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
[87290.316499] RIP: 0010:guard_bio_eod+0x17/0x210
[87290.317911] Code: f0 ff 46 1c e8 da 5b 00 00 48 89 d8 5b c3 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 30 48 8b 47 08 <48> 8b 68 08 48 85 ed 74 1e 48 8b 47 20 48 89 fb 48 39 e8 73 12 44
[87290.323459] RSP: 0018:ffffc9000274f8f8 EFLAGS: 00010282
[87290.325253] RAX: 0000000000000000 RBX: ffff888105f06e00 RCX: 0000000000000000
[87290.327451] RDX: 0000000000000000 RSI: ffffea0004096840 RDI: ffff888105f06e00
[87290.329720] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[87290.332110] R10: ffff8881007df110 R11: ffffc9000274fa18 R12: ffffc9000274f9f8
[87290.334433] R13: 000000000000000d R14: 0000000000000000 R15: ffffea0004096840
[87290.336591] FS:  00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.338904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.340452] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.342705] Call Trace:
[87290.343474]  <TASK>
[87290.344197]  ? bio_alloc_bioset+0xcd/0x520
[87290.345511]  ? bio_add_page+0x62/0xb0
[87290.346582]  do_mpage_readpage+0x3da/0x730
[87290.347948]  mpage_readahead+0x95/0x110
[87290.349230]  ? blkdev_iomap_begin+0x70/0x70
[87290.350578]  read_pages+0x84/0x220
[87290.351636]  ? filemap_add_folio+0xaf/0xd0
[87290.353004]  page_cache_ra_unbounded+0x1a7/0x240
[87290.354602]  force_page_cache_ra+0x92/0xb0
[87290.355922]  filemap_get_pages+0x13b/0x760
[87290.357347]  ? current_time+0x3b/0x110
[87290.358674]  filemap_read+0x114/0x480
[87290.359919]  blkdev_read_iter+0x64/0x120
[87290.361268]  vfs_read+0x290/0x390
[87290.362422]  ksys_read+0x6f/0xe0
[87290.363422]  do_syscall_64+0x47/0x100
[87290.364668]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[87290.366097] RIP: 0033:0x7f84f1c5a25d
[87290.367031] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[87290.373149] RSP: 002b:00007ffc88a090e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[87290.375431] RAX: ffffffffffffffda RBX: 000055b4264c76b0 RCX: 00007f84f1c5a25d
[87290.377757] RDX: 0000000000000400 RSI: 000055b4264e84a8 RDI: 0000000000000010
[87290.379746] RBP: 0000000000000c00 R08: 00007f84f1d35380 R09: 00007f84f1d35380
[87290.381570] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b4264e8480
[87290.383535] R13: 0000000000000400 R14: 000055b4264c7708 R15: 000055b4264e8498
[87290.385827]  </TASK>
[87290.386578] Modules linked in: dm_delay ext4 mbcache jbd2 btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio dm_zero dm_flakey xfs rpcsec_gss_krb5 auth_rpcgss nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnetlink bfq sha512_ssse3 sha512_generic pvpanic_mmio pvpanic sha256_ssse3 sch_fq_codel fuse configfs ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
[87290.404596] Dumping ftrace buffer:
[87290.405554]    (ftrace buffer empty)
[87290.406677] CR2: 0000000000000008
[87290.407769] ---[ end trace 0000000000000000 ]---
[87290.409182] RIP: 0010:guard_bio_eod+0x17/0x210
[87290.410696] Code: f0 ff 46 1c e8 da 5b 00 00 48 89 d8 5b c3 0f 0b 0f 1f 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 30 48 8b 47 08 <48> 8b 68 08 48 85 ed 74 1e 48 8b 47 20 48 89 fb 48 39 e8 73 12 44
[87290.416951] RSP: 0018:ffffc9000274f8f8 EFLAGS: 00010282
[87290.418659] RAX: 0000000000000000 RBX: ffff888105f06e00 RCX: 0000000000000000
[87290.420948] RDX: 0000000000000000 RSI: ffffea0004096840 RDI: ffff888105f06e00
[87290.422926] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[87290.425178] R10: ffff8881007df110 R11: ffffc9000274fa18 R12: ffffc9000274f9f8
[87290.427631] R13: 000000000000000d R14: 0000000000000000 R15: ffffea0004096840
[87290.430009] FS:  00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.432636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.434574] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.436932] note: (udev-worker)[932780] exited with irqs disabled
[87290.439147] ------------[ cut here ]------------
[87290.440772] WARNING: CPU: 0 PID: 932780 at kernel/exit.c:900 do_exit+0x95a/0xbb0
[87290.443010] Modules linked in: dm_delay ext4 mbcache jbd2 btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress dm_log_writes dm_thin_pool dm_persistent_data dm_bio_prison dm_snapshot dm_bufio dm_zero dm_flakey xfs rpcsec_gss_krb5 auth_rpcgss nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnetlink bfq sha512_ssse3 sha512_generic pvpanic_mmio pvpanic sha256_ssse3 sch_fq_codel fuse configfs ip_tables x_tables overlay nfsv4 af_packet [last unloaded: scsi_debug]
[87290.459803] CPU: 0 UID: 0 PID: 932780 Comm: (udev-worker) Tainted: G      D W           6.15.0-rc1-djwx #rc1 PREEMPT(lazy)  19ee1dc3e4e157eae36f07f1b9cd9c98a1775e33
[87290.464613] Tainted: [D]=DIE, [W]=WARN
[87290.466017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
[87290.469408] RIP: 0010:do_exit+0x95a/0xbb0
[87290.470885] Code: 83 b8 0b 00 00 65 01 05 40 a0 4f 01 e9 14 ff ff ff 4c 89 e6 bf 05 06 00 00 e8 b2 0f 01 00 e9 ca f7 ff ff 0f 0b e9 de f6 ff ff <0f> 0b e9 16 f7 ff ff 4c 89 e6 48 89 df e8 04 94 00 00 e9 f7 f9 ff
[87290.476385] RSP: 0018:ffffc9000274fed8 EFLAGS: 00010282
[87290.478117] RAX: 0000000080000000 RBX: ffff8881afe0c180 RCX: 0000000000000000
[87290.480231] RDX: 0000000000000001 RSI: 0000000000002710 RDI: 00000000ffffffff
[87290.482972] RBP: ffff88812a74df00 R08: 0000000000000000 R09: 205d323339363334
[87290.485443] R10: 6b726f772d766564 R11: 7528203a65746f6e R12: 0000000000000009
[87290.487900] R13: ffff88811a661100 R14: ffff8881afe0c180 R15: 0000000000000000
[87290.489893] FS:  00007f84f15528c0(0000) GS:ffff8884aa858000(0000) knlGS:0000000000000000
[87290.492491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87290.494307] CR2: 0000000000000008 CR3: 00000001052f7000 CR4: 00000000003506f0
[87290.496540] Call Trace:
[87290.497265]  <TASK>
[87290.497958]  make_task_dead+0x79/0x160
[87290.499214]  rewind_stack_and_make_dead+0x16/0x20
[87290.500781] RIP: 0033:0x7f84f1c5a25d
[87290.501947] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[87290.507872] RSP: 002b:00007ffc88a090e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[87290.510463] RAX: ffffffffffffffda RBX: 000055b4264c76b0 RCX: 00007f84f1c5a25d
[87290.512701] RDX: 0000000000000400 RSI: 000055b4264e84a8 RDI: 0000000000000010
[87290.514952] RBP: 0000000000000c00 R08: 00007f84f1d35380 R09: 00007f84f1d35380
[87290.517277] R10: 0000000000000000 R11: 0000000000000246 R12: 000055b4264e8480
[87290.519406] R13: 0000000000000400 R14: 000055b4264c7708 R15: 000055b4264e8498
[87290.521500]  </TASK>
[87290.522388] ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird blockdev crash in 6.15-rc1?
  2025-04-08 17:51 Weird blockdev crash in 6.15-rc1? Darrick J. Wong
@ 2025-04-09 17:30 ` Darrick J. Wong
  2025-04-09 19:09   ` Darrick J. Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-09 17:30 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox; +Cc: linux-block, linux-fsdevel, xfs

On Tue, Apr 08, 2025 at 10:51:25AM -0700, Darrick J. Wong wrote:
> Hi everyone,
> 
> I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
> for-next.  I don't see it in 6.14.  I'll try to bisect, but in the
> meantime does this look familiar to anyone?  The XFS configuration is
> pretty boring:
> 
> MKFS_OPTIONS="-m autofsck=1, -n size=8192"
> MOUNT_OPTIONS="-o uquota,gquota,pquota"
> 
> (4k fsblocks, x64 host, directory blocks are 8k)
> 
> From the stack trace, it looks like the null pointer dereference is in
> this call to bdev_nr_sectors:
> 
> void guard_bio_eod(struct bio *bio)
> {
> 	sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
> 
> because bio->bi_bdev is NULL for some reason.  The crash itself seems to
> be from do_mpage_readpage around line 304:
> 
> alloc_new:
> 	if (args->bio == NULL) {
> 		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
> 				      gfp);
> 
> bdev is NULL here                     ^^^^
> 
> 		if (args->bio == NULL)
> 			goto confused;
> 		args->bio->bi_iter.bi_sector = first_block << (blkbits - 9);
> 	}
> 
> 	length = first_hole << blkbits;
> 	if (!bio_add_folio(args->bio, folio, length, 0)) {
> 		args->bio = mpage_bio_submit_read(args->bio);
> 		goto alloc_new;
> 	}
> 
> 	relative_block = block_in_file - args->first_logical_block;
> 	nblocks = map_bh->b_size >> blkbits;
> 	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
> 	    (first_hole != blocks_per_folio))
> 		args->bio = mpage_bio_submit_read(args->bio);
> 
> My guess is that there was no previous call to ->get_block and that
> blocks_per_folio == 0, so nobody ever actually set the local @bdev
> variable to a non-NULL value.  blocks_per_folio is perhaps zero because
> xfs/032 tried formatting with a sector size of 64k, which causes the
> bdev inode->i_blkbits to be set to 16, but for some reason we got a
> folio that wasn't 64k in size:
> 
> 	const unsigned blkbits = inode->i_blkbits;
> 	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
> 
> <shrug> That's just my conjecture for now.

Ok so overnight my debugging patch confirmed this hypothesis:

XFS (sda4): Mounting V5 Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): Unmounting Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
FARK bio_alloc with NULL bdev?! blkbits 13 fsize 4096 blocks_per_folio 0

willy told me to set CONFIG_DEBUG_VM=y and rerun xfs/032.  That
didn't turn anything up, so I decided to race it with:

	while sleep 0.1; do blkid -c /dev/null; done

to simulate udev calling libblkid.  That produced a debugging assertion
within 40 seconds:

page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x4f3bc4 pfn:0x43da4
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff8880446b4800
flags: 0x4fff80000000041(locked|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
raw: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
head: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000201 ffffea00010f6901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:871!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 3 UID: 0 PID: 26689 Comm: (udev-worker) Not tainted 6.15.0-rc1-djwx #rc1 PREEMPT(lazy)  8c302df0300eabbbd3cdc47fd812690b8d635c39
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS:  00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0
Call Trace:
 <TASK>
 ? memcg_list_lru_alloc+0x2d0/0x2d0
 filemap_add_folio+0x7f/0xd0
 page_cache_ra_unbounded+0x147/0x260
 force_page_cache_ra+0x92/0xb0
 filemap_get_pages+0x13b/0x7b0
 ? current_time+0x3b/0x110
 filemap_read+0x106/0x4c0
 ? _raw_spin_unlock+0x14/0x30
 blkdev_read_iter+0x64/0x120
 vfs_read+0x290/0x390
 ksys_read+0x6f/0xe0
 do_syscall_64+0x47/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f886428025d
Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f c
RSP: 002b:00007fff5ce76228 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055cf45839640 RCX: 00007f886428025d
RDX: 0000000000040000 RSI: 000055cf45996908 RDI: 000000000000000f
RBP: 00000004f3b80000 R08: 00007f886435add0 R09: 00007f886435add0
R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf459968e0
R13: 0000000000040000 R14: 000055cf45839698 R15: 000055cf459968f8
 </TASK>
Modules linked in: xfs ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnet]
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS:  00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0

Digging into the VM, I noticed that mount is stuck in D state:

/proc/44312/task/44312/stack :
[<0>] folio_wait_bit_common+0x144/0x350
[<0>] truncate_inode_pages_range+0x4df/0x5b0
[<0>] set_blocksize+0x10b/0x130
[<0>] xfs_setsize_buftarg+0x1f/0x50 [xfs]
[<0>] xfs_setup_devices+0x1a/0xc0 [xfs]
[<0>] xfs_fs_fill_super+0x423/0xb20 [xfs]
[<0>] get_tree_bdev_flags+0x132/0x1d0
[<0>] vfs_get_tree+0x17/0xa0
[<0>] path_mount+0x721/0xa90
[<0>] __x64_sys_mount+0x10c/0x140
[<0>] do_syscall_64+0x47/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Regrettably the udev worker is gone, but my guess is that the process
exited with the folio locked, so now truncate_inode_pages_range can't
lock it to get rid of it.

Then it occurred to me to look at set_blocksize again:

	/* Don't change the size if it is same as current */
	if (inode->i_blkbits != blksize_bits(size)) {
		sync_blockdev(bdev);
		inode->i_blkbits = blksize_bits(size);
		mapping_set_folio_order_range(inode->i_mapping,
				get_order(size), get_order(size));
		kill_bdev(bdev);
	}

(Note that I changed mapping_set_folio_min_order here to
mapping_set_folio_order_range to shut up a folio migration bug that I
reported elsewhere on fsdevel yesterday, and willy suggested forcing the
max order as a temporary workaround.)

The update of i_blkbits and the order bits of mapping->flags are
performed before kill_bdev truncates the pagecache, which means there's
a window where there can be a !uptodate order-0 folio in the pagecache
but i_blkbits > PAGE_SHIFT (in this case, 13).  The debugging assertion
above is from someone trying to install a too-small folio into the
pagecache.  I think the "FARK" message I captured overnight is from
readahead trying to bring in contents from disk for this too-small folio
and failing.

So I think the answer is that set_blocksize needs to lock out folio_add,
flush the dirty folios, invalidate the entire bdev pagecache, set
i_blkbits and the folio order, and only then allow new additions to the
pagecache.

But then, which lock(s)?  Were this a file on XFS I'd say that one has
to take i_rwsem and mmap_invalidate_lock before truncating the pagecache
but by my recollection bdev devices don't take either lock in their IO
paths.

--D


* Re: Weird blockdev crash in 6.15-rc1?
  2025-04-09 17:30 ` Darrick J. Wong
@ 2025-04-09 19:09   ` Darrick J. Wong
  2025-04-10  7:40     ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-09 19:09 UTC (permalink / raw)
  To: Luis Chamberlain, Matthew Wilcox; +Cc: linux-block, linux-fsdevel, xfs

On Wed, Apr 09, 2025 at 10:30:15AM -0700, Darrick J. Wong wrote:

> Then it occurred to me to look at set_blocksize again:
> 
> 	/* Don't change the size if it is same as current */
> 	if (inode->i_blkbits != blksize_bits(size)) {
> 		sync_blockdev(bdev);
> 		inode->i_blkbits = blksize_bits(size);
> 		mapping_set_folio_order_range(inode->i_mapping,
> 				get_order(size), get_order(size));
> 		kill_bdev(bdev);
> 	}
> 
> (Note that I changed mapping_set_folio_min_order here to
> mapping_set_folio_order_range to shut up a folio migration bug that I
> reported elsewhere on fsdevel yesterday, and willy suggested forcing the
> max order as a temporary workaround.)
> 
> The update of i_blkbits and the order bits of mapping->flags are
> performed before kill_bdev truncates the pagecache, which means there's
> a window where there can be a !uptodate order-0 folio in the pagecache
> but i_blkbits > PAGE_SHIFT (in this case, 13).  The debugging assertion
> above is from someone trying to install a too-small folio into the
> pagecache.  I think the "FARK" message I captured overnight is from
> readahead trying to bring in contents from disk for this too-small folio
> and failing.
> 
> So I think the answer is that set_blocksize needs to lock out folio_add,
> flush the dirty folios, invalidate the entire bdev pagecache, set
> i_blkbits and the folio order, and only then allow new additions to the
> pagecache.
> 
> But then, which lock(s)?  Were this a file on XFS I'd say that one has
> to take i_rwsem and mmap_invalidate_lock before truncating the pagecache
> but by my recollection bdev devices don't take either lock in their IO
> paths.

Here's my shabby attempt to lock my way out of this mess.  My reproducer
no longer trips, but I don't think that means much.

--D

From: Darrick J. Wong <djwong@kernel.org>
Subject: [PATCH] block: fix race between set_blocksize and IO paths

With the new large sector size support, set_blocksize now needs to
change i_blkbits and the folio order while there are no folios in the
pagecache, because the geometry change causes problems for the
bufferhead code.

Therefore, truncate the page cache after flushing but before updating
i_blkbits.  However, that's not enough -- we also need to lock out file
IO and page faults during the update.  Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.

I don't know if this is the correct fix.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 block/bdev.c      |   12 ++++++++++++
 block/blk-zoned.c |    5 ++++-
 block/fops.c      |    7 +++++++
 block/ioctl.c     |    6 ++++++
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 7b4e35a661b0c9..0cbdac46d98d86 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -169,11 +169,23 @@ int set_blocksize(struct file *file, int size)
 
 	/* Don't change the size if it is same as current */
 	if (inode->i_blkbits != blksize_bits(size)) {
+		/* Prevent concurrent IO operations */
+		inode_lock(inode);
+		filemap_invalidate_lock(inode->i_mapping);
+
+		/*
+		 * Flush and truncate the pagecache before we reconfigure the
+		 * mapping geometry because folio sizes are variable now.
+		 */
 		sync_blockdev(bdev);
+		kill_bdev(bdev);
+
 		inode->i_blkbits = blksize_bits(size);
 		mapping_set_folio_order_range(inode->i_mapping,
 				get_order(size), get_order(size));
 		kill_bdev(bdev);
+		filemap_invalidate_unlock(inode->i_mapping);
+		inode_unlock(inode);
 	}
 	return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0c77244a35c92e..8f15d1aa6eb89a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		op = REQ_OP_ZONE_RESET;
 
 		/* Invalidate the page cache, including dirty pages. */
+		inode_lock(bdev->bd_mapping->host);
 		filemap_invalidate_lock(bdev->bd_mapping);
 		ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
 		if (ret)
@@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
 
 fail:
-	if (cmd == BLKRESETZONE)
+	if (cmd == BLKRESETZONE) {
 		filemap_invalidate_unlock(bdev->bd_mapping);
+		inode_unlock(bdev->bd_mapping->host);
+	}
 
 	return ret;
 }
diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce0a..f46ae08fac33dd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -746,7 +746,9 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = direct_write_fallback(iocb, from, ret,
 					blkdev_buffered_write(iocb, from));
 	} else {
+		inode_lock_shared(bd_inode);
 		ret = blkdev_buffered_write(iocb, from);
+		inode_unlock_shared(bd_inode);
 	}
 
 	if (ret > 0)
@@ -757,6 +759,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+	struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	loff_t size = bdev_nr_bytes(bdev);
 	loff_t pos = iocb->ki_pos;
@@ -793,7 +796,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			goto reexpand;
 	}
 
+	inode_lock_shared(bd_inode);
 	ret = filemap_read(iocb, to, ret);
+	inode_unlock_shared(bd_inode);
 
 reexpand:
 	if (unlikely(shorted))
@@ -836,6 +841,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
+	inode_lock(inode);
 	filemap_invalidate_lock(inode->i_mapping);
 
 	/*
@@ -868,6 +874,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
+	inode_unlock(inode);
 	return error;
 }
 
diff --git a/block/ioctl.c b/block/ioctl.c
index faa40f383e2736..e472cc1030c60c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	if (err)
 		return err;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
 	if (err)
@@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 	blk_finish_plug(&plug);
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 
@@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
 	    end > bdev_nr_bytes(bdev))
 		return -EINVAL;
 
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end - 1);
 	if (!err)
 		err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
 						GFP_KERNEL);
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 
@@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages */
+	inode_lock(bdev->bd_mapping->host);
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, end);
 	if (err)
@@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);
+	inode_unlock(bdev->bd_mapping->host);
 	return err;
 }
 


* Re: Weird blockdev crash in 6.15-rc1?
  2025-04-09 19:09   ` Darrick J. Wong
@ 2025-04-10  7:40     ` Christoph Hellwig
  2025-04-10 15:25       ` Darrick J. Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2025-04-10  7:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Luis Chamberlain, Matthew Wilcox, linux-block, linux-fsdevel, xfs

On Wed, Apr 09, 2025 at 12:09:07PM -0700, Darrick J. Wong wrote:
> Subject: [PATCH] block: fix race between set_blocksize and IO paths
> 
> With the new large sector size support, set_blocksize now needs to
> change i_blkbits and the folio order while there are no folios in the
> pagecache, because the geometry change causes problems for the
> bufferhead code.

Urrg.  I wish we could just get out of the game of messing with
block device inode settings from file systems.  I guess doing it when
using buffer_heads is hard, but file systems without buffer heads
should have a way out of even propagating their block size to the
block device inode.  And file systems with buffer heads should probably
not support large folios like this :P



* Re: Weird blockdev crash in 6.15-rc1?
  2025-04-10  7:40     ` Christoph Hellwig
@ 2025-04-10 15:25       ` Darrick J. Wong
  2025-04-11 20:39         ` Luis Chamberlain
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2025-04-10 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Matthew Wilcox, linux-block, linux-fsdevel, xfs

On Thu, Apr 10, 2025 at 12:40:15AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 09, 2025 at 12:09:07PM -0700, Darrick J. Wong wrote:
> > Subject: [PATCH] block: fix race between set_blocksize and IO paths
> > 
> > With the new large sector size support, set_blocksize now needs to
> > change i_blkbits and the folio order while there are no folios in the
> > pagecache, because the geometry change causes problems for the
> > bufferhead code.
> 
> Urrg.  I wish we could just get out of the game of messing with
> block device inode settings from file systems.  I guess doing it when
> using buffer_heads is hard, but file systems without buffer heads
> should have a way out of even propagating their block size to the
> block device inode.  And file systems with buffer heads should probably
> not support large folios like this :P

Heh.  Why does xfs still call set_blocksize, anyway?  I can understand
why we want to validate that the fs sector size is a power of 2, greater
than 512, and not smaller than the LBA size; and why we flush the dirty
bdev pagecache.  But do we really need to fiddle with i_blkbits or dump
the pagecache?

--D


* Re: Weird blockdev crash in 6.15-rc1?
  2025-04-10 15:25       ` Darrick J. Wong
@ 2025-04-11 20:39         ` Luis Chamberlain
  0 siblings, 0 replies; 6+ messages in thread
From: Luis Chamberlain @ 2025-04-11 20:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Matthew Wilcox, linux-block, linux-fsdevel,
	xfs

On Thu, Apr 10, 2025 at 08:25:54AM -0700, Darrick J. Wong wrote:
> Heh.  Why does xfs still call set_blocksize, anyway? 

Hrm. That's called from xfs_setsize_buftarg(), i.e. per buffer target,
and each target can use a different min order. We have logdev and rtdev.
I just tested, and indeed a 32k logical sector size drive can be used
for data alongside, for example, a 512k logical sector size logdev.

  Luis


end of thread, other threads:[~2025-04-11 20:39 UTC | newest]

Thread overview: 6+ messages
2025-04-08 17:51 Weird blockdev crash in 6.15-rc1? Darrick J. Wong
2025-04-09 17:30 ` Darrick J. Wong
2025-04-09 19:09   ` Darrick J. Wong
2025-04-10  7:40     ` Christoph Hellwig
2025-04-10 15:25       ` Darrick J. Wong
2025-04-11 20:39         ` Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox