From: "Darrick J. Wong" <djwong@kernel.org>
To: Luis Chamberlain <mcgrof@kernel.org>,
Matthew Wilcox <willy@infradead.org>
Cc: linux-block <linux-block@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
xfs <linux-xfs@vger.kernel.org>
Subject: Re: Weird blockdev crash in 6.15-rc1?
Date: Wed, 9 Apr 2025 10:30:15 -0700
Message-ID: <20250409173015.GN6266@frogsfrogsfrogs>
In-Reply-To: <20250408175125.GL6266@frogsfrogsfrogs>
On Tue, Apr 08, 2025 at 10:51:25AM -0700, Darrick J. Wong wrote:
> Hi everyone,
>
> I saw the following crash in 6.15-rc1 when running xfs/032 from fstests
> for-next. I don't see it in 6.14. I'll try to bisect, but in the
> meantime does this look familiar to anyone? The XFS configuration is
> pretty boring:
>
> MKFS_OPTIONS="-m autofsck=1, -n size=8192"
> MOUNT_OPTIONS="-o uquota,gquota,pquota"
>
> (4k fsblocks, x64 host, directory blocks are 8k)
>
> From the stack trace, it looks like the null pointer dereference is in
> this call to bdev_nr_sectors:
>
> void guard_bio_eod(struct bio *bio)
> {
> 	sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
>
> because bio->bi_bdev is NULL for some reason. The crash itself seems to
> be from do_mpage_readpage around line 304:
>
> alloc_new:
> 	if (args->bio == NULL) {
> 		args->bio = bio_alloc(bdev, bio_max_segs(args->nr_pages), opf,
> 				      gfp);
>
> bdev is NULL here ^^^^
>
> 		if (args->bio == NULL)
> 			goto confused;
> 		args->bio->bi_iter.bi_sector = first_block << (blkbits - 9);
> 	}
>
> 	length = first_hole << blkbits;
> 	if (!bio_add_folio(args->bio, folio, length, 0)) {
> 		args->bio = mpage_bio_submit_read(args->bio);
> 		goto alloc_new;
> 	}
>
> 	relative_block = block_in_file - args->first_logical_block;
> 	nblocks = map_bh->b_size >> blkbits;
> 	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
> 	    (first_hole != blocks_per_folio))
> 		args->bio = mpage_bio_submit_read(args->bio);
>
> My guess is that there was no previous call to ->get_block and that
> blocks_per_folio == 0, so nobody ever actually set the local @bdev
> variable to a non-NULL value. blocks_per_folio is perhaps zero because
> xfs/032 tried formatting with a sector size of 64k, which causes the
> bdev inode->i_blkbits to be set to 16, but for some reason we got a
> folio that wasn't 64k in size:
>
> const unsigned blkbits = inode->i_blkbits;
> const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>
> <shrug> That's just my conjecture for now.
Ok so overnight my debugging patch confirmed this hypothesis:
XFS (sda4): Mounting V5 Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): Unmounting Filesystem 8cf3c461-57b0-4bba-86ab-6dc13b8cdab0
FARK bio_alloc with NULL bdev?! blkbits 13 fsize 4096 blocks_per_folio 0
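Spelling out the arithmetic in that last line, as a userspace sketch
rather than kernel code (the helper name is made up for illustration):
a 4096-byte folio shifted right by blkbits == 13 yields zero blocks per
folio, which fits the guess that ->get_block was never called and so
bdev was never set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the computation in do_mpage_readpage():
 *     blocks_per_folio = folio_size(folio) >> blkbits;
 * The function name is hypothetical; only the shift comes from the kernel.
 */
static unsigned int blocks_per_folio(size_t folio_bytes, unsigned int blkbits)
{
	/* Keep the shift well defined for any size_t width. */
	assert(blkbits < 8 * sizeof(size_t));
	return (unsigned int)(folio_bytes >> blkbits);
}
```

With the values from the debug message, blocks_per_folio(4096, 13) is 0,
whereas an 8192-byte folio with the same blkbits would give 1.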
willy told me to set CONFIG_DEBUG_VM=y and rerun xfs/032. That
didn't turn anything up, so I decided to race it with:
while sleep 0.1; do blkid -c /dev/null; done
to simulate udev calling libblkid. That produced a debugging assertion
within 40 seconds:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x4f3bc4 pfn:0x43da4
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff8880446b4800
flags: 0x4fff80000000041(locked|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
raw: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000041 0000000000000000 dead000000000122 0000000000000000
head: 00000000004f3bc4 0000000000000000 00000001ffffffff ffff8880446b4800
head: 04fff80000000201 ffffea00010f6901 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000002
page dumped because: VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1))
------------[ cut here ]------------
kernel BUG at mm/filemap.c:871!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 3 UID: 0 PID: 26689 Comm: (udev-worker) Not tainted 6.15.0-rc1-djwx #rc1 PREEMPT(lazy) 8c302df0300eabbbd3cdc47fd812690b8d635c39
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS: 00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0
Call Trace:
<TASK>
? memcg_list_lru_alloc+0x2d0/0x2d0
filemap_add_folio+0x7f/0xd0
page_cache_ra_unbounded+0x147/0x260
force_page_cache_ra+0x92/0xb0
filemap_get_pages+0x13b/0x7b0
? current_time+0x3b/0x110
filemap_read+0x106/0x4c0
? _raw_spin_unlock+0x14/0x30
blkdev_read_iter+0x64/0x120
vfs_read+0x290/0x390
ksys_read+0x6f/0xe0
do_syscall_64+0x47/0x100
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f886428025d
Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6 53 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 81 23 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f c
RSP: 002b:00007fff5ce76228 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055cf45839640 RCX: 00007f886428025d
RDX: 0000000000040000 RSI: 000055cf45996908 RDI: 000000000000000f
RBP: 00000004f3b80000 R08: 00007f886435add0 R09: 00007f886435add0
R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf459968e0
R13: 0000000000040000 R14: 000055cf45839698 R15: 000055cf459968f8
</TASK>
Modules linked in: xfs ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables nfnet]
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace 0000000000000000 ]---
RIP: 0010:__filemap_add_folio+0x4ae/0x540
Code: 40 49 89 d4 0f b6 c1 49 d3 ec 81 e1 c0 00 00 00 0f 84 e0 fb ff ff e9 92 b6 d3 ff 48 c7 c6 68 57 ec 81 4c 89 ef e8 82 6e 05 00 <0f> 0b 49 89 d4 e9 c2 fb ff ff 48 c7 c6 9
RSP: 0018:ffffc900016e3a70 EFLAGS: 00010246
RAX: 0000000000000049 RBX: 0000000000112cc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: 205d313431343737
R10: 0000000000000729 R11: 6d75642065676170 R12: 00000000004f3ba8
R13: ffffea00010f6900 R14: ffff88804076a530 R15: ffff88804076a530
FS: 00007f8863b788c0(0000) GS:ffff8880fb952000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cf459d5000 CR3: 000000000d96f003 CR4: 00000000001706f0
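For reference, the condition in that VM_BUG_ON_FOLIO is a natural
alignment check: a folio of 2^order pages has to be inserted at an index
that is a multiple of 2^order. A userspace sketch of just that check,
with hypothetical helper names and made-up indices (not the values from
the dump above):

```c
#include <assert.h>
#include <stdbool.h>

/* Number of pages in a folio of the given order: 1 << order. */
static unsigned long nr_pages_for_order(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return 1UL << order;
}

/*
 * Mirrors the asserted condition, index & (folio_nr_pages(folio) - 1).
 * Returns true when the assertion would fire, i.e. when the index is not
 * a multiple of the folio's page count.
 */
static bool index_misaligned(unsigned long index, unsigned int order)
{
	return (index & (nr_pages_for_order(order) - 1)) != 0;
}
```

An order-1 folio is fine at index 0x10 but trips the check at 0x11; an
order-0 folio is fine at any index.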
Digging into the VM, I noticed that mount is stuck in D state:
/proc/44312/task/44312/stack :
[<0>] folio_wait_bit_common+0x144/0x350
[<0>] truncate_inode_pages_range+0x4df/0x5b0
[<0>] set_blocksize+0x10b/0x130
[<0>] xfs_setsize_buftarg+0x1f/0x50 [xfs]
[<0>] xfs_setup_devices+0x1a/0xc0 [xfs]
[<0>] xfs_fs_fill_super+0x423/0xb20 [xfs]
[<0>] get_tree_bdev_flags+0x132/0x1d0
[<0>] vfs_get_tree+0x17/0xa0
[<0>] path_mount+0x721/0xa90
[<0>] __x64_sys_mount+0x10c/0x140
[<0>] do_syscall_64+0x47/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Regrettably the udev worker is gone, but my guess is that the process
exited with the folio locked, so now truncate_inode_pages_range can't
lock it to get rid of it.
Then it occurred to me to look at set_blocksize again:
	/* Don't change the size if it is same as current */
	if (inode->i_blkbits != blksize_bits(size)) {
		sync_blockdev(bdev);
		inode->i_blkbits = blksize_bits(size);
		mapping_set_folio_order_range(inode->i_mapping,
				get_order(size), get_order(size));
		kill_bdev(bdev);
	}
(Note that I changed mapping_set_folio_min_order here to
mapping_set_folio_order_range to shut up a folio migration bug that I
reported elsewhere on fsdevel yesterday, and willy suggested forcing the
max order as a temporary workaround.)
The update of i_blkbits and the order bits of mapping->flags are
performed before kill_bdev truncates the pagecache, which means there's
a window where there can be a !uptodate order-0 folio in the pagecache
but i_blkbits > PAGE_SHIFT (in this case, 13). The debugging assertion
above is from someone trying to install a too-small folio into the
pagecache. I think the "FARK" message I captured overnight is from
readahead trying to bring in contents from disk for this too-small folio
and failing.
So I think the answer is that set_blocksize needs to lock out folio_add,
flush the dirty folios, invalidate the entire bdev pagecache, set
i_blkbits and the folio order, and only then allow new additions to the
pagecache.
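To make that ordering concrete, here is a single-threaded userspace toy
model of it. This is not a patch: every name in it is made up, the
"adds_locked_out" flag is only a placeholder for whatever lock would
really be used, and 4k pages are assumed. All it shows is that with this
ordering the cache never holds a folio smaller than the current minimum
order.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4k pages */

/* Toy stand-in for the bdev inode and its pagecache; all names hypothetical. */
struct bdev_model {
	unsigned int blkbits;		/* models inode->i_blkbits */
	unsigned int min_order;		/* models the mapping's min folio order */
	unsigned int nr_folios;		/* folios currently cached */
	unsigned int smallest_order;	/* valid only if nr_folios > 0 */
	bool adds_locked_out;		/* placeholder for the real lock */
};

/* Wanted invariant: no cached folio is smaller than the minimum order. */
static bool model_consistent(const struct bdev_model *m)
{
	return m->nr_folios == 0 || m->smallest_order >= m->min_order;
}

/* Models a pagecache insertion; returns false if it is refused. */
static bool model_add_folio(struct bdev_model *m, unsigned int order)
{
	if (m->adds_locked_out || order < m->min_order)
		return false;
	if (m->nr_folios == 0 || order < m->smallest_order)
		m->smallest_order = order;
	m->nr_folios++;
	return true;
}

static void model_set_blocksize(struct bdev_model *m, unsigned int new_blkbits)
{
	if (m->blkbits == new_blkbits)
		return;

	m->adds_locked_out = true;	/* 1. lock out new additions */
	/* 2. flush dirty folios: nothing to do in this model */
	m->nr_folios = 0;		/* 3. invalidate the whole pagecache */
	assert(model_consistent(m));

	m->blkbits = new_blkbits;	/* 4. only now change the geometry */
	m->min_order = new_blkbits > MODEL_PAGE_SHIFT ?
			new_blkbits - MODEL_PAGE_SHIFT : 0;
	assert(model_consistent(m));

	m->adds_locked_out = false;	/* 5. allow additions again */
}
```

Starting from blkbits 12 with one order-0 folio cached, calling
model_set_blocksize(&m, 13) leaves an empty cache with min_order 1, after
which an order-0 insertion is refused and an order-1 insertion succeeds.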
But then, which lock(s)? Were this a file on XFS, I'd say that one has
to take i_rwsem and mmap_invalidate_lock before truncating the
pagecache, but by my recollection block devices don't take either lock
in their IO paths.
--D