* [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
@ 2025-10-07 3:10 Qu Wenruo
2025-10-07 3:52 ` Matthew Wilcox
2025-10-07 4:10 ` Christoph Hellwig
0 siblings, 2 replies; 8+ messages in thread
From: Qu Wenruo @ 2025-10-07 3:10 UTC (permalink / raw)
To: linux-btrfs; +Cc: brauner, djwong, linux-xfs, linux-fsdevel
[ASSERT TRIGGERED]
During my development of btrfs bs > ps support, I hit some read bios
that are submitted from iomap to btrfs, but are not aligned to fs block
size.
In my case the fs block size is 8K, the page size is 4K. The ASSERT()
looks like this:
assertion failed: IS_ALIGNED(logical, blocksize) && IS_ALIGNED(length, blocksize) && length != 0 :: 0, in fs/btrfs/bio.c:833 (root=256 inode=260 logical=299360256 length=69632)
------------[ cut here ]------------
kernel BUG at fs/btrfs/bio.c:833!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 1153 Comm: fsstress Tainted: G E 6.17.0-rc7-custom+ #291 PREEMPT(full) be3ff76d2e76a554af2cfea604366d16e719ba97
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_submit_bbio.cold+0x10c/0x127 [btrfs]
Call Trace:
<TASK>
iomap_dio_bio_iter+0x1d3/0x570
__iomap_dio_rw+0x547/0x8e0
iomap_dio_rw+0x12/0x30
btrfs_direct_read+0x135/0x220 [btrfs 24de5898492ba42c5e58573a6f20bf3c9894c726]
btrfs_file_read_iter+0x21/0x70 [btrfs 24de5898492ba42c5e58573a6f20bf3c9894c726]
vfs_read+0x25e/0x380
ksys_read+0x73/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Dumping ftrace buffer:
---------------------------------
fsstress-1153 6..... 68530us : iomap_dio_bio_iter: length=81920 nr_pages=20 enter
fsstress-1153 6..... 68539us : iomap_dio_bio_iter: length=81920 realsize=69632
fsstress-1153 6..... 68540us : iomap_dio_bio_iter: nr_pages=3 for next
---------------------------------
[CAUSE]
The function iomap_dio_bio_iter() is doing the bio assembly and
submission, and it's calling bio_iov_iter_get_pages().
However that function can split the range, and in my case it split the
original 20 pages range into two ranges, with 17 and 3 pages for each.
Then the 17 pages range is passed to btrfs_dio_submit_io(), which later
calls into assert_bbio_alignment() and triggered the ASSERT() on fs
block size check.
This check is critical as btrfs needs to verify the data checksum at
read time and retry other mirrors when necessary.
If a sub-block range is passed in the read-verification and read-repair
functionality is lost.
This is never a problem for btrfs in the past, just because we do not
have the support for bs > ps cases.
And this is also not a problem for fses using iomap, because there is no
data checksum support.
[ENHANCEMENT]
Just follow what bcachefs is doing, check the bio size and revert the bio
to the fs boundary.
If after revert the bio is empty, we have to error out because we can
not fault in enough pages to fill a fs block.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:
This change is forcing all fses using iomap to revert the iov iter each
time an unaligned range hit, even if the fs doesn't need to (e.g. no
data checksum requirement).
I'm not sure if the cost is acceptable or even necessary.
If the extra cost is not acceptable, I can add a new
iomap_dio_ops::need_strong_alignment() callback so that only btrfs will
do the revert.
fs/iomap/direct-io.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b84f6af2eb4c..f08babe7c83f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -419,6 +419,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
do {
size_t n;
+ size_t unaligned;
if (dio->error) {
iov_iter_revert(dio->submit.iter, copied);
copied = ret = 0;
@@ -444,9 +445,26 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
*/
bio_put(bio);
goto zero_tail;
+
}
+ /*
+ * bio_iov_iter_get_pages() can split the ranges at page boundary,
+ * if the fs has block size > page size and requires checksum,
+ * such unaligned bio will cause problems.
+ * Revert back to the fs block boundary.
+ */
+ unaligned = bio->bi_iter.bi_size & (fs_block_size - 1);
+ bio->bi_iter.bi_size -= unaligned;
+ iov_iter_revert(dio->submit.iter, unaligned);
n = bio->bi_iter.bi_size;
+
+ /* Failed to get any aligned range. */
+ if (unlikely(n == 0)) {
+ bio_put(bio);
+ ret = -EFAULT;
+ goto zero_tail;
+ }
if (WARN_ON_ONCE((bio_opf & REQ_ATOMIC) && n != length)) {
/*
* An atomic write bio must cover the complete length,
--
2.50.1
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 3:10 [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned Qu Wenruo
@ 2025-10-07 3:52 ` Matthew Wilcox
2025-10-07 4:05 ` Christoph Hellwig
2025-10-07 4:10 ` Christoph Hellwig
1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2025-10-07 3:52 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel
On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
> During my development of btrfs bs > ps support, I hit some read bios
> that are submitted from iomap to btrfs, but are not aligned to fs block
> size.
>
> In my case the fs block size is 8K, the page size is 4K. The ASSERT()
> looks like this:
Why isn't bdev_logical_block_size() set correctly by btrfs?
bio_iov_iter_get_bdev_pages:
return bio_iov_iter_get_pages_aligned(bio, iter,
bdev_logical_block_size(bdev) - 1);
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 3:52 ` Matthew Wilcox
@ 2025-10-07 4:05 ` Christoph Hellwig
2025-10-07 4:19 ` Matthew Wilcox
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2025-10-07 4:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Qu Wenruo, linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel
On Tue, Oct 07, 2025 at 04:52:09AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
> > During my development of btrfs bs > ps support, I hit some read bios
> > that are submitted from iomap to btrfs, but are not aligned to fs block
> > size.
> >
> > In my case the fs block size is 8K, the page size is 4K. The ASSERT()
> > looks like this:
>
> Why isn't bdev_logical_block_size() set correctly by btrfs?
bdev_logical_block_size is never set by the file system. It is the LBA
size of the underlying block device. But if the file system block size
is larger AND the file system needs to do file system block size
granularity operations that is not the correct boundary. See also the
iov_iter_alignment for always COW / zoned file system in
xfs_file_dio_write.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 4:05 ` Christoph Hellwig
@ 2025-10-07 4:19 ` Matthew Wilcox
2025-10-07 4:24 ` Christoph Hellwig
2025-10-07 4:26 ` Qu Wenruo
0 siblings, 2 replies; 8+ messages in thread
From: Matthew Wilcox @ 2025-10-07 4:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Qu Wenruo, linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel
On Mon, Oct 06, 2025 at 09:05:15PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 07, 2025 at 04:52:09AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
> > > During my development of btrfs bs > ps support, I hit some read bios
> > > that are submitted from iomap to btrfs, but are not aligned to fs block
> > > size.
> > >
> > > In my case the fs block size is 8K, the page size is 4K. The ASSERT()
> > > looks like this:
> >
> > Why isn't bdev_logical_block_size() set correctly by btrfs?
>
> bdev_logical_block_size is never set by the file system. It is the LBA
> size of the underlying block device. But if the file system block size
> is larger AND the file system needs to do file system block size
> granularity operations that is not the correct boundary. See also the
> iov_iter_alignment for always COW / zoned file system in
> xfs_file_dio_write.
But the case he's complaining about is bs>PS, so the LBA size really is
larger than PAGE_SIZE.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 4:19 ` Matthew Wilcox
@ 2025-10-07 4:24 ` Christoph Hellwig
2025-10-07 4:26 ` Qu Wenruo
1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-10-07 4:24 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, brauner, djwong,
linux-xfs, linux-fsdevel
On Tue, Oct 07, 2025 at 05:19:42AM +0100, Matthew Wilcox wrote:
> > bdev_logical_block_size is never set by the file system. It is the LBA
> > size of the underlying block device. But if the file system block size
> > is larger AND the file system needs to do file system block size
> > granularity operations that is not the correct boundary. See also the
> > iov_iter_alignment for always COW / zoned file system in
> > xfs_file_dio_write.
>
> But the case he's complaining about is bs>PS, so the LBA size really is
> larger than PAGE_SIZE.
The file system block size is, the LBA probably is not unless Qu is
running on not generally available prototype hardware.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 4:19 ` Matthew Wilcox
2025-10-07 4:24 ` Christoph Hellwig
@ 2025-10-07 4:26 ` Qu Wenruo
1 sibling, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2025-10-07 4:26 UTC (permalink / raw)
To: Matthew Wilcox, Christoph Hellwig
Cc: Qu Wenruo, linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel
在 2025/10/7 14:49, Matthew Wilcox 写道:
> On Mon, Oct 06, 2025 at 09:05:15PM -0700, Christoph Hellwig wrote:
>> On Tue, Oct 07, 2025 at 04:52:09AM +0100, Matthew Wilcox wrote:
>>> On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
>>>> During my development of btrfs bs > ps support, I hit some read bios
>>>> that are submitted from iomap to btrfs, but are not aligned to fs block
>>>> size.
>>>>
>>>> In my case the fs block size is 8K, the page size is 4K. The ASSERT()
>>>> looks like this:
>>>
>>> Why isn't bdev_logical_block_size() set correctly by btrfs?
>>
>> bdev_logical_block_size is never set by the file system. It is the LBA
>> size of the underlying block device. But if the file system block size
>> is larger AND the file system needs to do file system block size
>> granularity operations that is not the correct boundary. See also the
>> iov_iter_alignment for always COW / zoned file system in
>> xfs_file_dio_write.
>
> But the case he's complaining about is bs>PS, so the LBA size really is
> larger than PAGE_SIZE.
>
That bdev_logical_block_size() really looks like something bounded to
the block device, not the fs, thus I'm not sure if it's the best layer
to handle it.
Furthermore, there is something like RAIDZ down in the roadmap, where we
will have raid stripe length smaller than fs block size.
In that case we will need to explicitly distinguish the io size from the
bdev, and the fs block size.
Thanks,
Qu
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 3:10 [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned Qu Wenruo
2025-10-07 3:52 ` Matthew Wilcox
@ 2025-10-07 4:10 ` Christoph Hellwig
2025-10-07 4:29 ` Qu Wenruo
1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2025-10-07 4:10 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel,
Keith Busch
On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
> +++ b/fs/iomap/direct-io.c
> @@ -419,6 +419,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
> do {
> size_t n;
> + size_t unaligned;
> if (dio->error) {
Nit: please add an empty line after the ceclarations when touching this.
The existing version keeps irking me every time I get into this code.
> + /*
> + * bio_iov_iter_get_pages() can split the ranges at page boundary,
Overly long line.
> + * if the fs has block size > page size and requires checksum,
> + * such unaligned bio will cause problems.
> + * Revert back to the fs block boundary.
> + */
The comment here feels a bit too specific to your use case.
> + unaligned = bio->bi_iter.bi_size & (fs_block_size - 1);
> + bio->bi_iter.bi_size -= unaligned;
> + iov_iter_revert(dio->submit.iter, unaligned);
> n = bio->bi_iter.bi_size;
But more importantly I think this is the wrong layer. In Linus'
current tree, Keith added bio_iov_iter_get_pages_aligned which can pass
in the expected alignment. We should use that, and figure out a way
to make it conditional and not require the stricter alignment for
everyone, i.e. by adding a flag passes through the dio_flags argument
to iomap_dio_rw.
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned
2025-10-07 4:10 ` Christoph Hellwig
@ 2025-10-07 4:29 ` Qu Wenruo
0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2025-10-07 4:29 UTC (permalink / raw)
To: Christoph Hellwig, Qu Wenruo
Cc: linux-btrfs, brauner, djwong, linux-xfs, linux-fsdevel,
Keith Busch
在 2025/10/7 14:40, Christoph Hellwig 写道:
> On Tue, Oct 07, 2025 at 01:40:22PM +1030, Qu Wenruo wrote:
>> +++ b/fs/iomap/direct-io.c
>> @@ -419,6 +419,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>> nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>> do {
>> size_t n;
>> + size_t unaligned;
>> if (dio->error) {
>
> Nit: please add an empty line after the ceclarations when touching this.
> The existing version keeps irking me every time I get into this code.
>
>> + /*
>> + * bio_iov_iter_get_pages() can split the ranges at page boundary,
>
> Overly long line.
>
>> + * if the fs has block size > page size and requires checksum,
>> + * such unaligned bio will cause problems.
>> + * Revert back to the fs block boundary.
>> + */
>
> The comment here feels a bit too specific to your use case.
>
>> + unaligned = bio->bi_iter.bi_size & (fs_block_size - 1);
>> + bio->bi_iter.bi_size -= unaligned;
>> + iov_iter_revert(dio->submit.iter, unaligned);
>> n = bio->bi_iter.bi_size;
>
> But more importantly I think this is the wrong layer. In Linus'
> current tree, Keith added bio_iov_iter_get_pages_aligned which can pass
> in the expected alignment. We should use that, and figure out a way
> to make it conditional and not require the stricter alignment for
> everyone, i.e. by adding a flag passes through the dio_flags argument
> to iomap_dio_rw.
Thanks a lot, the newer helper is exactly what I need, and unfortunately
it's not yet in btrfs tree. I'll rebase my patches and utilize the new
helper.
The DIO flag solution sounds pretty good to me. Will go that path.
Thanks a lot for all the hints,
Qu
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-10-07 4:29 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-07 3:10 [PATCH RFC] iomap: ensure iomap_dio_bio_iter() only submit bios that are fs block aligned Qu Wenruo
2025-10-07 3:52 ` Matthew Wilcox
2025-10-07 4:05 ` Christoph Hellwig
2025-10-07 4:19 ` Matthew Wilcox
2025-10-07 4:24 ` Christoph Hellwig
2025-10-07 4:26 ` Qu Wenruo
2025-10-07 4:10 ` Christoph Hellwig
2025-10-07 4:29 ` Qu Wenruo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).