linux-fsdevel.vger.kernel.org archive mirror
* Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases)
@ 2025-10-01  1:29 Qu Wenruo
  2025-10-06 15:07 ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-10-01  1:29 UTC (permalink / raw)
  To: linux-fsdevel@vger.kernel.org, linux-btrfs

Hi,

While enabling direct IO for btrfs with block size > page size (bs > ps),
I'm hitting a case where:

- The direct IO iov is properly aligned to the fs block size (8K, 2 pages)
   The pages do not need to be backed by a large folio; regular
   non-contiguous pages are supported.

- Btrfs can now handle sub-block pages
   But it still requires bi_size and (bi_sector << 9) to be block size
   aligned.

- The bio passed into iomap_dio_ops::submit_io is not block size
   aligned
   The bio contains only one page, not 2.

   This makes things like checksum verification impossible.
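
For illustration, the alignment requirement above can be sketched in
userspace C. This is only a sketch, not btrfs code: struct bio_range and
bio_range_block_aligned() are hypothetical names standing in for the
bi_sector/bi_size fields mentioned above, and a power-of-two block size
is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, as in (bi_sector << 9) */

/* Hypothetical stand-in for the two bio fields the requirement is about. */
struct bio_range {
	uint64_t sector; /* start, in 512-byte sectors (like bi_sector) */
	uint32_t size;   /* length in bytes (like bi_size) */
};

/* True if both the byte offset and the length are multiples of blocksize. */
static bool bio_range_block_aligned(const struct bio_range *r,
				    uint32_t blocksize)
{
	uint64_t start = r->sector << SECTOR_SHIFT;

	/* The mask trick below only works for power-of-two block sizes. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	return ((start | r->size) & (blocksize - 1)) == 0;
}
```

With an 8K block size, a range starting at sector 16 with size 8192
passes, while a one-page (4096 byte) range like the bio described above
fails.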


This can be worked around by falling back from direct IO reads to
buffered IO on inodes with checksums.

However, the fallback itself is very slow (around 1/5 of the storage
speed, something we will need to address in the future), so I'm still
trying to implement true zero-copy direct reads where possible.


Is there any way to force a minimum number of pages for
iomap_dio_bio_iter()?

Thanks,
Qu


* Re: Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases)
  2025-10-01  1:29 Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases) Qu Wenruo
@ 2025-10-06 15:07 ` Matthew Wilcox
  2025-10-07  2:30   ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-10-06 15:07 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, Christian Brauner,
	Darrick J. Wong, Christoph Hellwig

On Wed, Oct 01, 2025 at 10:59:18AM +0930, Qu Wenruo wrote:
> Recently during the btrfs bs > ps direct IO enablement, I'm hitting a case
> where:
> 
> - The direct IO iov is properly aligned to fs block size (8K, 2 pages)
>   They do not need to be large folio backed, regular incontiguous pages
>   are supported.
> 
> - The btrfs now can handle sub-block pages
>   But still require the bi_size and (bi_sector << 9) to be block size
>   aligned.
> 
> - The bio passed into iomap_dio_ops::submit_io is not block size
>   aligned
>   The bio only contains one page, not 2.

That seems like a bug in the VFS/iomap somewhere.  Maybe try cc'ing the
people who know this code?


* Re: Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases)
  2025-10-06 15:07 ` Matthew Wilcox
@ 2025-10-07  2:30   ` Qu Wenruo
  2025-10-07 14:58     ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-10-07  2:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, Christian Brauner,
	Darrick J. Wong, Christoph Hellwig, linux-bcachefs



On 2025/10/7 01:37, Matthew Wilcox wrote:
> On Wed, Oct 01, 2025 at 10:59:18AM +0930, Qu Wenruo wrote:
>> Recently during the btrfs bs > ps direct IO enablement, I'm hitting a case
>> where:
>>
>> - The direct IO iov is properly aligned to fs block size (8K, 2 pages)
>>    They do not need to be large folio backed, regular incontiguous pages
>>    are supported.
>>
>> - The btrfs now can handle sub-block pages
>>    But still require the bi_size and (bi_sector << 9) to be block size
>>    aligned.
>>
>> - The bio passed into iomap_dio_ops::submit_io is not block size
>>    aligned
>>    The bio only contains one page, not 2.
> 
> That seems like a bug in the VFS/iomap somewhere.  Maybe try cc'ing the
> people who know this code?
> 

Adding the xfs and bcachefs subsystems to CC.

The root cause is that __bio_iov_iter_get_pages() can split the iov.

In my case, I hit the following dio in iomap_dio_bio_iter():

  fsstress-1153      6..... 68530us : iomap_dio_bio_iter: length=81920 nr_pages=20 enter
  fsstress-1153      6..... 68539us : iomap_dio_bio_iter: length=81920 realsize=69632(17 pages)
  fsstress-1153      6..... 68540us : iomap_dio_bio_iter: nr_pages=3 for next

Here bio_iov_iter_get_pages() split the 20 pages into two segments (17 +
3 pages).
That 17/3 split does not meet btrfs' block size requirement (in my
case it's 8K block size).
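
The numbers in the trace can be checked with a small helper. This is a
minimal sketch, assuming 4K pages (81920 / 20) and a power-of-two block
size; block_tail() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes left over past the last full block in a chunk of `bytes` bytes. */
static uint32_t block_tail(uint32_t bytes, uint32_t blocksize)
{
	/* Masking with (blocksize - 1) requires a power-of-two block size. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	return bytes & (blocksize - 1);
}
```

For the traced values, block_tail(69632, 8192) is 4096: the first
segment ends one page into an 8K block, so the second segment starts in
the middle of that block. The full 81920-byte request has no tail.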


I see XFS has a comment related to bio_iov_iter_get_pages() inside
xfs_file_dio_write(), but there are no special checks other than the
iov_iter_alignment() check, which btrfs also does.

I guess that since XFS does not need to deal with data checksums, such
a split is not a big deal?


Bcachefs, on the other hand, reverts to the block boundary, which
solves the problem.
However, btrfs uses iomap for direct IO, so we cannot manually
revert the iov/bio inside btrfs alone.

So I guess in this case we need to add a callback for iomap, to get the 
fs block size so that at least iomap_dio_bio_iter() can revert to the fs 
block boundary?
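
To make the idea concrete, here is a rough userspace model of reverting
to the block boundary. It is not iomap or bcachefs code:
split_block_aligned() and its parameters are made up for illustration,
max_chunk stands in for however many bytes one page-pinning pass happens
to return, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split `total` bytes into chunks of at most `max_chunk` bytes, trimming
 * every chunk except the final one down to a multiple of `blocksize`.
 * Stores up to `cap` chunk sizes in `out` and returns how many were stored.
 * Returns 0 if a non-final chunk cannot hold even one full block.
 */
static size_t split_block_aligned(uint32_t total, uint32_t max_chunk,
				  uint32_t blocksize, uint32_t *out,
				  size_t cap)
{
	size_t n = 0;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	while (total > 0 && n < cap) {
		uint32_t chunk = total < max_chunk ? total : max_chunk;

		/* "Revert" the bytes that spill past the last full block. */
		if (chunk < total)
			chunk &= ~(blocksize - 1);
		if (chunk == 0)
			return 0;
		out[n++] = chunk;
		total -= chunk;
	}
	return n;
}
```

With total=81920, max_chunk=69632 and blocksize=8192, this model yields
chunks of 65536 and 16384 bytes (16 + 4 pages) instead of the 17 + 3
pages seen in the trace.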

Thanks,
Qu


* Re: Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases)
  2025-10-07  2:30   ` Qu Wenruo
@ 2025-10-07 14:58     ` Darrick J. Wong
  2025-10-07 21:28       ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2025-10-07 14:58 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-btrfs,
	Christian Brauner, Christoph Hellwig, linux-bcachefs

On Tue, Oct 07, 2025 at 01:00:58PM +1030, Qu Wenruo wrote:
> 
> 
> On 2025/10/7 01:37, Matthew Wilcox wrote:
> > On Wed, Oct 01, 2025 at 10:59:18AM +0930, Qu Wenruo wrote:
> > > Recently during the btrfs bs > ps direct IO enablement, I'm hitting a case
> > > where:
> > > 
> > > - The direct IO iov is properly aligned to fs block size (8K, 2 pages)
> > >    They do not need to be large folio backed, regular incontiguous pages
> > >    are supported.
> > > 
> > > - The btrfs now can handle sub-block pages
> > >    But still require the bi_size and (bi_sector << 9) to be block size
> > >    aligned.
> > > 
> > > - The bio passed into iomap_dio_ops::submit_io is not block size
> > >    aligned
> > >    The bio only contains one page, not 2.
> > 
> > That seems like a bug in the VFS/iomap somewhere.  Maybe try cc'ing the
> > people who know this code?
> > 
> 
> Add xfs and bcachefs subsystem into CC.
> 
> The root cause is that, function __bio_iov_iter_get_pages() can split the
> iov.
> 
> In my case, I hit the following dio during iomap_dio_bio_iter();
> 
>  fsstress-1153      6..... 68530us : iomap_dio_bio_iter: length=81920
> nr_pages=20 enter
>  fsstress-1153      6..... 68539us : iomap_dio_bio_iter: length=81920
> realsize=69632(17 pages)
>  fsstress-1153      6..... 68540us : iomap_dio_bio_iter: nr_pages=3 for next
> 
> Which bio_iov_iter_get_pages() split the 20 pages into two segments (17 + 3
> pages).
> That 17/3 split is not meeting the btrfs' block size requirement (in my case
> it's 8K block size).

Just out of curiosity, what are the corresponding
iomap_iter_{src,dst}map tracepoints for these iomap_dio_bio_iters?

I'm assuming there's one mapping for all 80k of data?

> I'm seeing XFS having a comment related to bio_iov_iter_get_pages() inside
> xfs_file_dio_write(), but there is no special checks other than
> iov_iter_alignment() check, which btrfs is also doing.
> 
> I guess since XFS do not need to bother data checksum thus such split is not
> a big deal?

I think so too.  The bios all point to the original iomap_dio so the
ioend only gets called once for the full write IO, so a completion
of an out of place write will never see sub-block ranges.

> On the other hand, bcachefs is doing reverting to the block boundary instead
> thus solved the problem.
> However btrfs is using iomap for direct IOs, thus we can not manually revert
> the iov/bio just inside btrfs.
> 
> So I guess in this case we need to add a callback for iomap, to get the fs
> block size so that at least iomap_dio_bio_iter() can revert to the fs block
> boundary?

Or add a flags bit to iomap_dio_ops to indicate that the fs requires
block sized bios?

I'm guessing that you can't do sub-block directio writes to btrfs
either?

--D

> Thanks,
> Qu
> 


* Re: Direct IO reads being split unexpected at page boundary, but in the middle of a fs block (bs > ps cases)
  2025-10-07 14:58     ` Darrick J. Wong
@ 2025-10-07 21:28       ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2025-10-07 21:28 UTC (permalink / raw)
  To: Darrick J. Wong, Qu Wenruo
  Cc: Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-btrfs,
	Christian Brauner, Christoph Hellwig, linux-bcachefs



On 2025/10/8 01:28, Darrick J. Wong wrote:
> On Tue, Oct 07, 2025 at 01:00:58PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2025/10/7 01:37, Matthew Wilcox wrote:
>>> On Wed, Oct 01, 2025 at 10:59:18AM +0930, Qu Wenruo wrote:
>>>> Recently during the btrfs bs > ps direct IO enablement, I'm hitting a case
>>>> where:
>>>>
>>>> - The direct IO iov is properly aligned to fs block size (8K, 2 pages)
>>>>     They do not need to be large folio backed, regular incontiguous pages
>>>>     are supported.
>>>>
>>>> - The btrfs now can handle sub-block pages
>>>>     But still require the bi_size and (bi_sector << 9) to be block size
>>>>     aligned.
>>>>
>>>> - The bio passed into iomap_dio_ops::submit_io is not block size
>>>>     aligned
>>>>     The bio only contains one page, not 2.
>>>
>>> That seems like a bug in the VFS/iomap somewhere.  Maybe try cc'ing the
>>> people who know this code?
>>>
>>
>> Add xfs and bcachefs subsystem into CC.
>>
>> The root cause is that, function __bio_iov_iter_get_pages() can split the
>> iov.
>>
>> In my case, I hit the following dio during iomap_dio_bio_iter();
>>
>>   fsstress-1153      6..... 68530us : iomap_dio_bio_iter: length=81920
>> nr_pages=20 enter
>>   fsstress-1153      6..... 68539us : iomap_dio_bio_iter: length=81920
>> realsize=69632(17 pages)
>>   fsstress-1153      6..... 68540us : iomap_dio_bio_iter: nr_pages=3 for next
>>
>> Which bio_iov_iter_get_pages() split the 20 pages into two segments (17 + 3
>> pages).
>> That 17/3 split is not meeting the btrfs' block size requirement (in my case
>> it's 8K block size).
> 
> Just out of curiosity, what are the corresponding
> iomap_iter_{src,dst}map tracepoints for these iomap_dio_bio_iters?

None, those are ad hoc trace_printk()s that I added.

> 
> I'm assuming there's one mapping for all 80k of data?
> 
>> I'm seeing XFS having a comment related to bio_iov_iter_get_pages() inside
>> xfs_file_dio_write(), but there is no special checks other than
>> iov_iter_alignment() check, which btrfs is also doing.
>>
>> I guess since XFS do not need to bother data checksum thus such split is not
>> a big deal?
> 
> I think so too.  The bios all point to the original iomap_dio so the
> ioend only gets called once for the full write IO, so a completion
> of an out of place write will never see sub-block ranges.
> 
>> On the other hand, bcachefs is doing reverting to the block boundary instead
>> thus solved the problem.
>> However btrfs is using iomap for direct IOs, thus we can not manually revert
>> the iov/bio just inside btrfs.
>>
>> So I guess in this case we need to add a callback for iomap, to get the fs
>> block size so that at least iomap_dio_bio_iter() can revert to the fs block
>> boundary?
> 
> Or add a flags bit to iomap_dio_ops to indicate that the fs requires
> block sized bios?

Yep, that's the next step.

> 
> I'm guessing that you can't do sub-block directio writes to btrfs
> either?

Exactly.

Thanks,
Qu

> 
> --D
> 
>> Thanks,
>> Qu
>>
> 


