O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
       [not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
@ 2025-10-20 10:00 ` Christoph Hellwig
  2025-10-20 10:24   ` Qu Wenruo
                     ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-20 10:00 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> There is a bug report about that direct IO (and even concurrent buffered
> IO) can lead to different contents of md-raid.

What concurrent buffered I/O?

> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
> ("btrfs: always fallback to buffered write if the inode requires
> checksum"), however we still leave a hole for nodatasum cases.
> 
> For nodatasum cases we still reuse the bio from direct IO, making it to
> cause the same problem for RAID1*/5/6 profiles, and results
> unreliable data contents read from disk, depending on the load balance.
> 
> Just do not trust any bio from direct IO, and never reuse those bios even
> for nodatasum cases. Instead alloc our own bio with newly allocated
> pages.
> 
> For direct read, submit that new bio, and at end io time copy the
> contents to the dio bio.
> For direct write, copy the contents from the dio bio, then submit the
> new one.

This basically reinvents IOCB_DONTCACHE I/O with duplicate code?

> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
> modifying the page cache when it's still under writeback) can lead to
> raid mirror contents mismatch, the 23% performance drop should still be
> acceptable, and bcachefs is already doing this bouncing behavior.

XFS (and EXT4 as well, but I've not tested it) wait for I/O to
finish before allowing modifications when mapping_stable_writes returns
true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
is fine.  Direct I/O is broken, and at least for XFS I have patches
to force DONTCACHE instead of DIRECT I/O by default in that case, but
allowing for an opt-out for known applications (e.g. file or storage
servers).

I'll need to rebase them, but I plan to send them out soon together
with other T10 PI enabling patches.  Sorry, juggling a few too many
things at the moment.

> But still, such performance drop can be very obvious, and performance
> oriented users (who are very happy running various benchmark tools) are
> going to notice or even complain.

I've unfortunately seen much bigger performance drops with direct I/O and
PI on fast SSDs, but we still should be safe by default.

> Another question is, should we push this behavior to iomap layer so that other
> fses can also benefit from it?

The right place is above iomap to pick the buffered I/O path instead.

The real question is if we can finally get a version of pin_user_pages
that prevents user modifications entirely.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
@ 2025-10-20 10:24   ` Qu Wenruo
  2025-10-20 11:45     ` Christoph Hellwig
  2025-10-20 11:16   ` Jan Kara
  2025-10-21  3:17   ` Qu Wenruo
  2 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-10-20 10:24 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack



在 2025/10/20 20:30, Christoph Hellwig 写道:
> On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
>> There is a bug report about that direct IO (and even concurrent buffered
>> IO) can lead to different contents of md-raid.
> 
> What concurrent buffered I/O?

filemap_get_folio(), for address spaces with STABEL_WRITES, there will 
be a folio_wait_stable() call to wait for writeback.

But since almost no device (except md-raid56) set that flag, if a folio 
is still under writeback, XFS/EXT4 can still modify that folio (since 
it's not locked, just under writeback) for new incoming buffered writes.

> 
>> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
>> ("btrfs: always fallback to buffered write if the inode requires
>> checksum"), however we still leave a hole for nodatasum cases.
>>
>> For nodatasum cases we still reuse the bio from direct IO, making it to
>> cause the same problem for RAID1*/5/6 profiles, and results
>> unreliable data contents read from disk, depending on the load balance.
>>
>> Just do not trust any bio from direct IO, and never reuse those bios even
>> for nodatasum cases. Instead alloc our own bio with newly allocated
>> pages.
>>
>> For direct read, submit that new bio, and at end io time copy the
>> contents to the dio bio.
>> For direct write, copy the contents from the dio bio, then submit the
>> new one.
> 
> This basically reinvents IOCB_DONTCACHE I/O with duplicate code?

This reminds me the problem that btrfs can not handle DONTCACHE due to 
its async extents...

I definitely need to address it one day.

> 
>> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
>> modifying the page cache when it's still under writeback) can lead to
>> raid mirror contents mismatch, the 23% performance drop should still be
>> acceptable, and bcachefs is already doing this bouncing behavior.
> 
> XFS (and EXT4 as well, but I've not tested it) wait for I/O to
> finish before allowing modifications when mapping_stable_writes returns
> true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
> is fine.

But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for 
write to finish.

Wouldn't that cause two mirrors to differ from each other due to timing 
difference?

>  Direct I/O is broken, and at least for XFS I have patches
> to force DONTCACHE instead of DIRECT I/O by default in that case, but
> allowing for an opt-out for known applications (e.g. file or storage
> servers).
> 
> I'll need to rebase them, but I plan to send them out soon together
> with other T10 PI enabling patches.  Sorry, juggling a few too many
> things at the moment.
> 
>> But still, such performance drop can be very obvious, and performance
>> oriented users (who are very happy running various benchmark tools) are
>> going to notice or even complain.
> 
> I've unfortunately seen much bigger performance drops with direct I/O and
> PI on fast SSDs, but we still should be safe by default.
> 
>> Another question is, should we push this behavior to iomap layer so that other
>> fses can also benefit from it?
> 
> The right place is above iomap to pick the buffered I/O path instead.

But falling back to buffered IO performance is so miserable that wiped 
out almost one or more decades of storage performance improvement.

Thanks,
Qu

> 
> The real question is if we can finally get a version of pin_user_pages
> that prevents user modifications entirely.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 10:24   ` Qu Wenruo
@ 2025-10-20 11:45     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-20 11:45 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On Mon, Oct 20, 2025 at 08:54:49PM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/10/20 20:30, Christoph Hellwig 写道:
> > On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> > > There is a bug report about that direct IO (and even concurrent buffered
> > > IO) can lead to different contents of md-raid.
> > 
> > What concurrent buffered I/O?
> 
> filemap_get_folio(), for address spaces with STABEL_WRITES, there will be a
> folio_wait_stable() call to wait for writeback.
> 
> But since almost no device (except md-raid56) set that flag, if a folio is
> still under writeback, XFS/EXT4 can still modify that folio (since it's not
> locked, just under writeback) for new incoming buffered writes.

All devices with protection information set it.  Now for raid1 without
the bitmap there is no expectation that anything is in sync.  Raid 1
with a bitmap should probably set the flag as well.

> 
> > 
> > > It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
> > > ("btrfs: always fallback to buffered write if the inode requires
> > > checksum"), however we still leave a hole for nodatasum cases.
> > > 
> > > For nodatasum cases we still reuse the bio from direct IO, making it to
> > > cause the same problem for RAID1*/5/6 profiles, and results
> > > unreliable data contents read from disk, depending on the load balance.
> > > 
> > > Just do not trust any bio from direct IO, and never reuse those bios even
> > > for nodatasum cases. Instead alloc our own bio with newly allocated
> > > pages.
> > > 
> > > For direct read, submit that new bio, and at end io time copy the
> > > contents to the dio bio.
> > > For direct write, copy the contents from the dio bio, then submit the
> > > new one.
> > 
> > This basically reinvents IOCB_DONTCACHE I/O with duplicate code?
> 
> This reminds me the problem that btrfs can not handle DONTCACHE due to its
> async extents...
> 
> I definitely need to address it one day.
> 
> > 
> > > Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
> > > modifying the page cache when it's still under writeback) can lead to
> > > raid mirror contents mismatch, the 23% performance drop should still be
> > > acceptable, and bcachefs is already doing this bouncing behavior.
> > 
> > XFS (and EXT4 as well, but I've not tested it) wait for I/O to
> > finish before allowing modifications when mapping_stable_writes returns
> > true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
> > is fine.
> 
> But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for write
> to finish.
> 
> Wouldn't that cause two mirrors to differ from each other due to timing
> difference?
> 
> >  Direct I/O is broken, and at least for XFS I have patches
> > to force DONTCACHE instead of DIRECT I/O by default in that case, but
> > allowing for an opt-out for known applications (e.g. file or storage
> > servers).
> > 
> > I'll need to rebase them, but I plan to send them out soon together
> > with other T10 PI enabling patches.  Sorry, juggling a few too many
> > things at the moment.
> > 
> > > But still, such performance drop can be very obvious, and performance
> > > oriented users (who are very happy running various benchmark tools) are
> > > going to notice or even complain.
> > 
> > I've unfortunately seen much bigger performance drops with direct I/O and
> > PI on fast SSDs, but we still should be safe by default.
> > 
> > > Another question is, should we push this behavior to iomap layer so that other
> > > fses can also benefit from it?
> > 
> > The right place is above iomap to pick the buffered I/O path instead.
> 
> But falling back to buffered IO performance is so miserable that wiped out
> almost one or more decades of storage performance improvement.
> 
> Thanks,
> Qu
> 
> > 
> > The real question is if we can finally get a version of pin_user_pages
> > that prevents user modifications entirely.
---end quoted text---

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
  2025-10-20 10:24   ` Qu Wenruo
@ 2025-10-20 11:16   ` Jan Kara
  2025-10-20 11:44     ` Christoph Hellwig
  2025-10-21  3:17   ` Qu Wenruo
  2 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2025-10-20 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel,
	linux-block, linux-mm, martin.petersen, jack

On Mon 20-10-25 03:00:43, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> > But still, such performance drop can be very obvious, and performance
> > oriented users (who are very happy running various benchmark tools) are
> > going to notice or even complain.
> 
> I've unfortunately seen much bigger performance drops with direct I/O and
> PI on fast SSDs, but we still should be safe by default.
> 
> > Another question is, should we push this behavior to iomap layer so that other
> > fses can also benefit from it?
> 
> The right place is above iomap to pick the buffered I/O path instead.
> 
> The real question is if we can finally get a version of pin_user_pages
> that prevents user modifications entirely.

Hmm, this is an interesting twist in the problems with pinned pages - so
far I was thinking about problems where pinned page cache page gets
modified (e.g. through DIO or RDMA) and this causes checksum failures if
it races with writeback. If I understand you right, now you are concerned
about a situation where some page is used as a buffer for direct IO write
/ RDMA and it gets modified while the DMA is running which causes checksum
mismatch? Writeprotecting the buffer before the DIO starts isn't that hard
to do (although it has a non-trivial cost) but we don't have a mechanism to
make sure the page cannot be writeably mapped while it is pinned (and
avoiding that without introducing deadlocks would be *fun*).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 11:16   ` Jan Kara
@ 2025-10-20 11:44     ` Christoph Hellwig
  2025-10-20 13:59       ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-20 11:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On Mon, Oct 20, 2025 at 01:16:39PM +0200, Jan Kara wrote:
> Hmm, this is an interesting twist in the problems with pinned pages - so
> far I was thinking about problems where pinned page cache page gets
> modified (e.g. through DIO or RDMA) and this causes checksum failures if
> it races with writeback. If I understand you right, now you are concerned
> about a situation where some page is used as a buffer for direct IO write
> / RDMA and it gets modified while the DMA is running which causes checksum
> mismatch?

Really all of the above.  Even worse this can also happen for reads,
e.g. when the parity or checksum is calculated in the user buffer.


> Writeprotecting the buffer before the DIO starts isn't that hard
> to do (although it has a non-trivial cost) but we don't have a mechanism to
> make sure the page cannot be writeably mapped while it is pinned (and
> avoiding that without introducing deadlocks would be *fun*).

Well, this goes back to the old idea of maybe bounce buffering in that
case?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 11:44     ` Christoph Hellwig
@ 2025-10-20 13:59       ` Jan Kara
  2025-10-20 14:59         ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2025-10-20 13:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On Mon 20-10-25 04:44:27, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 01:16:39PM +0200, Jan Kara wrote:
> > Hmm, this is an interesting twist in the problems with pinned pages - so
> > far I was thinking about problems where pinned page cache page gets
> > modified (e.g. through DIO or RDMA) and this causes checksum failures if
> > it races with writeback. If I understand you right, now you are concerned
> > about a situation where some page is used as a buffer for direct IO write
> > / RDMA and it gets modified while the DMA is running which causes checksum
> > mismatch?
> 
> Really all of the above.  Even worse this can also happen for reads,
> e.g. when the parity or checksum is calculated in the user buffer.

OK.

> > Writeprotecting the buffer before the DIO starts isn't that hard
> > to do (although it has a non-trivial cost) but we don't have a mechanism to
> > make sure the page cannot be writeably mapped while it is pinned (and
> > avoiding that without introducing deadlocks would be *fun*).
> 
> Well, this goes back to the old idea of maybe bounce buffering in that
> case?

The idea was to bounce buffer the page we are writing back in case we spot
a long-term pin we cannot just wait for - hence bouncing should be rare.
But in this more general setting it is challenging to not bounce buffer for
every IO (in which case you'd be basically at performance of RWF_DONTCACHE
IO or perhaps worse so why bother?). Essentially if you hand out the real
page underlying the buffer for the IO, all other attemps to do IO to that
page have to block - bouncing is no longer an option because even with
bouncing the second IO we could still corrupt data of the first IO once we
copy to the final buffer. And if we'd block waiting for the first IO to
complete, userspace could construct deadlock cycles - like racing IO to
pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
of this...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 13:59       ` Jan Kara
@ 2025-10-20 14:59         ` Matthew Wilcox
  2025-10-20 15:58           ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2025-10-20 14:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
> The idea was to bounce buffer the page we are writing back in case we spot
> a long-term pin we cannot just wait for - hence bouncing should be rare.
> But in this more general setting it is challenging to not bounce buffer for
> every IO (in which case you'd be basically at performance of RWF_DONTCACHE
> IO or perhaps worse so why bother?). Essentially if you hand out the real
> page underlying the buffer for the IO, all other attemps to do IO to that
> page have to block - bouncing is no longer an option because even with
> bouncing the second IO we could still corrupt data of the first IO once we
> copy to the final buffer. And if we'd block waiting for the first IO to
> complete, userspace could construct deadlock cycles - like racing IO to
> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
> of this...

There isn't one.  We might have DMA-mapped this page earlier, and so a
device could write to it at any time.  Even if we remove PTE write
permissions ...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 14:59         ` Matthew Wilcox
@ 2025-10-20 15:58           ` Jan Kara
  2025-10-20 17:55             ` John Hubbard
  2025-10-20 19:00             ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Jan Kara @ 2025-10-20 15:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong,
	linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen,
	jack

On Mon 20-10-25 15:59:07, Matthew Wilcox wrote:
> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
> > The idea was to bounce buffer the page we are writing back in case we spot
> > a long-term pin we cannot just wait for - hence bouncing should be rare.
> > But in this more general setting it is challenging to not bounce buffer for
> > every IO (in which case you'd be basically at performance of RWF_DONTCACHE
> > IO or perhaps worse so why bother?). Essentially if you hand out the real
> > page underlying the buffer for the IO, all other attemps to do IO to that
> > page have to block - bouncing is no longer an option because even with
> > bouncing the second IO we could still corrupt data of the first IO once we
> > copy to the final buffer. And if we'd block waiting for the first IO to
> > complete, userspace could construct deadlock cycles - like racing IO to
> > pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
> > of this...
> 
> There isn't one.  We might have DMA-mapped this page earlier, and so a
> device could write to it at any time.  Even if we remove PTE write
> permissions ...

True but writes through DMA to the page are guarded by holding a page pin
these days so we could in theory block getting another page pin or mapping
the page writeably until the pin is released... if we can figure out a
convincing story for dealing with long-term pins from RDMA and dealing with
possible deadlocks created by this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 15:58           ` Jan Kara
@ 2025-10-20 17:55             ` John Hubbard
  2025-10-21  8:27               ` Jan Kara
  2025-10-20 19:00             ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: John Hubbard @ 2025-10-20 17:55 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On 10/20/25 8:58 AM, Jan Kara wrote:
> On Mon 20-10-25 15:59:07, Matthew Wilcox wrote:
>> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
>>> The idea was to bounce buffer the page we are writing back in case we spot
>>> a long-term pin we cannot just wait for - hence bouncing should be rare.
>>> But in this more general setting it is challenging to not bounce buffer for
>>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE
>>> IO or perhaps worse so why bother?). Essentially if you hand out the real
>>> page underlying the buffer for the IO, all other attemps to do IO to that
>>> page have to block - bouncing is no longer an option because even with
>>> bouncing the second IO we could still corrupt data of the first IO once we
>>> copy to the final buffer. And if we'd block waiting for the first IO to
>>> complete, userspace could construct deadlock cycles - like racing IO to
>>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
>>> of this...
>>
>> There isn't one.  We might have DMA-mapped this page earlier, and so a
>> device could write to it at any time.  Even if we remove PTE write
>> permissions ...
> 
> True but writes through DMA to the page are guarded by holding a page pin
> these days so we could in theory block getting another page pin or mapping

Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM
page pin"? Or something else (that's new to me)?


thanks,
John Hubbard

> the page writeably until the pin is released... if we can figure out a
> convincing story for dealing with long-term pins from RDMA and dealing with
> possible deadlocks created by this.
> 
> 								Honza


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 17:55             ` John Hubbard
@ 2025-10-21  8:27               ` Jan Kara
  2025-10-21 16:56                 ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2025-10-21  8:27 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Matthew Wilcox, Christoph Hellwig, Qu Wenruo,
	linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Mon 20-10-25 10:55:06, John Hubbard wrote:
> On 10/20/25 8:58 AM, Jan Kara wrote:
> > On Mon 20-10-25 15:59:07, Matthew Wilcox wrote:
> > > On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
> > > > The idea was to bounce buffer the page we are writing back in case we spot
> > > > a long-term pin we cannot just wait for - hence bouncing should be rare.
> > > > But in this more general setting it is challenging to not bounce buffer for
> > > > every IO (in which case you'd be basically at performance of RWF_DONTCACHE
> > > > IO or perhaps worse so why bother?). Essentially if you hand out the real
> > > > page underlying the buffer for the IO, all other attemps to do IO to that
> > > > page have to block - bouncing is no longer an option because even with
> > > > bouncing the second IO we could still corrupt data of the first IO once we
> > > > copy to the final buffer. And if we'd block waiting for the first IO to
> > > > complete, userspace could construct deadlock cycles - like racing IO to
> > > > pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
> > > > of this...
> > > 
> > > There isn't one.  We might have DMA-mapped this page earlier, and so a
> > > device could write to it at any time.  Even if we remove PTE write
> > > permissions ...
> > 
> > True but writes through DMA to the page are guarded by holding a page pin
> > these days so we could in theory block getting another page pin or mapping
> 
> Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM
> page pin"? Or something else (that's new to me)?

I meant to say that users that end up setting up DMA to a page also hold a
page pin (either longterm for RDMA and similar users or shortterm for
direct IO). Do you disagree?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  8:27               ` Jan Kara
@ 2025-10-21 16:56                 ` John Hubbard
  0 siblings, 0 replies; 27+ messages in thread
From: John Hubbard @ 2025-10-21 16:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong,
	linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen,
	jack

On 10/21/25 1:27 AM, Jan Kara wrote:
> On Mon 20-10-25 10:55:06, John Hubbard wrote:
>> On 10/20/25 8:58 AM, Jan Kara wrote:
>>> On Mon 20-10-25 15:59:07, Matthew Wilcox wrote:
>>>> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
>>>>> The idea was to bounce buffer the page we are writing back in case we spot
>>>>> a long-term pin we cannot just wait for - hence bouncing should be rare.
>>>>> But in this more general setting it is challenging to not bounce buffer for
>>>>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE
>>>>> IO or perhaps worse so why bother?). Essentially if you hand out the real
>>>>> page underlying the buffer for the IO, all other attemps to do IO to that
>>>>> page have to block - bouncing is no longer an option because even with
>>>>> bouncing the second IO we could still corrupt data of the first IO once we
>>>>> copy to the final buffer. And if we'd block waiting for the first IO to
>>>>> complete, userspace could construct deadlock cycles - like racing IO to
>>>>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
>>>>> of this...
>>>>
>>>> There isn't one.  We might have DMA-mapped this page earlier, and so a
>>>> device could write to it at any time.  Even if we remove PTE write
>>>> permissions ...
>>>
>>> True but writes through DMA to the page are guarded by holding a page pin
>>> these days so we could in theory block getting another page pin or mapping
>>
>> Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM
>> page pin"? Or something else (that's new to me)?
> 
> I meant to say that users that end up setting up DMA to a page also hold a
> page pin (either longterm for RDMA and similar users or shortterm for
> direct IO). Do you disagree?
> 
> 								Honza

Completely agree. I see what you have in mind now. I was hung up on
the "page pins won't prevent DMA from happening, once it is set up"
point, but your idea is to detect that that is already set up, and
make decisions from there...that part I get now.


thanks,
-- 
John Hubbard


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 15:58           ` Jan Kara
  2025-10-20 17:55             ` John Hubbard
@ 2025-10-20 19:00             ` David Hildenbrand
  2025-10-21  7:49               ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-10-20 19:00 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On 20.10.25 17:58, Jan Kara wrote:
> On Mon 20-10-25 15:59:07, Matthew Wilcox wrote:
>> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote:
>>> The idea was to bounce buffer the page we are writing back in case we spot
>>> a long-term pin we cannot just wait for - hence bouncing should be rare.
>>> But in this more general setting it is challenging to not bounce buffer for
>>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE
>>> IO or perhaps worse so why bother?). Essentially if you hand out the real
>>> page underlying the buffer for the IO, all other attemps to do IO to that
>>> page have to block - bouncing is no longer an option because even with
>>> bouncing the second IO we could still corrupt data of the first IO once we
>>> copy to the final buffer. And if we'd block waiting for the first IO to
>>> complete, userspace could construct deadlock cycles - like racing IO to
>>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out
>>> of this...
>>
>> There isn't one.  We might have DMA-mapped this page earlier, and so a
>> device could write to it at any time.  Even if we remove PTE write
>> permissions ...
> 
> True but writes through DMA to the page are guarded by holding a page pin
> these days so we could in theory block getting another page pin or mapping
> the page writeably until the pin is released... if we can figure out a
> convincing story for dealing with long-term pins from RDMA and dealing with
> possible deadlocks created by this.

Just FYI, because it might be interesting in this context.

For anonymous memory we have this working by only writing the folio out 
if it is completely unmapped and there are no unexpected folio 
references/pins (see pageout()), and only allowing to write to such a 
folio ("reuse") if SWP_STABLE_WRITES is not set (see do_swap_page()).

So once we start writeback the folio has no writable page table mappings 
(unmapped) and no GUP pins. Consequently, when trying to write to it we 
can just fallback to creating a page copy without causing trouble with 
GUP pins.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 19:00             ` David Hildenbrand
@ 2025-10-21  7:49               ` Christoph Hellwig
  2025-10-21  7:57                 ` David Hildenbrand
  2025-10-21  9:22                 ` Jan Kara
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-21  7:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jan Kara, Matthew Wilcox, Christoph Hellwig, Qu Wenruo,
	linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
> Just FYI, because it might be interesting in this context.
> 
> For anonymous memory we have this working by only writing the folio out if
> it is completely unmapped and there are no unexpected folio references/pins
> (see pageout()), and only allowing to write to such a folio ("reuse") if
> SWP_STABLE_WRITES is not set (see do_swap_page()).
> 
> So once we start writeback the folio has no writable page table mappings
> (unmapped) and no GUP pins. Consequently, when trying to write to it we can
> just fallback to creating a page copy without causing trouble with GUP pins.

Yeah.  But anonymous is the easy case, the pain is direct I/O to file
mappings.  Mapping the right answer is to just fail pinning them and fall 
back to (dontcache) buffered I/O.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  7:49               ` Christoph Hellwig
@ 2025-10-21  7:57                 ` David Hildenbrand
  2025-10-21  9:33                   ` Jan Kara
  2025-10-21  9:22                 ` Jan Kara
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-10-21  7:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong,
	linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen,
	jack

On 21.10.25 09:49, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
>> Just FYI, because it might be interesting in this context.
>>
>> For anonymous memory we have this working by only writing the folio out if
>> it is completely unmapped and there are no unexpected folio references/pins
>> (see pageout()), and only allowing to write to such a folio ("reuse") if
>> SWP_STABLE_WRITES is not set (see do_swap_page()).
>>
>> So once we start writeback the folio has no writable page table mappings
>> (unmapped) and no GUP pins. Consequently, when trying to write to it we can
>> just fallback to creating a page copy without causing trouble with GUP pins.
> 
> Yeah.  But anonymous is the easy case, the pain is direct I/O to file
> mappings.  Mapping the right answer is to just fail pinning them and fall
> back to (dontcache) buffered I/O.

Right, I think the rules could likely be

a) Don't start writeback to such devices if there may be GUP pins (o 
writeble PTEs)

b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device

Regarding b), I would have thought that GUP would find the PTE to not be 
writable and consequently trigger a page fault first to make it 
writable? And I'd have thought that we cannot make such a PTE writable 
while there is writeback to such a device going on (otherwise the CPU 
could just cause trouble).

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  7:57                 ` David Hildenbrand
@ 2025-10-21  9:33                   ` Jan Kara
  2025-10-21  9:43                     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2025-10-21  9:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Jan Kara, Matthew Wilcox, Qu Wenruo,
	linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Tue 21-10-25 09:57:08, David Hildenbrand wrote:
> On 21.10.25 09:49, Christoph Hellwig wrote:
> > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
> > > Just FYI, because it might be interesting in this context.
> > > 
> > > For anonymous memory we have this working by only writing the folio out if
> > > it is completely unmapped and there are no unexpected folio references/pins
> > > (see pageout()), and only allowing to write to such a folio ("reuse") if
> > > SWP_STABLE_WRITES is not set (see do_swap_page()).
> > > 
> > > So once we start writeback the folio has no writable page table mappings
> > > (unmapped) and no GUP pins. Consequently, when trying to write to it we can
> > > just fallback to creating a page copy without causing trouble with GUP pins.
> > 
> > Yeah.  But anonymous is the easy case, the pain is direct I/O to file
> > mappings.  Mapping the right answer is to just fail pinning them and fall
> > back to (dontcache) buffered I/O.
> 
> Right, I think the rules could likely be
> 
> a) Don't start writeback to such devices if there may be GUP pins (o
> writeble PTEs)
> 
> b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device
> 
> Regarding b), I would have thought that GUP would find the PTE to not be
> writable and consequently trigger a page fault first to make it writable?
> And I'd have thought that we cannot make such a PTE writable while there is
> writeback to such a device going on (otherwise the CPU could just cause
> trouble).

See some of the cases in my reply to Christoph. It is also stuff like:

c) Don't allow FOLL_WRITE GUP pins or writeable mapping if there are *any*
pins to the page.

And we'd have to write-protect the page in the page tables at the moment we
obtain the FOLL_WRITE GUP pin to make sure the pin owner is the only thread
able to modify that page contents while the DIO is running.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  9:33                   ` Jan Kara
@ 2025-10-21  9:43                     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-10-21  9:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong,
	linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen,
	jack

On 21.10.25 11:33, Jan Kara wrote:
> On Tue 21-10-25 09:57:08, David Hildenbrand wrote:
>> On 21.10.25 09:49, Christoph Hellwig wrote:
>>> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
>>>> Just FYI, because it might be interesting in this context.
>>>>
>>>> For anonymous memory we have this working by only writing the folio out if
>>>> it is completely unmapped and there are no unexpected folio references/pins
>>>> (see pageout()), and only allowing to write to such a folio ("reuse") if
>>>> SWP_STABLE_WRITES is not set (see do_swap_page()).
>>>>
>>>> So once we start writeback the folio has no writable page table mappings
>>>> (unmapped) and no GUP pins. Consequently, when trying to write to it we can
>>>> just fallback to creating a page copy without causing trouble with GUP pins.
>>>
>>> Yeah.  But anonymous is the easy case, the pain is direct I/O to file
>>> mappings.  Mapping the right answer is to just fail pinning them and fall
>>> back to (dontcache) buffered I/O.
>>
>> Right, I think the rules could likely be
>>
>> a) Don't start writeback to such devices if there may be GUP pins (o
>> writeble PTEs)
>>
>> b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device
>>
>> Regarding b), I would have thought that GUP would find the PTE to not be
>> writable and consequently trigger a page fault first to make it writable?
>> And I'd have thought that we cannot make such a PTE writable while there is
>> writeback to such a device going on (otherwise the CPU could just cause
>> trouble).
> 
> See some of the cases in my reply to Christoph. It is also stuff like:
> 
> c) Don't allow FOLL_WRITE GUP pins or writeable mapping if there are *any*
> pins to the page.
> 
> And we'd have to write-protect the page in the page tables at the moment we
> obtain the FOLL_WRITE GUP pin to make sure the pin owner is the only thread
> able to modify that page contents while the DIO is running.

Oh that's nasty, but yeah I understood the problem now, thanks.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  7:49               ` Christoph Hellwig
  2025-10-21  7:57                 ` David Hildenbrand
@ 2025-10-21  9:22                 ` Jan Kara
  2025-10-21  9:37                   ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: Jan Kara @ 2025-10-21  9:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Hildenbrand, Jan Kara, Matthew Wilcox, Qu Wenruo,
	linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Tue 21-10-25 00:49:49, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
> > Just FYI, because it might be interesting in this context.
> > 
> > For anonymous memory we have this working by only writing the folio out if
> > it is completely unmapped and there are no unexpected folio references/pins
> > (see pageout()), and only allowing to write to such a folio ("reuse") if
> > SWP_STABLE_WRITES is not set (see do_swap_page()).
> > 
> > So once we start writeback the folio has no writable page table mappings
> > (unmapped) and no GUP pins. Consequently, when trying to write to it we can
> > just fallback to creating a page copy without causing trouble with GUP pins.
> 
> Yeah.  But anonymous is the easy case, the pain is direct I/O to file
> mappings.  Mapping the right answer is to just fail pinning them and fall 
> back to (dontcache) buffered I/O.

I agree file mappings are more painful but we can also have interesting
cases with anon pages:

P - anon page

Thread 1				Thread 2
setup DIO read to P			setup DIO write from P

And now you can get checksum failures for the write unless the write is
bounced (falling back to dontcache). Similarly with reads:

Thread 1				Thread 2
setup DIO read to P			setup DIO read to P

you can get read checksum mismatch unless both reads are bounced (bouncing
one of the reads is not enough because the memcpy from the bounce page to
the final buffer may break checksum computation of the IO going directly).

So to avoid checksum failures even if user screws up and buffers overlap we
need to bounce every IO even to/from anon memory. Or we need to block one
of the IOs until the other one completes - a scheme that could work is we'd
try to acquire kind of exclusive pin to all the pages (page lock?). If we
succeed, we run the IO directly. If we don't succeed, we wait for the
exclusive pins to be released, acquire standard pin (to block exclusive
pinning) and *then* submit uncached IO. But it is all rather complex and
I'm not sure it's worth it...

For file mappings things get even more complex because you can do:

P - file mapping page

Thread 1				Thread 2
setup DIO write from P			setup buffered write from Q to P

and you get checksum failures for the DIO write. So if we don't bounce the
DIO, we'd also have to teach buffered IO to avoid corrupting buffers of DIO
in flight.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  9:22                 ` Jan Kara
@ 2025-10-21  9:37                   ` David Hildenbrand
  2025-10-21  9:52                     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-10-21  9:37 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig
  Cc: Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On 21.10.25 11:22, Jan Kara wrote:
> On Tue 21-10-25 00:49:49, Christoph Hellwig wrote:
>> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
>>> Just FYI, because it might be interesting in this context.
>>>
>>> For anonymous memory we have this working by only writing the folio out if
>>> it is completely unmapped and there are no unexpected folio references/pins
>>> (see pageout()), and only allowing to write to such a folio ("reuse") if
>>> SWP_STABLE_WRITES is not set (see do_swap_page()).
>>>
>>> So once we start writeback the folio has no writable page table mappings
>>> (unmapped) and no GUP pins. Consequently, when trying to write to it we can
>>> just fallback to creating a page copy without causing trouble with GUP pins.
>>
>> Yeah.  But anonymous is the easy case, the pain is direct I/O to file
>> mappings.  Mapping the right answer is to just fail pinning them and fall
>> back to (dontcache) buffered I/O.
> 
> I agree file mappings are more painful but we can also have interesting
> cases with anon pages:
> 
> P - anon page
> 
> Thread 1				Thread 2
> setup DIO read to P			setup DIO write from P

Ah, I was talking about the interaction between GUP and having 
BLK_FEAT_STABLE_WRITES set on the swap backend.

I guess what you mean here is: GUP from/to anon pages to/from a device 
that has BLK_FEAT_STABLE_WRITES?

So while we are writing to the device using the anon page as a source, 
the anon page will get modified.

I did not expect that to trigger checksum failures, but I can see the 
problem now.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  9:37                   ` David Hildenbrand
@ 2025-10-21  9:52                     ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2025-10-21  9:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Qu Wenruo,
	linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

On Tue 21-10-25 11:37:57, David Hildenbrand wrote:
> On 21.10.25 11:22, Jan Kara wrote:
> > On Tue 21-10-25 00:49:49, Christoph Hellwig wrote:
> > > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
> > > > Just FYI, because it might be interesting in this context.
> > > > 
> > > > For anonymous memory we have this working by only writing the folio out if
> > > > it is completely unmapped and there are no unexpected folio references/pins
> > > > (see pageout()), and only allowing to write to such a folio ("reuse") if
> > > > SWP_STABLE_WRITES is not set (see do_swap_page()).
> > > > 
> > > > So once we start writeback the folio has no writable page table mappings
> > > > (unmapped) and no GUP pins. Consequently, when trying to write to it we can
> > > > just fallback to creating a page copy without causing trouble with GUP pins.
> > > 
> > > Yeah.  But anonymous is the easy case, the pain is direct I/O to file
> > > mappings.  Mapping the right answer is to just fail pinning them and fall
> > > back to (dontcache) buffered I/O.
> > 
> > I agree file mappings are more painful but we can also have interesting
> > cases with anon pages:
> > 
> > P - anon page
> > 
> > Thread 1				Thread 2
> > setup DIO read to P			setup DIO write from P
> 
> Ah, I was talking about the interaction between GUP and having
> BLK_FEAT_STABLE_WRITES set on the swap backend.
> 
> I guess what you mean here is: GUP from/to anon pages to/from a device that
> has BLK_FEAT_STABLE_WRITES?

Correct.

> So while we are writing to the device using the anon page as a source, the
> anon page will get modified.
> 
> I did not expect that to trigger checksum failures, but I can see the
> problem now.

Sadly it can because the checksum computation may end up using different
data than the DMA will send to the device a bit later. After all this is
why BLK_FEAT_STABLE_WRITES was invented...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
  2025-10-20 10:24   ` Qu Wenruo
  2025-10-20 11:16   ` Jan Kara
@ 2025-10-21  3:17   ` Qu Wenruo
  2025-10-21  7:48     ` Christoph Hellwig
  2 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-10-21  3:17 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack

在 2025/10/20 20:30, Christoph Hellwig 写道:>
>> But still, such performance drop can be very obvious, and performance
>> oriented users (who are very happy running various benchmark tools) are
>> going to notice or even complain.
> 
> I've unfortunately seen much bigger performance drops with direct I/O and
> PI on fast SSDs, but we still should be safe by default.

Off-topic a little, mind to share the performance drop with PI enabled 
on XFS?

With this patch I'm able to enable direct IO for inodes with checksums.
I thought it would easily improve the performance, but the truth is, 
it's not that different from buffered IO fall back.

It's still the old 200MiB/s (vs ~2GiB/s nodatasum), no matter falling 
back to buffered IO or not (and extra traces indeed shows it's really 
going direct IO path).

So I start wondering if it's the checksum itself causing the miserable 
performance numbers.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  3:17   ` Qu Wenruo
@ 2025-10-21  7:48     ` Christoph Hellwig
  2025-10-21  8:15       ` Qu Wenruo
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2025-10-21  7:48 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs,
	linux-fsdevel, linux-block, linux-mm, martin.petersen, jack

On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote:
> Off-topic a little, mind to share the performance drop with PI enabled on
> XFS?

If the bandwith of the SSDs get close or exceeds the DRAM bandwith
buffered I/O can be 50% or less of the direct I/O performance.

> With this patch I'm able to enable direct IO for inodes with checksums.
> I thought it would easily improve the performance, but the truth is, it's
> not that different from buffered IO fall back.

That's because you still copy data.

> So I start wondering if it's the checksum itself causing the miserable
> performance numbers.

Only indirectly by touching all the cachelines.  But once you copy you
touch them again.  Especially if not done in small chunks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  7:48     ` Christoph Hellwig
@ 2025-10-21  8:15       ` Qu Wenruo
  2025-10-21 11:30         ` Johannes Thumshirn
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-10-21  8:15 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block,
	linux-mm, martin.petersen, jack



在 2025/10/21 18:18, Christoph Hellwig 写道:
> On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote:
>> Off-topic a little, mind to share the performance drop with PI enabled on
>> XFS?
> 
> If the bandwith of the SSDs get close or exceeds the DRAM bandwith
> buffered I/O can be 50% or less of the direct I/O performance.

In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less 
than 5GiB/s).

> 
>> With this patch I'm able to enable direct IO for inodes with checksums.
>> I thought it would easily improve the performance, but the truth is, it's
>> not that different from buffered IO fall back.
> 
> That's because you still copy data.

Enabling the extra copy for direct IO only drops around 15~20% 
performance, but that's on no csum case.

So far the calculation matches your estimation, but...

> 
>> So I start wondering if it's the checksum itself causing the miserable
>> performance numbers.
> 
> Only indirectly by touching all the cachelines.  But once you copy you
> touch them again.  Especially if not done in small chunks.

As long as I enable checksum verification, even with the bouncing page 
direct IO, the result is not any better than buffered IO fallback, all 
around 10% (not by 10%, at 10%) of the direct IO speed (no matter 
bouncing or not).

Maybe I need to check if the proper hardware accelerated CRC32 is 
utilized...

Thanks,
Qu


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21  8:15       ` Qu Wenruo
@ 2025-10-21 11:30         ` Johannes Thumshirn
  2025-10-22  2:27           ` Qu Wenruo
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2025-10-21 11:30 UTC (permalink / raw)
  To: WenRuo Qu, hch@infradead.org, Qu Wenruo
  Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com

On 10/21/25 10:15 AM, Qu Wenruo wrote:
>
> 在 2025/10/21 18:18, Christoph Hellwig 写道:
>> On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote:
>>> Off-topic a little, mind to share the performance drop with PI enabled on
>>> XFS?
>> If the bandwith of the SSDs get close or exceeds the DRAM bandwith
>> buffered I/O can be 50% or less of the direct I/O performance.
> In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less
> than 5GiB/s).
>
>>> With this patch I'm able to enable direct IO for inodes with checksums.
>>> I thought it would easily improve the performance, but the truth is, it's
>>> not that different from buffered IO fall back.
>> That's because you still copy data.
> Enabling the extra copy for direct IO only drops around 15~20%
> performance, but that's on no csum case.
>
> So far the calculation matches your estimation, but...
>
>>> So I start wondering if it's the checksum itself causing the miserable
>>> performance numbers.
>> Only indirectly by touching all the cachelines.  But once you copy you
>> touch them again.  Especially if not done in small chunks.
> As long as I enable checksum verification, even with the bouncing page
> direct IO, the result is not any better than buffered IO fallback, all
> around 10% (not by 10%, at 10%) of the direct IO speed (no matter
> bouncing or not).
>
> Maybe I need to check if the proper hardware accelerated CRC32 is
> utilized...


You could also hack in a NULL-csum for testing. Something that writes a 
fixed value every time. This would then rule out all the cost of the 
csum generation and only test the affected IO paths.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-21 11:30         ` Johannes Thumshirn
@ 2025-10-22  2:27           ` Qu Wenruo
  2025-10-22  5:04             ` hch
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-10-22  2:27 UTC (permalink / raw)
  To: Johannes Thumshirn, WenRuo Qu, hch@infradead.org
  Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com



在 2025/10/21 22:00, Johannes Thumshirn 写道:
> On 10/21/25 10:15 AM, Qu Wenruo wrote:
>>
>> 在 2025/10/21 18:18, Christoph Hellwig 写道:
>>> On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote:
>>>> Off-topic a little, mind to share the performance drop with PI enabled on
>>>> XFS?
>>> If the bandwith of the SSDs get close or exceeds the DRAM bandwith
>>> buffered I/O can be 50% or less of the direct I/O performance.
>> In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less
>> than 5GiB/s).
>>
>>>> With this patch I'm able to enable direct IO for inodes with checksums.
>>>> I thought it would easily improve the performance, but the truth is, it's
>>>> not that different from buffered IO fall back.
>>> That's because you still copy data.
>> Enabling the extra copy for direct IO only drops around 15~20%
>> performance, but that's on no csum case.
>>
>> So far the calculation matches your estimation, but...
>>
>>>> So I start wondering if it's the checksum itself causing the miserable
>>>> performance numbers.
>>> Only indirectly by touching all the cachelines.  But once you copy you
>>> touch them again.  Especially if not done in small chunks.
>> As long as I enable checksum verification, even with the bouncing page
>> direct IO, the result is not any better than buffered IO fallback, all
>> around 10% (not by 10%, at 10%) of the direct IO speed (no matter
>> bouncing or not).
>>
>> Maybe I need to check if the proper hardware accelerated CRC32 is
>> utilized...
> 
> 
> You could also hack in a NULL-csum for testing. Something that writes a
> fixed value every time. This would then rule out all the cost of the
> csum generation and only test the affected IO paths.
> 

It turns out to be checksum, and particularly my VM setup.

My VM is using kvm64 CPU type, which blocks quite a lot of CPU features, 
thus the CRC32 performance is pretty poor.

I just tried a short hack to always make direct IO to fallback to 
buffered IO, the nodatasum performance is the same as the bouncing page 
solution, so the slow down is not page cache itself but really the checksum.

With CPU features all passed to the VM, the falling-back-to-buffered 
direct IO performance is only slightly worse (10~20%) than nodatasum cases.

Really sorry for the noise.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-22  2:27           ` Qu Wenruo
@ 2025-10-22  5:04             ` hch
  2025-10-22  6:17               ` Qu Wenruo
  0 siblings, 1 reply; 27+ messages in thread
From: hch @ 2025-10-22  5:04 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Johannes Thumshirn, WenRuo Qu, hch@infradead.org,
	linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com

On Wed, Oct 22, 2025 at 12:57:51PM +1030, Qu Wenruo wrote:
> My VM is using kvm64 CPU type, which blocks quite a lot of CPU features,
> thus the CRC32 performance is pretty poor.

Yes, unaccelerated CRC32 is a bad idea.

> 
> I just tried a short hack to always make direct IO to fallback to buffered
> IO, the nodatasum performance is the same as the bouncing page solution, so
> the slow down is not page cache itself but really the checksum.
> 
> With CPU features all passed to the VM, the falling-back-to-buffered direct
> IO performance is only slightly worse (10~20%) than nodatasum cases.

I'm a bit lost, what are the exact cases you are comparing here?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-22  5:04             ` hch
@ 2025-10-22  6:17               ` Qu Wenruo
  2025-10-22  6:24                 ` hch
  0 siblings, 1 reply; 27+ messages in thread
From: Qu Wenruo @ 2025-10-22  6:17 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: Johannes Thumshirn, WenRuo Qu, linux-btrfs@vger.kernel.org,
	djwong@kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com



在 2025/10/22 15:34, hch@infradead.org 写道:
> On Wed, Oct 22, 2025 at 12:57:51PM +1030, Qu Wenruo wrote:
>> My VM is using kvm64 CPU type, which blocks quite a lot of CPU features,
>> thus the CRC32 performance is pretty poor.
> 
> Yes, unaccelerated CRC32 is a bad idea.
> 
>>
>> I just tried a short hack to always make direct IO to fallback to buffered
>> IO, the nodatasum performance is the same as the bouncing page solution, so
>> the slow down is not page cache itself but really the checksum.
>>
>> With CPU features all passed to the VM, the falling-back-to-buffered direct
>> IO performance is only slightly worse (10~20%) than nodatasum cases.
> 
> I'm a bit lost, what are the exact cases you are comparing here?

I'm just checking the impact of btrfs checksum for data writes.

I may skipped some points, but overall the idea is:

- direct writes, no data csum
   True zero-copy.

- direct writes, no data csum but with bounce pages
   Affected by memcpy()

- direct writes, no data csum but force fallback to buffered
   Affected by memcpy()

- direct writes, data sum
   Fallback to buffered IO, affected by both memcpy() and csum.

So far it looks like the bounce pages solution is wasting a lot of code 
for almost nothing, it's not any better than falling back.


And the checksum part looks can be improved for writes.

Since we always fallback to buffered IO for checksums, the content 
should not change and we can do the submission and checksum calculation 
in parallel.

Already got a prototype, results around 10% improvement inside my VM.
Will fix the bugs related to compression and send an RFC for it.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
  2025-10-22  6:17               ` Qu Wenruo
@ 2025-10-22  6:24                 ` hch
  0 siblings, 0 replies; 27+ messages in thread
From: hch @ 2025-10-22  6:24 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: hch@infradead.org, Johannes Thumshirn, WenRuo Qu,
	linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com

On Wed, Oct 22, 2025 at 04:47:56PM +1030, Qu Wenruo wrote:
> So far it looks like the bounce pages solution is wasting a lot of code for
> almost nothing, it's not any better than falling back.

That was my expectation, but I wasn't entirely sure from your wording
if that's what you measured.

> Since we always fallback to buffered IO for checksums, the content should
> not change and we can do the submission and checksum calculation in
> parallel.

Yes.

> Already got a prototype, results around 10% improvement inside my VM.
> Will fix the bugs related to compression and send an RFC for it.

Nice.  I've also started looking at my PI backlog for the XFS version.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-10-22  6:24 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
2025-10-20 10:24   ` Qu Wenruo
2025-10-20 11:45     ` Christoph Hellwig
2025-10-20 11:16   ` Jan Kara
2025-10-20 11:44     ` Christoph Hellwig
2025-10-20 13:59       ` Jan Kara
2025-10-20 14:59         ` Matthew Wilcox
2025-10-20 15:58           ` Jan Kara
2025-10-20 17:55             ` John Hubbard
2025-10-21  8:27               ` Jan Kara
2025-10-21 16:56                 ` John Hubbard
2025-10-20 19:00             ` David Hildenbrand
2025-10-21  7:49               ` Christoph Hellwig
2025-10-21  7:57                 ` David Hildenbrand
2025-10-21  9:33                   ` Jan Kara
2025-10-21  9:43                     ` David Hildenbrand
2025-10-21  9:22                 ` Jan Kara
2025-10-21  9:37                   ` David Hildenbrand
2025-10-21  9:52                     ` Jan Kara
2025-10-21  3:17   ` Qu Wenruo
2025-10-21  7:48     ` Christoph Hellwig
2025-10-21  8:15       ` Qu Wenruo
2025-10-21 11:30         ` Johannes Thumshirn
2025-10-22  2:27           ` Qu Wenruo
2025-10-22  5:04             ` hch
2025-10-22  6:17               ` Qu Wenruo
2025-10-22  6:24                 ` hch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).