Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@infradead.org>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com
Subject: Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Mon, 20 Oct 2025 20:54:49 +1030
Message-ID: <acbb5680-ef7d-4908-94f4-b4edb8b3c48e@gmx.com>
In-Reply-To: <aPYIS5rDfXhNNDHP@infradead.org>



On 2025/10/20 20:30, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
>> There is a bug report about that direct IO (and even concurrent buffered
>> IO) can lead to different contents of md-raid.
> 
> What concurrent buffered I/O?

In filemap_get_folio(), for address spaces with STABLE_WRITES set, there
is a folio_wait_stable() call that waits for writeback to finish.

But since almost no device (except md-raid56) sets that flag, if a folio
is still under writeback, XFS/EXT4 can still modify that folio (it is
not locked, only under writeback) for new incoming buffered writes.
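
To make the condition concrete, here is a toy userspace model of it. This
is only a sketch, not the kernel code: every toy_* name is made up for
illustration, and "waiting" is modelled as simply refusing the write.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: all toy_* names are hypothetical, not kernel APIs. */
struct toy_mapping {
	bool stable_writes;	/* stands in for the STABLE_WRITES flag */
};

struct toy_folio {
	struct toy_mapping *mapping;
	bool under_writeback;
	char data[16];
};

/* A buffered writer only has to wait if BOTH conditions hold. */
static bool toy_must_wait_stable(const struct toy_folio *folio)
{
	return folio->mapping->stable_writes && folio->under_writeback;
}

/*
 * Attempt a buffered overwrite of the folio contents.
 * Returns true if the folio was modified, false if the writer would
 * have had to wait for writeback first.
 */
static bool toy_buffered_write(struct toy_folio *folio, const char *src,
			       size_t len)
{
	assert(len <= sizeof(folio->data));
	if (toy_must_wait_stable(folio))
		return false;
	memcpy(folio->data, src, len);
	return true;
}
```

With stable_writes unset, toy_buffered_write() changes the data even while
under_writeback is true, which is the case described above.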

> 
>> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
>> ("btrfs: always fallback to buffered write if the inode requires
>> checksum"), however we still leave a hole for nodatasum cases.
>>
>> For nodatasum cases we still reuse the bio from direct IO, making it to
>> cause the same problem for RAID1*/5/6 profiles, and results
>> unreliable data contents read from disk, depending on the load balance.
>>
>> Just do not trust any bio from direct IO, and never reuse those bios even
>> for nodatasum cases. Instead alloc our own bio with newly allocated
>> pages.
>>
>> For direct read, submit that new bio, and at end io time copy the
>> contents to the dio bio.
>> For direct write, copy the contents from the dio bio, then submit the
>> new one.
> 
> This basically reinvents IOCB_DONTCACHE I/O with duplicate code?

This reminds me of the problem that btrfs cannot handle DONTCACHE
because of its async extents...

I definitely need to address it one day.

> 
>> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
>> modifying the page cache when it's still under writeback) can lead to
>> raid mirror contents mismatch, the 23% performance drop should still be
>> acceptable, and bcachefs is already doing this bouncing behavior.
> 
> XFS (and EXT4 as well, but I've not tested it) wait for I/O to
> finish before allowing modifications when mapping_stable_writes returns
> true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
> is fine.

But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for
the write to finish.

Wouldn't that cause the two mirrors to differ from each other due to
timing differences?
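
A toy userspace sketch of the timing problem I mean (again not kernel
code; the toy_* names, the 8-byte block size and the "race" callback are
all assumptions made up for illustration). The callback stands in for a
concurrent modification that lands between the two mirror writes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLK 8	/* hypothetical block size for the model */

/* Toy model only: two in-memory "mirrors" of one block. */
struct toy_raid1 {
	char mirror[2][TOY_BLK];
};

/* Simulates someone modifying the source buffer while IO is in flight. */
typedef void (*toy_race_fn)(char *user_buf);

static void toy_scribble(char *user_buf)
{
	user_buf[0] = 'X';
}

/* Zero-copy: each mirror is filled straight from the caller's buffer. */
static void toy_write_zero_copy(struct toy_raid1 *r, char *user_buf,
				toy_race_fn race)
{
	assert(r != NULL && user_buf != NULL);
	memcpy(r->mirror[0], user_buf, TOY_BLK);
	if (race)
		race(user_buf);		/* buffer changes mid-flight */
	memcpy(r->mirror[1], user_buf, TOY_BLK);
}

/* Bounced: snapshot the buffer once, then fill both mirrors from it. */
static void toy_write_bounced(struct toy_raid1 *r, char *user_buf,
			      toy_race_fn race)
{
	char bounce[TOY_BLK];

	assert(r != NULL && user_buf != NULL);
	memcpy(bounce, user_buf, TOY_BLK);
	memcpy(r->mirror[0], bounce, TOY_BLK);
	if (race)
		race(user_buf);		/* no longer affects the IO */
	memcpy(r->mirror[1], bounce, TOY_BLK);
}

static bool toy_mirrors_match(const struct toy_raid1 *r)
{
	return memcmp(r->mirror[0], r->mirror[1], TOY_BLK) == 0;
}
```

Calling toy_write_zero_copy() with toy_scribble leaves the mirrors
different, while toy_write_bounced() with the same callback keeps them
identical, at the cost of the extra copy.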

>  Direct I/O is broken, and at least for XFS I have patches
> to force DONTCACHE instead of DIRECT I/O by default in that case, but
> allowing for an opt-out for known applications (e.g. file or storage
> servers).
> 
> I'll need to rebase them, but I plan to send them out soon together
> with other T10 PI enabling patches.  Sorry, juggling a few too many
> things at the moment.
> 
>> But still, such performance drop can be very obvious, and performance
>> oriented users (who are very happy running various benchmark tools) are
>> going to notice or even complain.
> 
> I've unfortunately seen much bigger performance drops with direct I/O and
> PI on fast SSDs, but we still should be safe by default.
> 
>> Another question is, should we push this behavior to iomap layer so that other
>> fses can also benefit from it?
> 
> The right place is above iomap to pick the buffered I/O path instead.

But the performance of falling back to buffered IO is so miserable that
it wipes out a decade or more of storage performance improvement.

Thanks,
Qu

> 
> The real question is if we can finally get a version of pin_user_pages
> that prevents user modifications entirely.
