From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@infradead.org>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, linux-mm@kvack.org,
martin.petersen@oracle.com, jack@suse.com
Subject: Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Mon, 20 Oct 2025 20:54:49 +1030 [thread overview]
Message-ID: <acbb5680-ef7d-4908-94f4-b4edb8b3c48e@gmx.com> (raw)
In-Reply-To: <aPYIS5rDfXhNNDHP@infradead.org>
在 2025/10/20 20:30, Christoph Hellwig 写道:
> On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
>> There is a bug report about that direct IO (and even concurrent buffered
>> IO) can lead to different contents of md-raid.
>
> What concurrent buffered I/O?
filemap_get_folio(), for address spaces with STABEL_WRITES, there will
be a folio_wait_stable() call to wait for writeback.
But since almost no device (except md-raid56) set that flag, if a folio
is still under writeback, XFS/EXT4 can still modify that folio (since
it's not locked, just under writeback) for new incoming buffered writes.
>
>> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
>> ("btrfs: always fallback to buffered write if the inode requires
>> checksum"), however we still leave a hole for nodatasum cases.
>>
>> For nodatasum cases we still reuse the bio from direct IO, making it to
>> cause the same problem for RAID1*/5/6 profiles, and results
>> unreliable data contents read from disk, depending on the load balance.
>>
>> Just do not trust any bio from direct IO, and never reuse those bios even
>> for nodatasum cases. Instead alloc our own bio with newly allocated
>> pages.
>>
>> For direct read, submit that new bio, and at end io time copy the
>> contents to the dio bio.
>> For direct write, copy the contents from the dio bio, then submit the
>> new one.
>
> This basically reinvents IOCB_DONTCACHE I/O with duplicate code?
This reminds me the problem that btrfs can not handle DONTCACHE due to
its async extents...
I definitely need to address it one day.
>
>> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
>> modifying the page cache when it's still under writeback) can lead to
>> raid mirror contents mismatch, the 23% performance drop should still be
>> acceptable, and bcachefs is already doing this bouncing behavior.
>
> XFS (and EXT4 as well, but I've not tested it) wait for I/O to
> finish before allowing modifications when mapping_stable_writes returns
> true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
> is fine.
But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for
write to finish.
Wouldn't that cause two mirrors to differ from each other due to timing
difference?
> Direct I/O is broken, and at least for XFS I have patches
> to force DONTCACHE instead of DIRECT I/O by default in that case, but
> allowing for an opt-out for known applications (e.g. file or storage
> servers).
>
> I'll need to rebase them, but I plan to send them out soon together
> with other T10 PI enabling patches. Sorry, juggling a few too many
> things at the moment.
>
>> But still, such performance drop can be very obvious, and performance
>> oriented users (who are very happy running various benchmark tools) are
>> going to notice or even complain.
>
> I've unfortunately seen much bigger performance drops with direct I/O and
> PI on fast SSDs, but we still should be safe by default.
>
>> Another question is, should we push this behavior to iomap layer so that other
>> fses can also benefit from it?
>
> The right place is above iomap to pick the buffered I/O path instead.
But falling back to buffered IO performance is so miserable that wiped
out almost one or more decades of storage performance improvement.
Thanks,
Qu
>
> The real question is if we can finally get a version of pin_user_pages
> that prevents user modifications entirely.
next prev parent reply other threads:[~2025-10-20 10:25 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-20 9:19 [PATCH] btrfs: never trust the bio from direct IO Qu Wenruo
2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was " Christoph Hellwig
2025-10-20 10:24 ` Qu Wenruo [this message]
2025-10-20 11:45 ` Christoph Hellwig
2025-10-20 11:16 ` Jan Kara
2025-10-20 11:44 ` Christoph Hellwig
2025-10-20 13:59 ` Jan Kara
2025-10-20 14:59 ` Matthew Wilcox
2025-10-20 15:58 ` Jan Kara
2025-10-20 17:55 ` John Hubbard
2025-10-21 8:27 ` Jan Kara
2025-10-21 16:56 ` John Hubbard
2025-10-20 19:00 ` David Hildenbrand
2025-10-21 7:49 ` Christoph Hellwig
2025-10-21 7:57 ` David Hildenbrand
2025-10-21 9:33 ` Jan Kara
2025-10-21 9:43 ` David Hildenbrand
2025-10-21 9:22 ` Jan Kara
2025-10-21 9:37 ` David Hildenbrand
2025-10-21 9:52 ` Jan Kara
2025-10-21 3:17 ` Qu Wenruo
2025-10-21 7:48 ` Christoph Hellwig
2025-10-21 8:15 ` Qu Wenruo
2025-10-21 11:30 ` Johannes Thumshirn
2025-10-22 2:27 ` Qu Wenruo
2025-10-22 5:04 ` hch
2025-10-22 6:17 ` Qu Wenruo
2025-10-22 6:24 ` hch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=acbb5680-ef7d-4908-94f4-b4edb8b3c48e@gmx.com \
--to=quwenruo.btrfs@gmx.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox