From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, linux-mm@kvack.org,
martin.petersen@oracle.com, jack@suse.com
Subject: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Mon, 20 Oct 2025 03:00:43 -0700 [thread overview]
Message-ID: <aPYIS5rDfXhNNDHP@infradead.org> (raw)
In-Reply-To: <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> There is a bug report about that direct IO (and even concurrent buffered
> IO) can lead to different contents of md-raid.
What concurrent buffered I/O?
> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
> ("btrfs: always fallback to buffered write if the inode requires
> checksum"), however we still leave a hole for nodatasum cases.
>
> For nodatasum cases we still reuse the bio from direct IO, making it to
> cause the same problem for RAID1*/5/6 profiles, and results
> unreliable data contents read from disk, depending on the load balance.
>
> Just do not trust any bio from direct IO, and never reuse those bios even
> for nodatasum cases. Instead alloc our own bio with newly allocated
> pages.
>
> For direct read, submit that new bio, and at end io time copy the
> contents to the dio bio.
> For direct write, copy the contents from the dio bio, then submit the
> new one.
This basically reinvents IOCB_DONTCACHE I/O with duplicate code?
> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
> modifying the page cache when it's still under writeback) can lead to
> raid mirror contents mismatch, the 23% performance drop should still be
> acceptable, and bcachefs is already doing this bouncing behavior.
XFS (and EXT4 as well, but I've not tested it) wait for I/O to
finish before allowing modifications when mapping_stable_writes returns
true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
is fine. Direct I/O is broken, and at least for XFS I have patches
to force DONTCACHE instead of DIRECT I/O by default in that case, but
allowing for an opt-out for known applications (e.g. file or storage
servers).
I'll need to rebase them, but I plan to send them out soon together
with other T10 PI enabling patches. Sorry, juggling a few too many
things at the moment.
> But still, such performance drop can be very obvious, and performance
> oriented users (who are very happy running various benchmark tools) are
> going to notice or even complain.
I've unfortunately seen much bigger performance drops with direct I/O and
PI on fast SSDs, but we still should be safe by default.
> Another question is, should we push this behavior to iomap layer so that other
> fses can also benefit from it?
The right place is above iomap to pick the buffered I/O path instead.
The real question is if we can finally get a version of pin_user_pages
that prevents user modifications entirely.
next parent reply other threads:[~2025-10-20 10:00 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
2025-10-20 10:00 ` Christoph Hellwig [this message]
2025-10-20 10:24 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Qu Wenruo
2025-10-20 11:45 ` Christoph Hellwig
2025-10-20 11:16 ` Jan Kara
2025-10-20 11:44 ` Christoph Hellwig
2025-10-20 13:59 ` Jan Kara
2025-10-20 14:59 ` Matthew Wilcox
2025-10-20 15:58 ` Jan Kara
2025-10-20 17:55 ` John Hubbard
2025-10-21 8:27 ` Jan Kara
2025-10-21 16:56 ` John Hubbard
2025-10-20 19:00 ` David Hildenbrand
2025-10-21 7:49 ` Christoph Hellwig
2025-10-21 7:57 ` David Hildenbrand
2025-10-21 9:33 ` Jan Kara
2025-10-21 9:43 ` David Hildenbrand
2025-10-21 9:22 ` Jan Kara
2025-10-21 9:37 ` David Hildenbrand
2025-10-21 9:52 ` Jan Kara
2025-10-21 3:17 ` Qu Wenruo
2025-10-21 7:48 ` Christoph Hellwig
2025-10-21 8:15 ` Qu Wenruo
2025-10-21 11:30 ` Johannes Thumshirn
2025-10-22 2:27 ` Qu Wenruo
2025-10-22 5:04 ` hch
2025-10-22 6:17 ` Qu Wenruo
2025-10-22 6:24 ` hch
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPYIS5rDfXhNNDHP@infradead.org \
--to=hch@infradead.org \
--cc=djwong@kernel.org \
--cc=jack@suse.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).