O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO


From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com
Subject: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Mon, 20 Oct 2025 03:00:43 -0700
Message-ID: <aPYIS5rDfXhNNDHP@infradead.org>
In-Reply-To: <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>

On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> There is a bug report about that direct IO (and even concurrent buffered
> IO) can lead to different contents of md-raid.

What concurrent buffered I/O?

> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
> ("btrfs: always fallback to buffered write if the inode requires
> checksum"), however we still leave a hole for nodatasum cases.
> 
> For nodatasum cases we still reuse the bio from direct IO, making it to
> cause the same problem for RAID1*/5/6 profiles, and results
> unreliable data contents read from disk, depending on the load balance.
> 
> Just do not trust any bio from direct IO, and never reuse those bios even
> for nodatasum cases. Instead alloc our own bio with newly allocated
> pages.
> 
> For direct read, submit that new bio, and at end io time copy the
> contents to the dio bio.
> For direct write, copy the contents from the dio bio, then submit the
> new one.

This basically reinvents IOCB_DONTCACHE I/O with duplicate code?

> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
> modifying the page cache when it's still under writeback) can lead to
> raid mirror contents mismatch, the 23% performance drop should still be
> acceptable, and bcachefs is already doing this bouncing behavior.

XFS (and EXT4 as well, but I've not tested it) waits for I/O to
finish before allowing modifications when mapping_stable_writes() returns
true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so
buffered I/O is fine.  Direct I/O is broken, and at least for XFS I have
patches to force DONTCACHE instead of DIRECT I/O by default in that case,
while allowing an opt-out for known applications (e.g. file or storage
servers).
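
To make the failure mode concrete, here is a toy userspace model, not
kernel code.  It assumes a two-leg mirror where each leg copies from the
source buffer at a different time, and a store to the user buffer that
lands between the two copies.  The names Raid1, zero_copy_write and
bounced_write are hypothetical and exist only for this illustration.

```python
"""Toy model: RAID1 mirror divergence under zero-copy direct I/O."""


class Raid1:
    """Two mirror legs; each leg copies from the source at its own time."""

    def __init__(self, size):
        self.legs = [bytearray(size), bytearray(size)]

    def write(self, src, between_legs=None):
        # Leg 0 reads the source buffer first.
        self.legs[0][:] = src
        # Assumption: something else may run before leg 1 reads it.
        if between_legs is not None:
            between_legs()
        self.legs[1][:] = src

    def in_sync(self):
        return self.legs[0] == self.legs[1]


def zero_copy_write(raid, user_buf, racing_store):
    # Both legs read straight from user memory, as with a reused dio bio.
    raid.write(user_buf, between_legs=racing_store)


def bounced_write(raid, user_buf, racing_store):
    # Snapshot the user buffer first; the legs only see the private copy.
    bounce = bytes(user_buf)
    raid.write(bounce, between_legs=racing_store)


if __name__ == "__main__":
    for submit in (zero_copy_write, bounced_write):
        buf = bytearray(b"AAAA")
        raid = Raid1(len(buf))

        def scribble(buf=buf):
            # The application modifies its buffer while the write is in flight.
            buf[0:4] = b"BBBB"

        submit(raid, buf, scribble)
        print(submit.__name__, "mirrors in sync:", raid.in_sync())
```

With the zero-copy path the legs end up holding different data; with the
bounce copy they match, at the cost of the extra memcpy that shows up as
the performance drop discussed in this thread.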

I'll need to rebase them, but I plan to send them out soon together
with other T10 PI enabling patches.  Sorry, juggling a few too many
things at the moment.

> But still, such performance drop can be very obvious, and performance
> oriented users (who are very happy running various benchmark tools) are
> going to notice or even complain.

I've unfortunately seen much bigger performance drops with direct I/O and
PI on fast SSDs, but we still should be safe by default.

> Another question is, should we push this behavior to iomap layer so that other
> fses can also benefit from it?

The right place for this is above iomap, picking the buffered I/O path
instead.

The real question is whether we can finally get a version of
pin_user_pages() that prevents user modifications entirely.
