Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Christoph Hellwig <hch@infradead.org>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org, djwong@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	martin.petersen@oracle.com, jack@suse.com
Subject: Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Mon, 20 Oct 2025 04:45:35 -0700	[thread overview]
Message-ID: <aPYg31lAv8YKxJJJ@infradead.org> (raw)
In-Reply-To: <acbb5680-ef7d-4908-94f4-b4edb8b3c48e@gmx.com>

On Mon, Oct 20, 2025 at 08:54:49PM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/10/20 20:30, Christoph Hellwig 写道:
> > On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote:
> > > There is a bug report about that direct IO (and even concurrent buffered
> > > IO) can lead to different contents of md-raid.
> > 
> > What concurrent buffered I/O?
> 
> filemap_get_folio(), for address spaces with STABEL_WRITES, there will be a
> folio_wait_stable() call to wait for writeback.
> 
> But since almost no device (except md-raid56) set that flag, if a folio is
> still under writeback, XFS/EXT4 can still modify that folio (since it's not
> locked, just under writeback) for new incoming buffered writes.

All devices with protection information set it.  Now for raid1 without
the bitmap there is no expectation that anything is in sync.  Raid 1
with a bitmap should probably set the flag as well.

> 
> > 
> > > It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7
> > > ("btrfs: always fallback to buffered write if the inode requires
> > > checksum"), however we still leave a hole for nodatasum cases.
> > > 
> > > For nodatasum cases we still reuse the bio from direct IO, making it to
> > > cause the same problem for RAID1*/5/6 profiles, and results
> > > unreliable data contents read from disk, depending on the load balance.
> > > 
> > > Just do not trust any bio from direct IO, and never reuse those bios even
> > > for nodatasum cases. Instead alloc our own bio with newly allocated
> > > pages.
> > > 
> > > For direct read, submit that new bio, and at end io time copy the
> > > contents to the dio bio.
> > > For direct write, copy the contents from the dio bio, then submit the
> > > new one.
> > 
> > This basically reinvents IOCB_DONTCACHE I/O with duplicate code?
> 
> This reminds me the problem that btrfs can not handle DONTCACHE due to its
> async extents...
> 
> I definitely need to address it one day.
> 
> > 
> > > Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows
> > > modifying the page cache when it's still under writeback) can lead to
> > > raid mirror contents mismatch, the 23% performance drop should still be
> > > acceptable, and bcachefs is already doing this bouncing behavior.
> > 
> > XFS (and EXT4 as well, but I've not tested it) wait for I/O to
> > finish before allowing modifications when mapping_stable_writes returns
> > true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that
> > is fine.
> 
> But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for write
> to finish.
> 
> Wouldn't that cause two mirrors to differ from each other due to timing
> difference?
> 
> >  Direct I/O is broken, and at least for XFS I have patches
> > to force DONTCACHE instead of DIRECT I/O by default in that case, but
> > allowing for an opt-out for known applications (e.g. file or storage
> > servers).
> > 
> > I'll need to rebase them, but I plan to send them out soon together
> > with other T10 PI enabling patches.  Sorry, juggling a few too many
> > things at the moment.
> > 
> > > But still, such performance drop can be very obvious, and performance
> > > oriented users (who are very happy running various benchmark tools) are
> > > going to notice or even complain.
> > 
> > I've unfortunately seen much bigger performance drops with direct I/O and
> > PI on fast SSDs, but we still should be safe by default.
> > 
> > > Another question is, should we push this behavior to iomap layer so that other
> > > fses can also benefit from it?
> > 
> > The right place is above iomap to pick the buffered I/O path instead.
> 
> But falling back to buffered IO performance is so miserable that wiped out
> almost one or more decades of storage performance improvement.
> 
> Thanks,
> Qu
> 
> > 
> > The real question is if we can finally get a version of pin_user_pages
> > that prevents user modifications entirely.
---end quoted text---

next prev parent reply	other threads:[~2025-10-20 11:45 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
2025-10-20 10:24   ` Qu Wenruo
2025-10-20 11:45     ` Christoph Hellwig [this message]
2025-10-20 11:16   ` Jan Kara
2025-10-20 11:44     ` Christoph Hellwig
2025-10-20 13:59       ` Jan Kara
2025-10-20 14:59         ` Matthew Wilcox
2025-10-20 15:58           ` Jan Kara
2025-10-20 17:55             ` John Hubbard
2025-10-21  8:27               ` Jan Kara
2025-10-21 16:56                 ` John Hubbard
2025-10-20 19:00             ` David Hildenbrand
2025-10-21  7:49               ` Christoph Hellwig
2025-10-21  7:57                 ` David Hildenbrand
2025-10-21  9:33                   ` Jan Kara
2025-10-21  9:43                     ` David Hildenbrand
2025-10-21  9:22                 ` Jan Kara
2025-10-21  9:37                   ` David Hildenbrand
2025-10-21  9:52                     ` Jan Kara
2025-10-21  3:17   ` Qu Wenruo
2025-10-21  7:48     ` Christoph Hellwig
2025-10-21  8:15       ` Qu Wenruo
2025-10-21 11:30         ` Johannes Thumshirn
2025-10-22  2:27           ` Qu Wenruo
2025-10-22  5:04             ` hch
2025-10-22  6:17               ` Qu Wenruo
2025-10-22  6:24                 ` hch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPYg31lAv8YKxJJJ@infradead.org \
    --to=hch@infradead.org \
    --cc=djwong@kernel.org \
    --cc=jack@suse.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).