From: Christoph Hellwig <hch@lst.de>
To: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>, Keith Busch <kbusch@kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Carlos Maiolino <cem@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: Re: fall back from direct to buffered I/O when stable writes are required
Date: Mon, 3 Nov 2025 13:21:11 +0100
Message-ID: <20251103122111.GA17600@lst.de>
In-Reply-To: <kpk2od2fuqofdoneqse2l3gvn7wbqx3y4vckmnvl6gc2jcaw4m@hsxqmxshckpj>

On Mon, Nov 03, 2025 at 12:14:06PM +0100, Jan Kara wrote:
> > Yes, it's pretty clear that the result is non-deterministic in what you
> > get.  But that still does not result in corruption, because there is a
> > clear boundary (either the sector size, or for NVMe optionally even a
> > larger boundary) that designates the atomicity boundary.
> 
> Well, is that boundary really guaranteed? I mean if you modify the buffer
> under IO, couldn't it happen that the DMA sees part of the sector new and
> part of the sector old? I agree the window is small but I think the real
> guarantee is architecture-dependent and likely at cacheline granularity or
> something like that.

If you actually modify it: yes.  But I think Keith's argument was just
about regular racing reads vs writes.
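
For the record, "actually modifying it" boils down to a pattern like the
hypothetical userspace reproducer below; the file name and the assumed
4k logical block size are made up, and nothing here is taken from qemu
or any real application:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SZ	4096		/* assumes a 4k logical block size */

static char *buf;
static int fd;

static void *writer(void *arg)
{
	/* DMA from buf may still be in progress when main() scribbles */
	pwrite(fd, buf, BUF_SZ, 0);
	return NULL;
}

int main(void)
{
	pthread_t t;

	fd = open("scratchfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || posix_memalign((void **)&buf, BUF_SZ, BUF_SZ))
		return 1;
	memset(buf, 'a', BUF_SZ);

	pthread_create(&t, NULL, writer, NULL);
	memset(buf, 'b', BUF_SZ);	/* modify the buffer under I/O */
	pthread_join(t, NULL);
	return 0;
}

What ends up on disk is then some mix of 'a' and 'b' bytes, which is
exactly the non-determinism discussed above.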

> > pretty clearly not an application bug.  It's also pretty clear that
> > at least some applications (qemu and other VMs) have been doing this
> > for 20+ years.
> 
> Well, I'm mostly of the opinion that modifying IO buffers in flight is an
> application bug (even if most current storage stacks tolerate it), but on
> the other hand returning IO errors later or even corrupting RAID5 on resync
> is, in my opinion, not sane error handling on the kernel side either, so I
> think we need to do better.

Yes.  Also if you look at the man page, which is about as official as it
gets for the semantics, you can't find anything requiring the buffers to
be stable (but all kinds of other odd rants).

> I also think the performance cost of the unconditional bounce buffering is
> so heavy that it's just a polite way of pushing the app to do proper IO
> buffer synchronization itself (assuming it cares about IO performance, but
> given it bothered with direct IO it presumably does).
>
> So the question is how to get out of this mess with the least disruption
> possible, which IMO also means providing an easy way for well-behaved apps
> to avoid the overhead.

Remember that the cases where this matters are checksumming and parity,
where we touch all the cache lines anyway and consume the DRAM bandwidth,
although bounce buffering upgrades this from pure reads to reads plus
writes.  So the overhead is heavy, but if we handle it the right way, that
is, doing the checksum/parity calculation while the cache line is still
hot, it should not be prohibitive.  And getting this right in the direct
I/O code means that the low-level code could stop bounce buffering for
buffered I/O, providing a major speedup there.
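
To make that concrete, a minimal userspace sketch of the fused
copy-plus-checksum loop; the bitwise crc32c is just a stand-in for
whatever (accelerated) checksum the consumer actually uses, and the
64-byte chunk size is an assumed cacheline size:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bitwise crc32c, stand-in for a real accelerated implementation */
static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		int i;

		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82f63b78 & -(crc & 1));
	}
	return crc;
}

/*
 * Copy one cacheline-sized chunk at a time, then checksum the *copy*
 * while it is still hot.  Because the checksum is computed over the
 * snapshot rather than the source, the application can scribble over
 * its buffer without the checksum and the written data ever
 * disagreeing, and the data is only pulled into the cache once.
 */
static uint32_t bounce_and_checksum(void *dst, const void *src, size_t len)
{
	uint32_t crc = ~0U;
	size_t off;

	for (off = 0; off < len; off += 64) {
		size_t n = len - off < 64 ? len - off : 64;

		memcpy((char *)dst + off, (const char *)src + off, n);
		crc = crc32c_update(crc, (char *)dst + off, n);
	}
	return ~crc;
}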

I've been thinking a bit more on how to get the copy closer to the
checksumming, at least for PI, and to avoid the extra copies for RAID5
buffered I/O.  Maybe a better way is to mark a bio as trusted/untrusted
so that the checksumming/raid code can bounce buffer it, and I'm starting
to like that idea.  A complication is that PI could relax that requirement
if we support PI passthrough from userspace (currently only for block
devices, but I plan to add file system support), where the device checks
it, but we can't do that for parity RAID.
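
In toy form the idea would look something like the snippet below.  Every
name here is invented and the struct is a userspace stand-in, not the
real struct bio; the real interface would presumably be a bio flag plus
bounce logic in the PI/raid code:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* toy stand-in for the real struct bio, flag name made up */
struct bio {
	void	*data;
	size_t	len;
	bool	untrusted;	/* pages may change while under I/O */
};

/*
 * A checksumming or parity consumer would call this before looking at
 * the data: trusted bios (kernel-owned pages, or PI passed through
 * from userspace and verified by the device) are used as-is, while
 * untrusted ones get a private snapshot first.
 */
static int bio_stabilize(struct bio *bio)
{
	void *copy;

	if (!bio->untrusted)
		return 0;

	copy = malloc(bio->len);
	if (!copy)
		return -1;
	memcpy(copy, bio->data, bio->len);
	bio->data = copy;	/* consumer now works on a stable copy */
	bio->untrusted = false;
	return 0;
}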


Thread overview: 61+ messages
2025-10-29  7:15 fall back from direct to buffered I/O when stable writes are required Christoph Hellwig
2025-10-29  7:15 ` [PATCH 1/4] fs: replace FOP_DIO_PARALLEL_WRITE with a fmode bits Christoph Hellwig
2025-10-29 16:01   ` Darrick J. Wong
2025-11-04  7:00   ` Nirjhar Roy (IBM)
2025-11-05 14:04     ` Christoph Hellwig
2025-11-11  9:44   ` Christian Brauner
2025-10-29  7:15 ` [PATCH 2/4] fs: return writeback errors for IOCB_DONTCACHE in generic_write_sync Christoph Hellwig
2025-10-29 16:01   ` Darrick J. Wong
2025-10-29 16:37     ` Christoph Hellwig
2025-10-29 18:12       ` Darrick J. Wong
2025-10-30  5:59         ` Christoph Hellwig
2025-11-04 12:04       ` Nirjhar Roy (IBM)
2025-11-04 15:53         ` Christoph Hellwig
2025-10-29  7:15 ` [PATCH 3/4] xfs: use IOCB_DONTCACHE when falling back to buffered writes Christoph Hellwig
2025-10-29 15:57   ` Darrick J. Wong
2025-11-04 12:33   ` Nirjhar Roy (IBM)
2025-11-04 15:52     ` Christoph Hellwig
2025-10-29  7:15 ` [PATCH 4/4] xfs: fallback to buffered I/O for direct I/O when stable writes are required Christoph Hellwig
2025-10-29 15:53   ` Darrick J. Wong
2025-10-29 16:35     ` Christoph Hellwig
2025-10-29 21:23       ` Qu Wenruo
2025-10-30  5:58         ` Christoph Hellwig
2025-10-30  6:37           ` Qu Wenruo
2025-10-30  6:49             ` Christoph Hellwig
2025-10-30  6:53               ` Qu Wenruo
2025-10-30  6:55                 ` Christoph Hellwig
2025-10-30  7:14                   ` Qu Wenruo
2025-10-30  7:17                     ` Christoph Hellwig
2025-11-10 13:38   ` Nirjhar Roy (IBM)
2025-11-10 13:59     ` Christoph Hellwig
2025-11-12  7:13       ` Nirjhar Roy (IBM)
2025-10-29 15:58 ` fall back from direct to buffered " Bart Van Assche
2025-10-29 16:14   ` Darrick J. Wong
2025-10-29 16:33   ` Christoph Hellwig
2025-10-30 11:20 ` Dave Chinner
2025-10-30 12:00   ` Geoff Back
2025-10-30 12:54     ` Jan Kara
2025-10-30 14:35     ` Christoph Hellwig
2025-10-30 22:02     ` Dave Chinner
2025-10-30 14:33   ` Christoph Hellwig
2025-10-30 23:18     ` Dave Chinner
2025-10-31 13:00       ` Christoph Hellwig
2025-10-31 15:57         ` Keith Busch
2025-10-31 16:47           ` Christoph Hellwig
2025-11-03 11:14             ` Jan Kara
2025-11-03 12:21               ` Christoph Hellwig [this message]
2025-11-03 22:47                 ` Keith Busch
2025-11-04 23:38                 ` Darrick J. Wong
2025-11-05 14:11                   ` Christoph Hellwig
2025-11-05 21:44                     ` Darrick J. Wong
2025-11-06  9:50                       ` Johannes Thumshirn
2025-11-06 12:49                         ` hch
2025-11-12 14:18                           ` Ming Lei
2025-11-12 14:38                             ` hch
2025-11-13 17:39                 ` Kevin Wolf
2025-11-14  5:39                   ` Christoph Hellwig
2025-11-14  9:29                     ` Kevin Wolf
2025-11-14 12:01                       ` Christoph Hellwig
2025-11-14 12:31                         ` Kevin Wolf
2025-11-14 15:36                           ` Christoph Hellwig
2025-11-14 16:55                             ` Kevin Wolf
