From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Carlos Maiolino <cem@kernel.org>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org,
linux-block@vger.kernel.org
Subject: Re: [PATCH 4/4] xfs: fallback to buffered I/O for direct I/O when stable writes are required
Date: Wed, 29 Oct 2025 08:53:06 -0700
Message-ID: <20251029155306.GC3356773@frogsfrogsfrogs>
In-Reply-To: <20251029071537.1127397-5-hch@lst.de>

On Wed, Oct 29, 2025 at 08:15:05AM +0100, Christoph Hellwig wrote:
> Inodes can be marked as requiring stable writes, which is a setting
> usually inherited from block devices that require stable writes. Block
> devices require stable writes when the drivers have to sample the data
> more than once, e.g. to calculate a checksum or parity in one pass, and
> then send the data on to a hardware device, and modifying the data
> in-flight can lead to inconsistent checksums or parity.
>
> For buffered I/O, the writeback code implements this by not allowing
> modifications while folios are marked as under writeback, but for
> direct I/O, the kernel currently does not have any way to prevent the
> user application from modifying the in-flight memory, so modifications
> can easily corrupt data despite the block driver setting the stable
> write flag. Even worse, corruption can happen on reads as well,
> where concurrent modifications can cause checksum mismatches, or
> failures to rebuild parity. One application known to trigger this
> behavior is Qemu when running Windows VMs, but there might be many
> others as well. xfstests can also hit this behavior, not only in the
> specifically crafted patch for this (generic/761), but also in
> various other tests that mostly stress races between different I/O
> modes, with generic/095 being the most trivial and easy to hit
> one.
>
> Fix XFS to fall back to uncached buffered I/O when the block device
> requires stable writes to fix these races.
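
As an aside, the failure mode described above is easy to model in
userspace. The following is only a minimal sketch, assuming a toy
rolling checksum in place of whatever checksum or parity a real driver
computes; toy_checksum() and toy_submit() are made-up names, not kernel
interfaces. It shows a "driver" that samples the buffer twice and a
caller that scribbles on it in between:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy rolling checksum; a stand-in for a real checksum or parity calculation. */
static uint32_t toy_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + buf[i];
	return sum;
}

/*
 * Hypothetical model of a driver that samples the data twice: once to
 * compute the checksum and once to transfer it.  modify_between simulates
 * the application changing the in-flight memory between the two samples.
 * Returns true if the "device" sees data matching the checksum.
 */
static bool toy_submit(uint8_t *buf, size_t len, bool modify_between,
		       uint8_t *device)
{
	uint32_t csum = toy_checksum(buf, len);	/* first sample */

	assert(buf && device);
	if (modify_between && len > 0)
		buf[0] ^= 0xff;			/* application scribbles */
	memcpy(device, buf, len);		/* second sample */
	return toy_checksum(device, len) == csum;
}
```

With an untouched buffer toy_submit() reports a match; with
modify_between set it reports a mismatch, which is the inconsistency
that stable writes are meant to rule out.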
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/xfs_file.c | 54 +++++++++++++++++++++++++++++++++++++++--------
> fs/xfs/xfs_iops.c | 6 ++++++
> 2 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e09ae86e118e..0668af07966a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -230,6 +230,12 @@ xfs_file_dio_read(
> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> ssize_t ret;
>
> + if (mapping_stable_writes(iocb->ki_filp->f_mapping)) {
> + xfs_info_once(ip->i_mount,
> + "falling back from direct to buffered I/O for read");
> + return -ENOTBLK;
> + }
> +
> trace_xfs_file_direct_read(iocb, to);
>
> if (!iov_iter_count(to))
> @@ -302,13 +308,22 @@ xfs_file_read_iter(
> if (xfs_is_shutdown(mp))
> return -EIO;
>
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
> ret = xfs_file_dax_read(iocb, to);
> - else if (iocb->ki_flags & IOCB_DIRECT)
> + goto done;
> + }
> +
> + if (iocb->ki_flags & IOCB_DIRECT) {
> ret = xfs_file_dio_read(iocb, to);
> - else
> - ret = xfs_file_buffered_read(iocb, to);
> + if (ret != -ENOTBLK)
> + goto done;
> +
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + iocb->ki_flags |= IOCB_DONTCACHE;
> + }
>
> + ret = xfs_file_buffered_read(iocb, to);
> +done:
> if (ret > 0)
> XFS_STATS_ADD(mp, xs_read_bytes, ret);
> return ret;
> @@ -883,6 +898,7 @@ xfs_file_dio_write(
> struct iov_iter *from)
> {
> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> + struct xfs_mount *mp = ip->i_mount;
> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> size_t count = iov_iter_count(from);
>
> @@ -890,15 +906,21 @@ xfs_file_dio_write(
> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> return -EINVAL;
>
> + if (mapping_stable_writes(iocb->ki_filp->f_mapping)) {
> + xfs_info_once(mp,
> + "falling back from direct to buffered I/O for write");
> + return -ENOTBLK;
> + }

/me wonders if the other filesystems will have to implement this same
fallback and hence this should be a common helper ala
dio_warn_stale_pagecache? But we'll get there when we get there.
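
To make the shape of that fallback concrete, here is a userspace model
of the read-side flow from this patch. It is a sketch only: the MODEL_*
flags, struct model_iocb and the model_*() functions are hypothetical
stand-ins for the kernel's IOCB_* flags, struct kiocb and the XFS read
paths, and the flag values are arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins for IOCB_DIRECT / IOCB_DONTCACHE; values are arbitrary. */
#define MODEL_IOCB_DIRECT	(1u << 0)
#define MODEL_IOCB_DONTCACHE	(1u << 1)

struct model_iocb {
	unsigned int flags;
	bool stable_writes;	/* models mapping_stable_writes() */
};

/* Direct path: refuses with -ENOTBLK when stable writes are required. */
static ssize_t model_dio_read(const struct model_iocb *iocb, size_t count)
{
	if (iocb->stable_writes)
		return -ENOTBLK;
	return (ssize_t)count;
}

/* Buffered path: must never be entered with the direct flag still set. */
static ssize_t model_buffered_read(const struct model_iocb *iocb, size_t count)
{
	assert(!(iocb->flags & MODEL_IOCB_DIRECT));
	return (ssize_t)count;
}

/* Try direct I/O first; on -ENOTBLK retry as uncached buffered I/O. */
static ssize_t model_read_iter(struct model_iocb *iocb, size_t count)
{
	if (iocb->flags & MODEL_IOCB_DIRECT) {
		ssize_t ret = model_dio_read(iocb, count);

		if (ret != -ENOTBLK)
			return ret;
		iocb->flags &= ~MODEL_IOCB_DIRECT;
		iocb->flags |= MODEL_IOCB_DONTCACHE;
	}
	return model_buffered_read(iocb, count);
}
```

The point of the model is that the caller never sees -ENOTBLK: the
request either completes as direct I/O or is silently retried with the
direct flag cleared and the don't-cache flag set.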
> +
> /*
> * For always COW inodes we also must check the alignment of each
> * individual iovec segment, as they could end up with different
> * I/Os due to the way bio_iov_iter_get_pages works, and we'd
> * then overwrite an already written block.
> */
> - if (((iocb->ki_pos | count) & ip->i_mount->m_blockmask) ||
> + if (((iocb->ki_pos | count) & mp->m_blockmask) ||
> (xfs_is_always_cow_inode(ip) &&
> - (iov_iter_alignment(from) & ip->i_mount->m_blockmask)))
> + (iov_iter_alignment(from) & mp->m_blockmask)))
> return xfs_file_dio_write_unaligned(ip, iocb, from);
> if (xfs_is_zoned_inode(ip))
> return xfs_file_dio_write_zoned(ip, iocb, from);
> @@ -1555,10 +1577,24 @@ xfs_file_open(
> {
> if (xfs_is_shutdown(XFS_M(inode->i_sb)))
> return -EIO;
> +
> + /*
> + * If the underlying device requires stable writes, we have to fall
> + * back to (uncached) buffered I/O for direct I/O reads and writes, as
> + * the kernel can't prevent applications from modifying the memory under
> + * I/O. We still claim to support O_DIRECT as we want opens for that to
> + * succeed and fall back.
> + *
> + * As atomic writes are only supported for direct I/O, they can't be
> + * supported either in this case.
> + */
> file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
> - file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
> - if (xfs_get_atomic_write_min(XFS_I(inode)) > 0)
> - file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
> + if (!mapping_stable_writes(file->f_mapping)) {
> + file->f_mode |= FMODE_DIO_PARALLEL_WRITE;

Hrm. So parallel directio writes are disabled for writes to files on
stable_pages devices because we have to fall back to buffered writes.
Those serialize on i_rwsem so that's why we don't set
FMODE_DIO_PARALLEL_WRITE, correct? There's not some more subtle reason
for not supporting it, right?

If the answers are {yes, yes} then I've understood this well enough for
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> + if (xfs_get_atomic_write_min(XFS_I(inode)) > 0)
> + file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
> + }
> +
> return generic_file_open(inode, file);
> }
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index caff0125faea..bd49ac6b31de 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -672,6 +672,12 @@ xfs_report_atomic_write(
> struct xfs_inode *ip,
> struct kstat *stat)
> {
> + /*
> + * If the stable writes flag is set, we have to fall back to buffered
> + * I/O, which doesn't support atomic writes.
> + */
> + if (mapping_stable_writes(VFS_I(ip)->i_mapping))
> + return;
> generic_fill_statx_atomic_writes(stat,
> xfs_get_atomic_write_min(ip),
> xfs_get_atomic_write_max(ip),
> --
> 2.47.3
>
>
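
For reference, the open-time decision in the xfs_file_open hunk quoted
above boils down to a small truth table. The sketch below restates it
in userspace; the MODEL_FMODE_* names and model_open_fmode() are
hypothetical stand-ins for the kernel's FMODE_* bits and the logic in
xfs_file_open(), with arbitrary bit values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the FMODE_* bits; values are arbitrary. */
#define MODEL_FMODE_NOWAIT		(1u << 0)
#define MODEL_FMODE_CAN_ODIRECT		(1u << 1)
#define MODEL_FMODE_DIO_PARALLEL_WRITE	(1u << 2)
#define MODEL_FMODE_CAN_ATOMIC_WRITE	(1u << 3)

/*
 * O_DIRECT opens are always allowed so that they can fall back later.
 * Parallel direct writes and atomic writes are only advertised when the
 * mapping does not require stable writes.
 */
static unsigned int model_open_fmode(bool stable_writes,
				     unsigned int atomic_write_min)
{
	unsigned int mode = MODEL_FMODE_NOWAIT | MODEL_FMODE_CAN_ODIRECT;

	if (!stable_writes) {
		mode |= MODEL_FMODE_DIO_PARALLEL_WRITE;
		if (atomic_write_min > 0)
			mode |= MODEL_FMODE_CAN_ATOMIC_WRITE;
	}
	return mode;
}
```

In this model a stable-writes mapping always yields only the NOWAIT and
CAN_ODIRECT bits, regardless of the atomic write minimum.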