From: Christoph Hellwig <hch@lst.de>
To: Carlos Maiolino <cem@kernel.org>, Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org,
linux-block@vger.kernel.org
Subject: [PATCH 4/4] xfs: fallback to buffered I/O for direct I/O when stable writes are required
Date: Wed, 29 Oct 2025 08:15:05 +0100
Message-ID: <20251029071537.1127397-5-hch@lst.de>
In-Reply-To: <20251029071537.1127397-1-hch@lst.de>

Inodes can be marked as requiring stable writes, a setting usually
inherited from block devices that require stable writes. Block devices
require stable writes when the driver has to sample the data more than
once, e.g. to calculate a checksum or parity in one pass and then send
the data on to a hardware device in another. Modifying the data while
it is in flight can then lead to inconsistent checksums or parity.

For buffered I/O, the writeback code implements this by not allowing
modifications while folios are marked as under writeback. For direct
I/O, the kernel currently has no way to prevent the user application
from modifying the in-flight memory, so modifications can easily
corrupt data despite the block driver setting the stable write flag.
Even worse, corruption can happen on reads as well, where concurrent
modifications can cause checksum mismatches or failures to rebuild
parity. One application known to trigger this behavior is QEMU when
running Windows VMs, but there might be many others as well. xfstests
can also hit this behavior, not only in the test specifically crafted
for this (generic/761), but also in various other tests that mostly
stress races between different I/O modes, with generic/095 being the
most trivial and easiest one to hit.

Fix these races by making XFS fall back to uncached buffered I/O when
the block device requires stable writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_file.c | 54 +++++++++++++++++++++++++++++++++++++++--------
fs/xfs/xfs_iops.c | 6 ++++++
2 files changed, 51 insertions(+), 9 deletions(-)
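
Not part of the patch: a minimal userspace sketch of the double-sampling
problem described in the commit message. It is a deterministic toy model,
not kernel code. toy_checksum(), toy_submit(), toy_submit_stable() and
struct toy_device_block are hypothetical names made up for illustration,
and the rotating XOR checksum merely stands in for a real checksum or
parity calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Toy rotating XOR checksum; a stand-in for a real CRC or parity. */
static uint32_t toy_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = ((sum << 1) | (sum >> 31)) ^ buf[i];
	return sum;
}

/* What the toy "device" ends up storing. */
struct toy_device_block {
	uint8_t data[SECTOR_SIZE];
	uint32_t csum;
};

/*
 * Model of direct I/O: the caller's buffer is sampled twice. Pass 1
 * computes the checksum, pass 2 transfers the data. If modify_in_flight
 * is set, the "application" changes a byte between the two passes.
 * Returns 1 if the stored checksum matches the stored data.
 */
static int toy_submit(uint8_t *user_buf, int modify_in_flight,
		      struct toy_device_block *blk)
{
	blk->csum = toy_checksum(user_buf, SECTOR_SIZE);	/* pass 1 */
	if (modify_in_flight)
		user_buf[0] ^= 0xff;
	memcpy(blk->data, user_buf, SECTOR_SIZE);		/* pass 2 */
	return toy_checksum(blk->data, SECTOR_SIZE) == blk->csum;
}

/*
 * Model of the buffered fallback: the data is first copied into a
 * private buffer (standing in for the page cache), and both passes
 * sample that copy, so later changes to user_buf cannot affect the I/O.
 */
static int toy_submit_stable(uint8_t *user_buf, int modify_in_flight,
			     struct toy_device_block *blk)
{
	uint8_t bounce[SECTOR_SIZE];

	memcpy(bounce, user_buf, SECTOR_SIZE);
	blk->csum = toy_checksum(bounce, SECTOR_SIZE);		/* pass 1 */
	if (modify_in_flight)
		user_buf[0] ^= 0xff;
	memcpy(blk->data, bounce, SECTOR_SIZE);			/* pass 2 */
	return toy_checksum(blk->data, SECTOR_SIZE) == blk->csum;
}

/* Usage example: exercise all three cases on one buffer. */
static void toy_checksum_selfcheck(void)
{
	uint8_t buf[SECTOR_SIZE];
	struct toy_device_block blk;

	memset(buf, 0xab, sizeof(buf));
	assert(toy_submit(buf, 0, &blk));		/* no race: consistent */
	assert(!toy_submit(buf, 1, &blk));		/* race: mismatch */
	assert(toy_submit_stable(buf, 1, &blk));	/* copy: consistent */
}
```

The real writeback path does not copy at submission time. It blocks
modifications of folios under writeback instead. The copy here is only
the simplest way to show why data owned by the kernel is safe to sample
twice while application memory is not.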
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e09ae86e118e..0668af07966a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -230,6 +230,12 @@ xfs_file_dio_read(
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
ssize_t ret;
+ if (mapping_stable_writes(iocb->ki_filp->f_mapping)) {
+ xfs_info_once(ip->i_mount,
+ "falling back from direct to buffered I/O for read");
+ return -ENOTBLK;
+ }
+
trace_xfs_file_direct_read(iocb, to);
if (!iov_iter_count(to))
@@ -302,13 +308,22 @@ xfs_file_read_iter(
if (xfs_is_shutdown(mp))
return -EIO;
- if (IS_DAX(inode))
+ if (IS_DAX(inode)) {
ret = xfs_file_dax_read(iocb, to);
- else if (iocb->ki_flags & IOCB_DIRECT)
+ goto done;
+ }
+
+ if (iocb->ki_flags & IOCB_DIRECT) {
ret = xfs_file_dio_read(iocb, to);
- else
- ret = xfs_file_buffered_read(iocb, to);
+ if (ret != -ENOTBLK)
+ goto done;
+
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ iocb->ki_flags |= IOCB_DONTCACHE;
+ }
+ ret = xfs_file_buffered_read(iocb, to);
+done:
if (ret > 0)
XFS_STATS_ADD(mp, xs_read_bytes, ret);
return ret;
@@ -883,6 +898,7 @@ xfs_file_dio_write(
struct iov_iter *from)
{
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
+ struct xfs_mount *mp = ip->i_mount;
struct xfs_buftarg *target = xfs_inode_buftarg(ip);
size_t count = iov_iter_count(from);
@@ -890,15 +906,21 @@ xfs_file_dio_write(
if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
return -EINVAL;
+ if (mapping_stable_writes(iocb->ki_filp->f_mapping)) {
+ xfs_info_once(mp,
+ "falling back from direct to buffered I/O for write");
+ return -ENOTBLK;
+ }
+
/*
* For always COW inodes we also must check the alignment of each
* individual iovec segment, as they could end up with different
* I/Os due to the way bio_iov_iter_get_pages works, and we'd
* then overwrite an already written block.
*/
- if (((iocb->ki_pos | count) & ip->i_mount->m_blockmask) ||
+ if (((iocb->ki_pos | count) & mp->m_blockmask) ||
(xfs_is_always_cow_inode(ip) &&
- (iov_iter_alignment(from) & ip->i_mount->m_blockmask)))
+ (iov_iter_alignment(from) & mp->m_blockmask)))
return xfs_file_dio_write_unaligned(ip, iocb, from);
if (xfs_is_zoned_inode(ip))
return xfs_file_dio_write_zoned(ip, iocb, from);
@@ -1555,10 +1577,24 @@ xfs_file_open(
{
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
+
+ /*
+ * If the underlying device requires stable writes, we have to fall
+ * back to (uncached) buffered I/O for direct I/O reads and writes, as
+ * the kernel can't prevent applications from modifying the memory under
+ * I/O. We still claim to support O_DIRECT as we want opens for that to
+ * succeed and fall back.
+ *
+ * As atomic writes are only supported for direct I/O, they can't be
+ * supported either in this case.
+ */
file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
- file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
- if (xfs_get_atomic_write_min(XFS_I(inode)) > 0)
- file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+ if (!mapping_stable_writes(file->f_mapping)) {
+ file->f_mode |= FMODE_DIO_PARALLEL_WRITE;
+ if (xfs_get_atomic_write_min(XFS_I(inode)) > 0)
+ file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+ }
+
return generic_file_open(inode, file);
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index caff0125faea..bd49ac6b31de 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -672,6 +672,12 @@ xfs_report_atomic_write(
struct xfs_inode *ip,
struct kstat *stat)
{
+ /*
+ * If the stable writes flag is set, we have to fall back to buffered
+ * I/O, which doesn't support atomic writes.
+ */
+ if (mapping_stable_writes(VFS_I(ip)->i_mapping))
+ return;
generic_fill_statx_atomic_writes(stat,
xfs_get_atomic_write_min(ip),
xfs_get_atomic_write_max(ip),
--
2.47.3
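
Not part of the patch: the read-side fallback in the xfs_file_read_iter()
hunk above can be modelled in userspace roughly as follows. This is a
sketch under stated assumptions. The TOY_* flag values, struct toy_iocb
and the toy_* functions are hypothetical and exist only for this
illustration; the stable_writes field stands in for the result of
mapping_stable_writes(), and the DAX branch is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values for this model; not the kernel's IOCB_* bits. */
#define TOY_IOCB_DIRECT		(1u << 0)
#define TOY_IOCB_DONTCACHE	(1u << 1)

struct toy_iocb {
	unsigned int flags;
	bool stable_writes;	/* stands in for mapping_stable_writes() */
};

enum toy_path { TOY_PATH_DIRECT, TOY_PATH_BUFFERED };

/* Direct read attempt: refuse with -ENOTBLK when stable writes are needed. */
static int toy_dio_read(const struct toy_iocb *iocb)
{
	if (iocb->stable_writes)
		return -ENOTBLK;
	return 0;
}

/*
 * Dispatch a read. If the direct path returns -ENOTBLK, clear the
 * direct flag, request uncached behaviour, and take the buffered path.
 * Returns the path that ended up serving the read.
 */
static enum toy_path toy_read_iter(struct toy_iocb *iocb)
{
	if (iocb->flags & TOY_IOCB_DIRECT) {
		if (toy_dio_read(iocb) != -ENOTBLK)
			return TOY_PATH_DIRECT;

		iocb->flags &= ~TOY_IOCB_DIRECT;
		iocb->flags |= TOY_IOCB_DONTCACHE;
	}
	return TOY_PATH_BUFFERED;
}

/* Usage example: the same direct request with and without stable writes. */
static void toy_fallback_selfcheck(void)
{
	struct toy_iocb plain = { TOY_IOCB_DIRECT, false };
	struct toy_iocb stable = { TOY_IOCB_DIRECT, true };

	assert(toy_read_iter(&plain) == TOY_PATH_DIRECT);
	assert(plain.flags == TOY_IOCB_DIRECT);

	assert(toy_read_iter(&stable) == TOY_PATH_BUFFERED);
	assert(stable.flags == TOY_IOCB_DONTCACHE);
}
```

The model only captures the flag handling: a request that falls back is
no longer marked direct and carries the uncached hint into the buffered
path, while requests that never asked for direct I/O are left untouched.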