* [PATCH v9 18/22] xfs: add fs-verity ioctls
From: Andrey Albershteyn @ 2026-04-28 8:33 UTC
To: linux-xfs, fsverity, linux-fsdevel, ebiggers
Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260428083332.768693-1-aalbersh@kernel.org>
Add fs-verity ioctls to enable fs-verity on a file, dump its metadata
(descriptor and Merkle tree pages), and obtain the file's digest.
[djwong: remove unnecessary casting]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
fs/xfs/xfs_ioctl.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index facffdc8dca8..e633d56cad00 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -46,6 +46,7 @@
#include <linux/mount.h>
#include <linux/fileattr.h>
+#include <linux/fsverity.h>
/* Return 0 on success or positive error */
int
@@ -1426,6 +1427,19 @@ xfs_file_ioctl(
case XFS_IOC_VERIFY_MEDIA:
return xfs_ioc_verify_media(filp, arg);
+ case FS_IOC_ENABLE_VERITY:
+ if (!xfs_has_verity(mp))
+ return -EOPNOTSUPP;
+ return fsverity_ioctl_enable(filp, arg);
+ case FS_IOC_MEASURE_VERITY:
+ if (!xfs_has_verity(mp))
+ return -EOPNOTSUPP;
+ return fsverity_ioctl_measure(filp, arg);
+ case FS_IOC_READ_VERITY_METADATA:
+ if (!xfs_has_verity(mp))
+ return -EOPNOTSUPP;
+ return fsverity_ioctl_read_metadata(filp, arg);
+
default:
return -ENOTTY;
}
--
2.51.2
* [PATCH v9 19/22] xfs: advertise fs-verity being available on filesystem
From: Andrey Albershteyn @ 2026-04-28 8:33 UTC
To: linux-xfs, fsverity, linux-fsdevel, ebiggers
Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
linux-unionfs, Andrey Albershteyn, Andrey Albershteyn
In-Reply-To: <20260428083332.768693-1-aalbersh@kernel.org>
From: "Darrick J. Wong" <djwong@kernel.org>
Advertise that this filesystem supports fsverity.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 1 +
fs/xfs/libxfs/xfs_sb.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index d165de607d17..ebf17a0b0722 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -250,6 +250,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_PARENT (1 << 25) /* linux parent pointers */
#define XFS_FSOP_GEOM_FLAGS_METADIR (1 << 26) /* metadata directories */
#define XFS_FSOP_GEOM_FLAGS_ZONED (1 << 27) /* zoned rt device */
+#define XFS_FSOP_GEOM_FLAGS_VERITY (1 << 28) /* fs-verity */
/*
* Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a15510ebd2f1..222bbe5559df 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1590,6 +1590,8 @@ xfs_fs_geometry(
geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
if (xfs_has_zoned(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_ZONED;
+ if (xfs_has_verity(mp))
+ geo->flags |= XFS_FSOP_GEOM_FLAGS_VERITY;
geo->rtsectsize = sbp->sb_blocksize;
geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
--
2.51.2
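For readers who consume the geometry flags from userspace, the check the new bit enables can be sketched in shell. This is only an illustration: the bit position (1 << 28) is taken from the patch, but the function name geom_has_verity and the sample flag words are hypothetical and do not come from a real filesystem.

```shell
# Bit position copied from the patch: XFS_FSOP_GEOM_FLAGS_VERITY is (1 << 28).
XFS_FSOP_GEOM_FLAGS_VERITY=$((1 << 28))

# geom_has_verity FLAGS: succeed if the verity bit is set in FLAGS.
# FLAGS is a hypothetical numeric copy of the geometry flags word.
geom_has_verity()
{
	local flags="$1"

	(( (flags & XFS_FSOP_GEOM_FLAGS_VERITY) != 0 ))
}

# Made-up sample values: METADIR (bit 26) with and without the verity bit.
with_verity=$(( (1 << 26) | (1 << 28) ))
without_verity=$(( 1 << 26 ))

for flags in $with_verity $without_verity; do
	if geom_has_verity "$flags"; then
		printf '0x%x: fs-verity advertised\n' "$flags"
	else
		printf '0x%x: fs-verity not advertised\n' "$flags"
	fi
done
```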
* [PATCH v9 20/22] xfs: check and repair the verity inode flag state
From: Andrey Albershteyn @ 2026-04-28 8:33 UTC
To: linux-xfs, fsverity, linux-fsdevel, ebiggers
Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
linux-unionfs, Andrey Albershteyn
In-Reply-To: <20260428083332.768693-1-aalbersh@kernel.org>
From: "Darrick J. Wong" <djwong@kernel.org>
If an inode has the incore verity iflag set, make sure that we can
actually activate fsverity on that inode. If activation fails due to
an fsverity metadata validation error, clear the flag. The usage model
for fsverity requires any program that cares about verity state to
call statx/getflags after opening the file to check that the flag is
set, so clearing the flag will not compromise that model.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
fs/xfs/scrub/attr.c | 7 +++++
fs/xfs/scrub/common.c | 53 +++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/common.h | 2 ++
fs/xfs/scrub/inode.c | 7 +++++
fs/xfs/scrub/inode_repair.c | 36 +++++++++++++++++++++++++
5 files changed, 105 insertions(+)
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 390ac2e11ee0..daf7962c2374 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -649,6 +649,13 @@ xchk_xattr(
if (!xfs_inode_hasattr(sc->ip))
return -ENOENT;
+ /*
+ * If this is a verity file that won't activate, we cannot check the
+ * merkle tree geometry.
+ */
+ if (xchk_inode_verity_broken(sc->ip))
+ xchk_set_incomplete(sc);
+
/* Allocate memory for xattr checking. */
error = xchk_setup_xattr_buf(sc, 0);
if (error == -ENOMEM)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 20e63069088b..6cc6bea9c554 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -45,6 +45,8 @@
#include "scrub/health.h"
#include "scrub/tempfile.h"
+#include <linux/fsverity.h>
+
/* Common code for the metadata scrubbers. */
/*
@@ -1743,3 +1745,54 @@ xchk_inode_count_blocks(
return xfs_bmap_count_blocks(sc->tp, sc->ip, whichfork, nextents,
count);
}
+
+/*
+ * If this inode has S_VERITY set on it, read the verity info. If the reading
+ * fails with anything other than ENOMEM, the file is corrupt, which we can
+ * detect later with fsverity_active.
+ *
+ * Callers must hold the IOLOCK and must not hold the ILOCK of sc->ip because
+ * activation reads inode data.
+ */
+int
+xchk_inode_setup_verity(
+ struct xfs_scrub *sc)
+{
+ int error;
+
+ if (!fsverity_active(VFS_I(sc->ip)))
+ return 0;
+
+ error = fsverity_ensure_verity_info(VFS_I(sc->ip));
+ switch (error) {
+ case 0:
+ /* fsverity is active */
+ break;
+ case -ENODATA:
+ case -EMSGSIZE:
+ case -EINVAL:
+ case -EFSCORRUPTED:
+ case -EFBIG:
+ /*
+ * The nonzero errno codes above are the error codes that can
+ * be returned from fsverity on metadata validation errors.
+ */
+ return 0;
+ default:
+ /* runtime errors */
+ return error;
+ }
+
+ return 0;
+}
+
+/*
+ * Is this a verity file that failed to activate? Callers must have tried to
+ * activate fsverity via xchk_inode_setup_verity.
+ */
+bool
+xchk_inode_verity_broken(
+ struct xfs_inode *ip)
+{
+ return fsverity_active(VFS_I(ip)) && !fsverity_get_info(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f2ecc68538f0..aa16d310bd6d 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -264,6 +264,8 @@ int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
bool *inuse);
int xchk_inode_count_blocks(struct xfs_scrub *sc, int whichfork,
xfs_extnum_t *nextents, xfs_filblks_t *count);
+int xchk_inode_setup_verity(struct xfs_scrub *sc);
+bool xchk_inode_verity_broken(struct xfs_inode *ip);
bool xchk_inode_is_dirtree_root(const struct xfs_inode *ip);
bool xchk_inode_is_sb_rooted(const struct xfs_inode *ip);
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 948d04dcba2a..8ce6917e22b4 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -36,6 +36,10 @@ xchk_prepare_iscrub(
xchk_ilock(sc, XFS_IOLOCK_EXCL);
+ error = xchk_inode_setup_verity(sc);
+ if (error)
+ return error;
+
error = xchk_trans_alloc(sc, 0);
if (error)
return error;
@@ -833,6 +837,9 @@ xchk_inode(
if (S_ISREG(VFS_I(sc->ip)->i_mode))
xchk_inode_check_reflink_iflag(sc, sc->ip->i_ino);
+ if (xchk_inode_verity_broken(sc->ip))
+ xchk_ino_set_corrupt(sc, sc->sm->sm_ino);
+
xchk_inode_check_unlinked(sc);
xchk_inode_xref(sc, sc->ip->i_ino, &di);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 9738b9ce3f2d..3761e3922466 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -573,6 +573,8 @@ xrep_dinode_flags(
dip->di_nrext64_pad = 0;
else if (dip->di_version >= 3)
dip->di_v3_pad = 0;
+ if (!xfs_has_verity(mp) || !S_ISREG(mode))
+ flags2 &= ~XFS_DIFLAG2_VERITY;
if (flags2 & XFS_DIFLAG2_METADATA) {
xfs_failaddr_t fa;
@@ -1613,6 +1615,10 @@ xrep_dinode_core(
if (iget_error)
return iget_error;
+ error = xchk_inode_setup_verity(sc);
+ if (error)
+ return error;
+
error = xchk_trans_alloc(sc, 0);
if (error)
return error;
@@ -2032,6 +2038,27 @@ xrep_inode_unlinked(
return 0;
}
+/*
+ * If this file is a fsverity file, xchk_prepare_iscrub or xrep_dinode_core
+ * should have activated it. If it's still not active, then there's something
+ * wrong with the verity descriptor and we should turn it off.
+ */
+STATIC int
+xrep_inode_verity(
+ struct xfs_scrub *sc)
+{
+ struct inode *inode = VFS_I(sc->ip);
+
+ if (xchk_inode_verity_broken(sc->ip)) {
+ sc->ip->i_diflags2 &= ~XFS_DIFLAG2_VERITY;
+ inode->i_flags &= ~S_VERITY;
+
+ xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+ }
+
+ return 0;
+}
+
/* Repair an inode's fields. */
int
xrep_inode(
@@ -2081,6 +2108,15 @@ xrep_inode(
return error;
}
+ /*
+ * Disable fsverity if it cannot be activated. Activation failure
+ * prohibits the file from being opened, so there cannot be another
+ * program with an open fd to what it thinks is a verity file.
+ */
+ error = xrep_inode_verity(sc);
+ if (error)
+ return error;
+
/* Reconnect incore unlinked list */
error = xrep_inode_unlinked(sc);
if (error)
--
2.51.2
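The switch in xchk_inode_setup_verity() sorts activation results into three groups: success, metadata validation errors (which return 0 so that scrub can flag the inode later through xchk_inode_verity_broken()), and everything else, which is passed up as a runtime error. A toy shell model of that sorting might look like the following. The function name classify_verity_error and the use of errno names as strings are assumptions made for illustration; the kernel code works on negative errno integers.

```shell
# classify_verity_error NAME: model of the switch in xchk_inode_setup_verity().
# NAME is "0" for success or an errno name without the minus sign.
# Hypothetical helper, not part of the patch.
classify_verity_error()
{
	case "$1" in
	0)
		# fsverity activated; nothing to report
		echo "active"
		;;
	ENODATA|EMSGSIZE|EINVAL|EFSCORRUPTED|EFBIG)
		# Metadata validation error: setup returns 0 and the inode is
		# reported as broken later in the scrub.
		echo "broken"
		;;
	*)
		# Runtime error such as ENOMEM: returned to the caller as-is
		echo "runtime"
		;;
	esac
}

for err in 0 EFBIG ENOMEM; do
	echo "$err -> $(classify_verity_error "$err")"
done
```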
* [PATCH v9 21/22] xfs: introduce health state for corrupted fsverity metadata
From: Andrey Albershteyn @ 2026-04-28 8:33 UTC
To: linux-xfs, fsverity, linux-fsdevel, ebiggers
Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260428083332.768693-1-aalbersh@kernel.org>
Report a corrupted fsverity descriptor through the health system.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 1 +
fs/xfs/libxfs/xfs_health.h | 4 +++-
fs/xfs/xfs_fsverity.c | 13 ++++++++++---
fs/xfs/xfs_health.c | 1 +
4 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index ebf17a0b0722..cece31ecee81 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -422,6 +422,7 @@ struct xfs_bulkstat {
#define XFS_BS_SICK_SYMLINK (1 << 6) /* symbolic link remote target */
#define XFS_BS_SICK_PARENT (1 << 7) /* parent pointers */
#define XFS_BS_SICK_DIRTREE (1 << 8) /* directory tree structure */
+#define XFS_BS_SICK_FSVERITY (1 << 9) /* fsverity metadata */
/*
* Project quota id helpers (previously projid was 16bit only
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 1d45cf5789e8..932b447190da 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -104,6 +104,7 @@ struct xfs_rtgroup;
/* Don't propagate sick status to ag health summary during inactivation */
#define XFS_SICK_INO_FORGET (1 << 12)
#define XFS_SICK_INO_DIRTREE (1 << 13) /* directory tree structure */
+#define XFS_SICK_INO_FSVERITY (1 << 14) /* fsverity metadata */
/* Primary evidence of health problems in a given group. */
#define XFS_SICK_FS_PRIMARY (XFS_SICK_FS_COUNTERS | \
@@ -140,7 +141,8 @@ struct xfs_rtgroup;
XFS_SICK_INO_XATTR | \
XFS_SICK_INO_SYMLINK | \
XFS_SICK_INO_PARENT | \
- XFS_SICK_INO_DIRTREE)
+ XFS_SICK_INO_DIRTREE | \
+ XFS_SICK_INO_FSVERITY)
#define XFS_SICK_INO_ZAPPED (XFS_SICK_INO_BMBTD_ZAPPED | \
XFS_SICK_INO_BMBTA_ZAPPED | \
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 298d712b5ba2..82f5ca542c97 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -84,16 +84,23 @@ xfs_fsverity_get_descriptor(
return error;
desc_size = be32_to_cpu(d_desc_size);
- if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE))
+ if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE)) {
+ xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
return -ERANGE;
- if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos))
+ }
+
+ if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos)) {
+ xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
return -ERANGE;
+ }
if (!buf_size)
return desc_size;
- if (XFS_IS_CORRUPT(mp, desc_size > buf_size))
+ if (XFS_IS_CORRUPT(mp, desc_size > buf_size)) {
+ xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
return -ERANGE;
+ }
desc_pos = round_down(desc_size_pos - desc_size, blocksize);
error = fsverity_pagecache_read(inode, buf, desc_size, desc_pos);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 239b843e83d4..be66760fb120 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -625,6 +625,7 @@ static const struct ioctl_sick_map ino_map[] = {
{ XFS_SICK_INO_DIR_ZAPPED, XFS_BS_SICK_DIR },
{ XFS_SICK_INO_SYMLINK_ZAPPED, XFS_BS_SICK_SYMLINK },
{ XFS_SICK_INO_DIRTREE, XFS_BS_SICK_DIRTREE },
+ { XFS_SICK_INO_FSVERITY, XFS_BS_SICK_FSVERITY },
};
/* Fill out bulkstat health info. */
--
2.51.2
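The three bounds checks that now mark the inode sick, and the round_down() that follows them, can be walked through with plain shell arithmetic. This is a sketch under stated assumptions: MAX_DESC_SIZE stands in for FS_VERITY_MAX_DESCRIPTOR_SIZE and its value of 16384 is assumed rather than shown in the patch, the helper names are hypothetical, and the sample sizes are made up.

```shell
# Assumed value of FS_VERITY_MAX_DESCRIPTOR_SIZE; the patch does not show it.
MAX_DESC_SIZE=16384

# check_desc_size DESC_SIZE DESC_SIZE_POS BUF_SIZE
# Print "ERANGE" where the kernel code would mark the inode sick and fail,
# otherwise "ok". BUF_SIZE of 0 models a size-only query.
check_desc_size()
{
	local desc_size="$1" desc_size_pos="$2" buf_size="$3"

	if (( desc_size > MAX_DESC_SIZE || desc_size > desc_size_pos )); then
		echo "ERANGE"
	elif (( buf_size != 0 && desc_size > buf_size )); then
		echo "ERANGE"
	else
		echo "ok"
	fi
}

# round_down VALUE ALIGN: ALIGN must be a power of two.
round_down()
{
	echo $(( $1 & ~($2 - 1) ))
}

# Made-up sample geometry
desc_size=256
desc_size_pos=8188
blocksize=4096

echo "size check: $(check_desc_size $desc_size $desc_size_pos 512)"
echo "read offset: $(round_down $((desc_size_pos - desc_size)) $blocksize)"
echo "oversized descriptor: $(check_desc_size 20000 40000 0)"
```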
* [PATCH v9 22/22] xfs: enable ro-compat fs-verity flag
From: Andrey Albershteyn @ 2026-04-28 8:33 UTC
To: linux-xfs, fsverity, linux-fsdevel, ebiggers
Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260428083332.768693-1-aalbersh@kernel.org>
Finalize fs-verity integration in XFS by adding the fs-verity ro-compat
flag to the set of features the kernel knows about.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add spaces]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4dff29659e40..0ce46c234b9c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -378,8 +378,9 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
- XFS_SB_FEAT_RO_COMPAT_REFLINK| \
- XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+ XFS_SB_FEAT_RO_COMPAT_REFLINK | \
+ XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+ XFS_SB_FEAT_RO_COMPAT_VERITY)
#define XFS_SB_FEAT_RO_COMPAT_UNKNOWN ~XFS_SB_FEAT_RO_COMPAT_ALL
static inline bool
xfs_sb_has_ro_compat_feature(
--
2.51.2
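The effect of adding the flag to XFS_SB_FEAT_RO_COMPAT_ALL can be seen from the XFS_SB_FEAT_RO_COMPAT_UNKNOWN mask shown in the hunk context: any superblock feature bit outside the "ALL" set counts as unknown. A small shell sketch of that mask arithmetic follows. The bit chosen for the verity flag (1 << 4) is an assumption, since the patch does not show its definition, and the variable and function names are hypothetical.

```shell
# Existing ro-compat feature bits
RO_FINOBT=$((1 << 0))
RO_RMAPBT=$((1 << 1))
RO_REFLINK=$((1 << 2))
RO_INOBTCNT=$((1 << 3))
# Assumed bit for XFS_SB_FEAT_RO_COMPAT_VERITY; not shown in this patch.
RO_VERITY=$((1 << 4))

ALL_BEFORE=$(( RO_FINOBT | RO_RMAPBT | RO_REFLINK | RO_INOBTCNT ))
ALL_AFTER=$(( ALL_BEFORE | RO_VERITY ))

# unknown_ro_compat FEATURES KNOWN_MASK: print the feature bits in FEATURES
# that are not covered by KNOWN_MASK (0 means everything is understood).
unknown_ro_compat()
{
	echo $(( $1 & ~$2 ))
}

# Made-up superblock with reflink and verity enabled
sb_features=$(( RO_REFLINK | RO_VERITY ))

echo "unknown bits before this patch: $(unknown_ro_compat $sb_features $ALL_BEFORE)"
echo "unknown bits after this patch:  $(unknown_ro_compat $sb_features $ALL_AFTER)"
```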
* [PATCH v3] generic/790: test post-EOF gap zeroing persistence
From: Zhang Yi @ 2026-04-28 8:57 UTC
To: fstests, zlang
Cc: linux-ext4, linux-fsdevel, bfoster, jack, yi.zhang, yi.zhang,
yizhang089, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Test that extending a file past a non-block-aligned EOF correctly
zero-fills the gap [old_EOF, block_boundary), and that this zeroing
persists through a filesystem shutdown+remount cycle.
Stale data beyond EOF can persist on disk when append write data blocks
are flushed before the on-disk file size update, or when concurrent
append writeback and mmap writes persist non-zero data past EOF.
Subsequent post-EOF operations (append write, fallocate, truncate up)
must zero-fill and persist the gap to prevent exposing stale data.
The test pollutes the file's last physical block (via FIEMAP + raw
device write) with a sentinel pattern beyond i_size, then performs each
extend operation and verifies the gap is zeroed both in memory and on
disk.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
v2->v3:
- Add an error check for the raw device pwrite; a failed pwrite would
silently leave the test continuing with an unpolluted block,
producing false-positive passes.
- Add sync_range -a to wait until the extending I/O completes and to
ensure file size update is persisted before shutdown, preventing
unexpected file size errors.
v1->v2:
- Add _require_no_realtime to prevent testing on XFS realtime devices,
where file data may reside on $SCRATCH_RTDEV.
- Add _exclude_fs btrfs since FIEMAP returns logical addresses, not
physical device offsets; writing to these offsets on $SCRATCH_DEV
would corrupt the filesystem in multi-device setups. Besides, btrfs
doesn't support shutdown right now, so support can be added later.
- Add -v flag to od in _check_gap_zero() to prevent line folding of
identical consecutive lines.
- Add an expected_new_sz parameter to _test_eof_zeroing(), verify that
the file size was not rolled back after the shutdown+remount cycle, and
drop the unnecessary file size check before the shutdown.
- Clarify the comment regarding when stale data beyond EOF can persist.
tests/generic/790 | 168 ++++++++++++++++++++++++++++++++++++++++++
tests/generic/790.out | 4 +
2 files changed, 172 insertions(+)
create mode 100755 tests/generic/790
create mode 100644 tests/generic/790.out
diff --git a/tests/generic/790 b/tests/generic/790
new file mode 100755
index 00000000..6daf3793
--- /dev/null
+++ b/tests/generic/790
@@ -0,0 +1,168 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Huawei. All Rights Reserved.
+#
+# FS QA Test No. 790
+#
+# Test that extending a file past a non-block-aligned EOF correctly zero-fills
+# the gap [old_EOF, block_boundary), and that this zeroing persists through a
+# filesystem shutdown+remount cycle.
+#
+# Stale data beyond EOF can persist on disk when:
+# 1) append write data blocks are flushed before the on-disk file size update,
+# and the system crashes in this window.
+# 2) concurrent append writeback and mmap writes persist non-zero data past EOF.
+#
+# Subsequent post-EOF operations (append write, fallocate, truncate up) must
+# zero-fill and persist the gap to prevent exposing stale data.
+#
+# The test pollutes the file's last physical block (via FIEMAP + raw device
+# write) with a sentinel pattern beyond i_size, then performs each extend
+# operation and verifies the gap is zeroed both in memory and on disk.
+#
+. ./common/preamble
+_begin_fstest auto quick rw shutdown
+
+. ./common/filter
+
+_require_scratch
+_require_block_device $SCRATCH_DEV
+_require_no_realtime
+_require_scratch_shutdown
+_require_metadata_journaling $SCRATCH_DEV
+
+# FIEMAP on Btrfs returns logical addresses within the filesystem's address
+# space, not physical device offsets. Writing to these offsets on $SCRATCH_DEV
+# would corrupt the filesystem in multi-device setups.
+_exclude_fs btrfs
+
+_require_xfs_io_command "fiemap"
+_require_xfs_io_command "falloc"
+_require_xfs_io_command "pwrite"
+_require_xfs_io_command "truncate"
+_require_xfs_io_command "sync_range"
+
+# Check that gap region [offset, offset+nbytes) is entirely zero
+_check_gap_zero()
+{
+ local file="$1"
+ local offset="$2"
+ local nbytes="$3"
+ local label="$4"
+ local data
+ local stripped
+
+ data=$(od -A n -t x1 -v -j $offset -N $nbytes "$file" 2>/dev/null)
+
+ # Remove whitespace and check if any byte is non-zero
+ stripped=$(printf '%s' "$data" | tr -d ' \n\t')
+ if [ -n "$stripped" ] && ! echo "$stripped" | grep -qE "^0+$"; then
+ echo "FAIL: non-zero data in gap [$offset,$((offset + nbytes))) $label"
+ _hexdump -N $((offset + nbytes)) "$file"
+ return 1
+ fi
+ return 0
+}
+
+# Get the physical block offset (in bytes) of the file's first block on device
+_get_phys_offset()
+{
+ local file="$1"
+ local fiemap_output
+ local phys_blk
+
+ fiemap_output=$($XFS_IO_PROG -r -c "fiemap -v" "$file" 2>/dev/null)
+ phys_blk=$(echo "$fiemap_output" | _filter_xfs_io_fiemap | head -1 | awk '{print $3}')
+ if [ -z "$phys_blk" ]; then
+ echo ""
+ return
+ fi
+ # Convert 512-byte blocks to bytes
+ echo $((phys_blk * 512))
+}
+
+_test_eof_zeroing()
+{
+ local test_name="$1"
+ local extend_cmd="$2"
+ local expected_new_sz="$3"
+ local file=$SCRATCH_MNT/testfile_${test_name}
+
+ echo "$test_name" | tee -a $seqres.full
+
+ # Compute non-block-aligned EOF offset
+ local gap_bytes=16
+ local eof_offset=$((blksz - gap_bytes))
+
+ # Step 1: Write one full block to ensure the filesystem allocates a
+ # physical block for the file instead of using inline data.
+ $XFS_IO_PROG -f -c "pwrite -S 0x5a 0 $blksz" -c fsync \
+ "$file" >> $seqres.full 2>&1
+
+ # Step 2: Get physical block offset on device via FIEMAP
+ local phys_offset
+ phys_offset=$(_get_phys_offset "$file")
+ if [ -z "$phys_offset" ]; then
+ _fail "$test_name: failed to get physical block offset via fiemap"
+ fi
+
+ # Step 3: Truncate file to non-block-aligned size and fsync.
+ # The on-disk region [eof_offset, blksz) may or may not be
+ # zeroed by the filesystem at this point.
+ $XFS_IO_PROG -c "truncate $eof_offset" -c fsync \
+ "$file" >> $seqres.full 2>&1
+
+ # Step 4: Unmount and restore the physical block to all-0x5a on disk.
+ # This bypasses the kernel's pagecache EOF-zeroing to ensure
+ # the stale pattern is present on disk. Then remount.
+ _scratch_unmount
+ $XFS_IO_PROG -d -c "pwrite -S 0x5a $phys_offset $blksz" \
+ $SCRATCH_DEV >> $seqres.full 2>&1
+ if [ $? -ne 0 ]; then
+ _fail "$test_name: failed to inject stale data on disk"
+ fi
+ _scratch_mount >> $seqres.full 2>&1
+
+ # Step 5: Execute the extend operation.
+ $XFS_IO_PROG -c "$extend_cmd" "$file" >> $seqres.full 2>&1
+
+ # Step 6: Verify gap [eof_offset, blksz) is zeroed BEFORE shutdown
+ _check_gap_zero "$file" $eof_offset $gap_bytes "before shutdown" || return 1
+
+ # Step 7: Sync the extended range and shutdown the filesystem with
+ # journal flush. This persists the file size extending, and
+ # the filesystem should persist the zeroed data in the gap
+ # range as well.
+ if [ "$extend_cmd" != "${extend_cmd#pwrite}" ]; then
+ $XFS_IO_PROG -c "sync_range -w $blksz $blksz" \
+ -c "sync_range -a $blksz $blksz" \
+ "$file" >> $seqres.full 2>&1
+ fi
+ _scratch_shutdown -f
+
+ # Step 8: Remount and verify gap is still zeroed
+ _scratch_cycle_mount
+
+ # Verify file size was not rolled back after shutdown+remount
+ local sz
+ sz=$(stat -c %s "$file")
+ if [ "$sz" -ne "$expected_new_sz" ]; then
+ _fail "$test_name: file size rolled back after shutdown+remount: $sz != $expected_new_sz"
+ fi
+
+ _check_gap_zero "$file" $eof_offset $gap_bytes "after shutdown+remount" || return 1
+}
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+
+blksz=$(_get_block_size $SCRATCH_MNT)
+
+# Test three variants of EOF-extending operations
+_test_eof_zeroing "append_write" "pwrite -S 0x42 $blksz $blksz" $((blksz * 2))
+_test_eof_zeroing "truncate_up" "truncate $((blksz * 2))" $((blksz * 2))
+_test_eof_zeroing "fallocate" "falloc $blksz $blksz" $((blksz * 2))
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/790.out b/tests/generic/790.out
new file mode 100644
index 00000000..e5e2cc09
--- /dev/null
+++ b/tests/generic/790.out
@@ -0,0 +1,4 @@
+QA output created by 790
+append_write
+truncate_up
+fallocate
--
2.52.0
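The zero-gap check at the heart of _check_gap_zero() can be tried outside fstests on an ordinary temporary file. The sketch below is a standalone approximation, not the test itself: gap_is_zero is a hypothetical name, the block size is hard-coded instead of coming from _get_block_size, and stale data is simulated by filling the file with 0x5a ('Z') rather than by writing to a raw device.

```shell
# gap_is_zero FILE OFFSET NBYTES: succeed if [OFFSET, OFFSET+NBYTES) holds
# only zero bytes. Hypothetical standalone variant of _check_gap_zero().
gap_is_zero()
{
	local file="$1" offset="$2" nbytes="$3"
	local stripped

	# -v stops od from folding identical consecutive lines into "*"
	stripped=$(od -A n -t x1 -v -j "$offset" -N "$nbytes" "$file" |
		tr -d ' \n\t')

	# An empty dump (range past EOF) is treated as clean, as in the test
	[ -z "$stripped" ] || printf '%s\n' "$stripped" | grep -qE '^0+$'
}

blksz=4096
gap_bytes=16
eof_offset=$((blksz - gap_bytes))
tmp=$(mktemp)

# One block of 0x5a ('Z') stands in for the stale on-disk pattern
head -c "$blksz" /dev/zero | tr '\0' 'Z' > "$tmp"
if gap_is_zero "$tmp" "$eof_offset" "$gap_bytes"; then
	echo "polluted block: gap zeroed"
else
	echo "polluted block: stale data in gap"
fi

# Zero the gap in place, as a correct extend operation is expected to do
dd if=/dev/zero of="$tmp" bs=1 seek="$eof_offset" count="$gap_bytes" \
	conv=notrunc 2>/dev/null
if gap_is_zero "$tmp" "$eof_offset" "$gap_bytes"; then
	echo "after zeroing: gap zeroed"
else
	echo "after zeroing: stale data in gap"
fi

rm -f "$tmp"
```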
* Re: [PATCH v2] generic/790: test post-EOF gap zeroing persistence
From: Zhang Yi @ 2026-04-28 8:52 UTC
To: Brian Foster
Cc: fstests, zlang, linux-ext4, linux-fsdevel, jack, yi.zhang,
yizhang089, yangerkun
In-Reply-To: <ae9euwyg53TRutsK@bfoster>
On 4/27/2026 9:03 PM, Brian Foster wrote:
> On Sat, Apr 25, 2026 at 11:06:27AM +0800, Zhang Yi wrote:
>> On 4/24/2026 9:09 PM, Brian Foster wrote:
>>> On Fri, Apr 24, 2026 at 05:22:28PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Test that extending a file past a non-block-aligned EOF correctly
>>>> zero-fills the gap [old_EOF, block_boundary), and that this zeroing
>>>> persists through a filesystem shutdown+remount cycle.
>>>>
>>>> Stale data beyond EOF can persist on disk when append write data blocks
>>>> are flushed before the on-disk file size update, or when concurrent
>>>> append writeback and mmap writes persist non-zero data past EOF.
>>>> Subsequent post-EOF operations (append write, fallocate, truncate up)
>>>> must zero-fill and persist the gap to prevent exposing stale data.
>>>>
>>>> The test pollutes the file's last physical block (via FIEMAP + raw
>>>> device write) with a sentinel pattern beyond i_size, then performs each
>>>> extend operation and verifies the gap is zeroed both in memory and on
>>>> disk.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>> v1->v2:
>>>> - Add _require_no_realtime to prevent testing on XFS realtime devices,
>>>> where file data may reside on $SCRATCH_RTDEV.
>>>> - Add _exclude_fs btrfs since FIEMAP returns logical addresses, not
>>>> physical device offsets, writing to these offsets on $SCRATCH_DEV
>>>> would corrupt the filesystem in multi-device setups. Besides, since
>>>> btrfs doesn't support shutdown right now, we can support it later.
>>>> - Add -v flag to od in _check_gap_zero() to prevent line folding of
>>>> identical consecutive lines.
>>>> - Add expected_new_sz parameter to _test_eof_zeroing(), verify file
>>>> size was not rolled back after shutdown+remount cycle, and also drop
>>>> the unnecessary file size check before the shutdown as well.
>>>> - Clarify the comment regarding when stale data beyond EOF can persist.
>>>>
>>>
>>> Thanks for the tweaks. This all LGTM from a review standpoint. I gave it
>>> a quick test on latest master and I see a few failures in a couple runs:
>>>
>>> - On XFS (mkfs defaults) I saw one unexpected i_size failure and one
>>> zeroing failure, both on write extension fwiw.
>>
>> Previously, I only discovered the zeroing failure of append write. This
>> is because xfs_file_write_checks() -> xfs_file_write_zero_eof() only
>> zeroes the gap range in the page cache, without providing any
>> synchronous or asynchronous persistence (instead, truncate up does
>> synchronously writeback in xfs_vn_setattr_size(), and ext4 achieves
>> persistence via asynchronous writeback in data=ordered mode). So I think
>> this is a XFS problem.
>>
>
> Ah Ok.. the truncate flush has been there for a while to combat the NULL
> files problem, and the zeroing check ties into that. I suspect the
> reason we don't see this often is typical ascending offset writeback
> order, whereas this test is doing a targeted sync range of a later
> offset in the file (i.e. technically not overlapping with the range that
> was zeroed internally).
Yes, this use case is quite rare.
>
> I guess the simple thing to do there would be to similarly flush on
> write extension. I'd want to think about that just a bit because we
> haven't historically flushed there and I know the zeroing improvements
> I've made in iomap have caused a couple performance regressions along
> the way related to aggressive flushing that had to be fixed.
I am also concerned that synchronous writeback might introduce
performance regressions. The probability is probably low (though I'm
not sure), but some users who rely on post-EOF writes may still
complain.
In this corner case, ext4+iomap also encountered the same dilemma, as it
can no longer use data=ordered to achieve asynchronous persistence.
FYI, the current approach of ext4+iomap is to submit an I/O immediately
after zeroing the EOF block, but it does not wait for the I/O to
complete. In the subsequent endio process, it waits for this zero I/O to
complete before updating i_disksize beyond the end of the gap.
Best Regards,
Yi
>
> Maybe this is less of an issue now, but regardless if my assumptions are
> correct here than I agree this isn't a test issue.
>
>> Regarding the i_size failure, I did not directly reproduce this issue.
>> After analysis, I believe it is because the test case did not include
>> the -a option in sync_range, meaning it did not wait for IO writeback
>> completion and file size update persistence. I reproduced this issue by
>> adding a delay in the XFS end IO path. This is a problem with the test
>> case, and I will fix it in v3. Thank you for pointing this out.
>>
>
> Makes sense. It will be interesting to see if we can still reproduce the
> above issue with this change.
>
>>> - On ext4 I saw a few unexpected i_size failures (both with mkfs
>>> defaults and 1k block size).
>>>
>>
>> This is an ext4 issue on the shutdown path. Since ext4 set the shutdown
>> flag too early, it was unable to write back ordered zero data when
>> flushing the journal, which led to a journal abort and prevented the file
>> size update from being persisted. I have submitted a patch to fix this
>> issue. Please see below link for details.
>>
>> https://lore.kernel.org/linux-ext4/20260424104201.1930823-1-yi.zhang@huaweicloud.com/
>>
>
> Ok, thanks.
>
> Brian
>
>> Thanks
>> Yi.
>>
>>> I haven't dug into anything beyond that. Does this match what you're
>>> seeing on current kernels or are these unexpected failures?
>>>
>>> Brian
>>>
>>>> tests/generic/790 | 164 ++++++++++++++++++++++++++++++++++++++++++
>>>> tests/generic/790.out | 4 ++
>>>> 2 files changed, 168 insertions(+)
>>>> create mode 100755 tests/generic/790
>>>> create mode 100644 tests/generic/790.out
>>>>
>>>> diff --git a/tests/generic/790 b/tests/generic/790
>>>> new file mode 100755
>>>> index 00000000..2adc06f8
>>>> --- /dev/null
>>>> +++ b/tests/generic/790
>>>> @@ -0,0 +1,164 @@
>>>> +#! /bin/bash
>>>> +# SPDX-License-Identifier: GPL-2.0
>>>> +# Copyright (c) 2026 Huawei. All Rights Reserved.
>>>> +#
>>>> +# FS QA Test No. 790
>>>> +#
>>>> +# Test that extending a file past a non-block-aligned EOF correctly zero-fills
>>>> +# the gap [old_EOF, block_boundary), and that this zeroing persists through a
>>>> +# filesystem shutdown+remount cycle.
>>>> +#
>>>> +# Stale data beyond EOF can persist on disk when:
>>>> +# 1) append write data blocks are flushed before the on-disk file size update,
>>>> +# and the system crashes in this window.
>>>> +# 2) concurrent append writeback and mmap writes persist non-zero data past EOF.
>>>> +#
>>>> +# Subsequent post-EOF operations (append write, fallocate, truncate up) must
>>>> +# zero-fill and persist the gap to prevent exposing stale data.
>>>> +#
>>>> +# The test pollutes the file's last physical block (via FIEMAP + raw device
>>>> +# write) with a sentinel pattern beyond i_size, then performs each extend
>>>> +# operation and verifies the gap is zeroed both in memory and on disk.
>>>> +#
>>>> +. ./common/preamble
>>>> +_begin_fstest auto quick rw shutdown
>>>> +
>>>> +. ./common/filter
>>>> +
>>>> +_require_scratch
>>>> +_require_block_device $SCRATCH_DEV
>>>> +_require_no_realtime
>>>> +_require_scratch_shutdown
>>>> +_require_metadata_journaling $SCRATCH_DEV
>>>> +
>>>> +# FIEMAP on Btrfs returns logical addresses within the filesystem's address
>>>> +# space, not physical device offsets. Writing to these offsets on $SCRATCH_DEV
>>>> +# would corrupt the filesystem in multi-device setups.
>>>> +_exclude_fs btrfs
>>>> +
>>>> +_require_xfs_io_command "fiemap"
>>>> +_require_xfs_io_command "falloc"
>>>> +_require_xfs_io_command "pwrite"
>>>> +_require_xfs_io_command "truncate"
>>>> +_require_xfs_io_command "sync_range"
>>>> +
>>>> +# Check that gap region [offset, offset+nbytes) is entirely zero
>>>> +_check_gap_zero()
>>>> +{
>>>> + local file="$1"
>>>> + local offset="$2"
>>>> + local nbytes="$3"
>>>> + local label="$4"
>>>> + local data
>>>> + local stripped
>>>> +
>>>> + data=$(od -A n -t x1 -v -j $offset -N $nbytes "$file" 2>/dev/null)
>>>> +
>>>> + # Remove whitespace and check if any byte is non-zero
>>>> + stripped=$(printf '%s' "$data" | tr -d ' \n\t')
>>>> + if [ -n "$stripped" ] && ! echo "$stripped" | grep -qE "^0+$"; then
>>>> + echo "FAIL: non-zero data in gap [$offset,$((offset + nbytes))) $label"
>>>> + _hexdump -N $((offset + nbytes)) "$file"
>>>> + return 1
>>>> + fi
>>>> + return 0
>>>> +}
>>>> +
>>>> +# Get the physical block offset (in bytes) of the file's first block on device
>>>> +_get_phys_offset()
>>>> +{
>>>> + local file="$1"
>>>> + local fiemap_output
>>>> + local phys_blk
>>>> +
>>>> + fiemap_output=$($XFS_IO_PROG -r -c "fiemap -v" "$file" 2>/dev/null)
>>>> + phys_blk=$(echo "$fiemap_output" | _filter_xfs_io_fiemap | head -1 | awk '{print $3}')
>>>> + if [ -z "$phys_blk" ]; then
>>>> + echo ""
>>>> + return
>>>> + fi
>>>> + # Convert 512-byte blocks to bytes
>>>> + echo $((phys_blk * 512))
>>>> +}
>>>> +
>>>> +_test_eof_zeroing()
>>>> +{
>>>> + local test_name="$1"
>>>> + local extend_cmd="$2"
>>>> + local expected_new_sz="$3"
>>>> + local file=$SCRATCH_MNT/testfile_${test_name}
>>>> +
>>>> + echo "$test_name" | tee -a $seqres.full
>>>> +
>>>> + # Compute non-block-aligned EOF offset
>>>> + local gap_bytes=16
>>>> + local eof_offset=$((blksz - gap_bytes))
>>>> +
>>>> + # Step 1: Write one full block to ensure the filesystem allocates a
>>>> + # physical block for the file instead of using inline data.
>>>> + $XFS_IO_PROG -f -c "pwrite -S 0x5a 0 $blksz" -c fsync \
>>>> + "$file" >> $seqres.full 2>&1
>>>> +
>>>> + # Step 2: Get physical block offset on device via FIEMAP
>>>> + local phys_offset
>>>> + phys_offset=$(_get_phys_offset "$file")
>>>> + if [ -z "$phys_offset" ]; then
>>>> + _fail "$test_name: failed to get physical block offset via fiemap"
>>>> + fi
>>>> +
>>>> + # Step 3: Truncate file to non-block-aligned size and fsync.
>>>> + # The on-disk region [eof_offset, blksz) may or may not be
>>>> + # zeroed by the filesystem at this point.
>>>> + $XFS_IO_PROG -c "truncate $eof_offset" -c fsync \
>>>> + "$file" >> $seqres.full 2>&1
>>>> +
>>>> + # Step 4: Unmount and restore the physical block to all-0x5a on disk.
>>>> + # This bypasses the kernel's pagecache EOF-zeroing to ensure
>>>> + # the stale pattern is present on disk. Then remount.
>>>> + _scratch_unmount
>>>> + $XFS_IO_PROG -d -c "pwrite -S 0x5a $phys_offset $blksz" \
>>>> + $SCRATCH_DEV >> $seqres.full 2>&1
>>>> + _scratch_mount >> $seqres.full 2>&1
>>>> +
>>>> + # Step 5: Execute the extend operation.
>>>> + $XFS_IO_PROG -c "$extend_cmd" "$file" >> $seqres.full 2>&1
>>>> +
>>>> + # Step 6: Verify gap [eof_offset, blksz) is zeroed BEFORE shutdown
>>>> + _check_gap_zero "$file" $eof_offset $gap_bytes "before shutdown" || return 1
>>>> +
>>>> + # Step 7: Sync the extended range and shutdown the filesystem with
>>>> + # journal flush. This persists the file size extending, and
>>>> + # the filesystem should persist the zeroed data in the gap
>>>> + # range as well.
>>>> + if [ "$extend_cmd" != "${extend_cmd#pwrite}" ]; then
>>>> + $XFS_IO_PROG -c "sync_range -w $blksz $blksz" \
>>>> + "$file" >> $seqres.full 2>&1
>>>> + fi
>>>> + _scratch_shutdown -f
>>>> +
>>>> + # Step 8: Remount and verify gap is still zeroed
>>>> + _scratch_cycle_mount
>>>> +
>>>> + # Verify file size was not rolled back after shutdown+remount
>>>> + local sz
>>>> + sz=$(stat -c %s "$file")
>>>> + if [ "$sz" -ne "$expected_new_sz" ]; then
>>>> + _fail "$test_name: file size rolled back after shutdown+remount: $sz != $expected_new_sz"
>>>> + fi
>>>> +
>>>> + _check_gap_zero "$file" $eof_offset $gap_bytes "after shutdown+remount" || return 1
>>>> +}
>>>> +
>>>> +_scratch_mkfs >> $seqres.full 2>&1
>>>> +_scratch_mount
>>>> +
>>>> +blksz=$(_get_block_size $SCRATCH_MNT)
>>>> +
>>>> +# Test three variants of EOF-extending operations
>>>> +_test_eof_zeroing "append_write" "pwrite -S 0x42 $blksz $blksz" $((blksz * 2))
>>>> +_test_eof_zeroing "truncate_up" "truncate $((blksz * 2))" $((blksz * 2))
>>>> +_test_eof_zeroing "fallocate" "falloc $blksz $blksz" $((blksz * 2))
>>>> +
>>>> +# success, all done
>>>> +status=0
>>>> +exit
>>>> diff --git a/tests/generic/790.out b/tests/generic/790.out
>>>> new file mode 100644
>>>> index 00000000..e5e2cc09
>>>> --- /dev/null
>>>> +++ b/tests/generic/790.out
>>>> @@ -0,0 +1,4 @@
>>>> +QA output created by 790
>>>> +append_write
>>>> +truncate_up
>>>> +fallocate
>>>> --
>>>> 2.52.0
>>>>
>>
>
^ permalink raw reply
* [PATCH v2] iomap: add simple read path for small direct I/O
From: Fengnan Chang @ 2026-04-28 11:47 UTC (permalink / raw)
To: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
linux-ext4, linux-kernel, lidiangang
Cc: Fengnan Chang
When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.
Using io_uring with poll mode for a 4K randread test on a raw block
device:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /dev/nvme10n1
Result: ~3.2M IOPS
Running the exact same workload on ext4 and XFS:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
Result: ~1.84M IOPS
Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:
5.33% [kernel] [k] __iomap_dio_rw
3.26% [kernel] [k] iomap_iter
2.37% [kernel] [k] iomap_dio_bio_iter
2.35% [kernel] [k] kfree
1.33% [kernel] [k] iomap_dio_complete
Introduce a simple read path to reduce the iomap overhead. The simple
read path is taken when the request satisfies:
- I/O size is <= inode blocksize (fits in a single block, no splits).
- No custom `iomap_dio_ops` (dops) registered by the filesystem.
After this optimization, the heavy generic functions disappear from the
profile, replaced by a single streamlined execution path:
4.83% [kernel] [k] iomap_dio_simple_read
With this patch, 4K random read IOPS on ext4 increases from 1.84M to
2.19M in the original single-core io_uring poll-mode workload.
Below are the test results using fio:
fs    workload        qd   simple=0   simple=1     gain
ext4  libaio           1     18,738     18,761   +0.12%
ext4  libaio         128    455,383    471,473   +3.53%
ext4  libaio         256    453,273    468,555   +3.37%
ext4  libaio         512    447,320    469,036   +4.85%
ext4  io_uring         1     18,798     18,824   +0.14%
ext4  io_uring       128    503,834    528,353   +4.87%
ext4  io_uring       256    503,635    527,617   +4.76%
ext4  io_uring       512    501,802    527,882   +5.20%
ext4  io_uring_poll    1     19,246     19,270   +0.12%
ext4  io_uring_poll  128  1,463,343  1,565,019   +6.95%
ext4  io_uring_poll  256  1,651,112  1,888,182  +14.36%
ext4  io_uring_poll  512  1,632,641  1,893,259  +15.96%
xfs   libaio           1     18,715     18,734   +0.10%
xfs   libaio         128    452,974    473,459   +4.52%
xfs   libaio         256    454,435    470,855   +3.61%
xfs   libaio         512    456,796    473,047   +3.56%
xfs   io_uring         1     18,755     18,795   +0.21%
xfs   io_uring       128    509,459    534,819   +4.98%
xfs   io_uring       256    509,853    536,051   +5.14%
xfs   io_uring       512    507,926    533,558   +5.05%
xfs   io_uring_poll    1     19,230     19,269   +0.20%
xfs   io_uring_poll  128  1,467,398  1,567,840   +6.84%
xfs   io_uring_poll  256  1,636,852  1,878,917  +14.79%
xfs   io_uring_poll  512  1,639,495  1,874,813  +14.35%
Assisted-by: Gemini:gemini-3.1-pro-preview
Assisted-by: Codex:gpt-5-5
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
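Note, not part of the commit: the eligibility rules can be modelled in
user space roughly as follows. This is only a sketch under assumptions.
struct simple_read_req, simple_read_decide() and the SR_* names are
hypothetical and do not exist in the kernel. The sketch mirrors the
checks in iomap_dio_simple_read_supported() plus the alignment test, and
leaves out the dio_flags, done_before and extent-mapping checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the state the kernel inspects. */
struct simple_read_req {
	bool     is_read;    /* models iov_iter_rw(iter) == READ */
	bool     has_dops;   /* filesystem passed a non-NULL iomap_dio_ops */
	uint64_t pos;        /* models iocb->ki_pos */
	size_t   count;      /* models iov_iter_count(iter) */
	uint64_t i_size;     /* models i_size_read(inode) */
	uint32_t blocksize;  /* filesystem block size, power of two */
	uint32_t alignment;  /* required DIO alignment, power of two */
};

enum sr_decision {
	SR_FAST_PATH,        /* try the simple read path */
	SR_SLOW_PATH,        /* fall back to the generic path */
	SR_REJECT_EINVAL,    /* misaligned request, a real error */
};

static enum sr_decision simple_read_decide(const struct simple_read_req *r)
{
	/* Writes and filesystems with custom dops always use the slow path. */
	if (!r->is_read || r->has_dops)
		return SR_SLOW_PATH;
	/* Only I/O that fits in a single block qualifies. */
	if (r->count > r->blocksize)
		return SR_SLOW_PATH;
	/* Reads crossing EOF are left to the generic code. */
	if (r->pos + r->count > r->i_size)
		return SR_SLOW_PATH;
	/* Same test as the patch: (pos | length) & (alignment - 1). */
	if ((r->pos | r->count) & (r->alignment - 1))
		return SR_REJECT_EINVAL;
	return SR_FAST_PATH;
}
```

The three-way result reflects that a misaligned request is reported as
an error rather than retried on the generic path.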
fs/iomap/direct-io.c | 382 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 371 insertions(+), 11 deletions(-)
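Note, not part of the commit: the submitter/end_io rendezvous used by
the async path can be illustrated with a small user-space model built
on C11 atomics. This is a sketch only; struct sr_model, model_end_io()
and model_submit_done() are hypothetical names, and the kernel code uses
atomic_cmpxchg() rather than <stdatomic.h>. The point it tries to show
is that, whichever side runs first, exactly one side ends up owning the
completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { SR_SUBMITTING = 0, SR_QUEUED, SR_DONE };

/* Hypothetical model of the per-request rendezvous state. */
struct sr_model {
	atomic_int state;
};

/*
 * Completion side. Returns true when the caller must invoke the
 * asynchronous completion callback (ki_complete() in the kernel),
 * false when the submitter will finish the request itself.
 */
static bool model_end_io(struct sr_model *sr)
{
	int expected = SR_SUBMITTING;

	/* Submitter not finished yet: record DONE and back off. */
	if (atomic_compare_exchange_strong(&sr->state, &expected, SR_DONE))
		return false;
	/* The exchange failed, so the state was already QUEUED. */
	return true;
}

/*
 * Submit side, called after the bio has been submitted. Returns true
 * when end_io already ran and the submitter must complete inline,
 * false when it should return -EIOCBQUEUED.
 */
static bool model_submit_done(struct sr_model *sr)
{
	int expected = SR_SUBMITTING;

	if (atomic_compare_exchange_strong(&sr->state, &expected, SR_QUEUED))
		return false;
	/* The exchange failed, so end_io had already set DONE. */
	return true;
}
```

In either ordering one call returns true and the other false, so the
request is completed exactly once.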
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e911daedff65a..807d8c628a464 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -9,6 +9,9 @@
#include <linux/iomap.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fserror.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
#include "internal.h"
#include "trace.h"
@@ -236,20 +239,26 @@ static void iomap_dio_done(struct iomap_dio *dio)
iomap_dio_complete_work(&dio->aio.work);
}
-static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+static inline void iomap_dio_bio_release_pages(struct bio *bio,
+ unsigned int dio_flags, bool error)
{
- struct iomap_dio *dio = bio->bi_private;
-
- if (dio->flags & IOMAP_DIO_BOUNCE) {
- bio_iov_iter_unbounce(bio, !!dio->error,
- dio->flags & IOMAP_DIO_USER_BACKED);
+ if (dio_flags & IOMAP_DIO_BOUNCE) {
+ bio_iov_iter_unbounce(bio, error,
+ dio_flags & IOMAP_DIO_USER_BACKED);
bio_put(bio);
- } else if (dio->flags & IOMAP_DIO_USER_BACKED) {
+ } else if (dio_flags & IOMAP_DIO_USER_BACKED) {
bio_check_pages_dirty(bio);
} else {
bio_release_pages(bio, false);
bio_put(bio);
}
+}
+
+static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+{
+ struct iomap_dio *dio = bio->bi_private;
+
+ iomap_dio_bio_release_pages(bio, dio->flags, !!dio->error);
/* Do not touch bio below, we just gave up our reference. */
@@ -387,6 +396,14 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
return ret;
}
+static inline unsigned int iomap_dio_alignment(struct inode *inode,
+ struct block_device *bdev, unsigned int dio_flags)
+{
+ if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+ return i_blocksize(inode);
+ return bdev_logical_block_size(bdev);
+}
+
static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
{
const struct iomap *iomap = &iter->iomap;
@@ -405,10 +422,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
* File systems that write out of place and always allocate new blocks
* need each bio to be block aligned as that's the unit of allocation.
*/
- if (dio->flags & IOMAP_DIO_FSBLOCK_ALIGNED)
- alignment = fs_block_size;
- else
- alignment = bdev_logical_block_size(iomap->bdev);
+ alignment = iomap_dio_alignment(inode, iomap->bdev, dio->flags);
if ((pos | length) & (alignment - 1))
return -EINVAL;
@@ -880,12 +894,350 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(__iomap_dio_rw);
+struct iomap_dio_simple_read {
+ struct kiocb *iocb;
+ size_t size;
+ unsigned int dio_flags;
+ atomic_t state;
+ union {
+ struct task_struct *waiter;
+ struct work_struct work;
+ };
+ /*
+ * Align @bio to a cacheline boundary so that, combined with the
+ * front_pad passed to bioset_init(), the bio sits at the start of
+ * a cacheline in memory returned by the (HWCACHE-aligned) bio
+ * slab. This keeps the hot fields block layer touches on submit
+ * and completion (bi_iter, bi_status, ...) within a single line.
+ */
+ struct bio bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_simple_read_pool;
+
+/*
+ * In the async simple read path, we need to prevent bio_endio() from
+ * triggering iocb->ki_complete() before the submitter has returned
+ * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
+ *
+ * We use a three-state rendezvous to synchronize the submitter and end_io:
+ *
+ * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
+ *
+ * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
+ * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
+ * ki_complete().
+ *
+ * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
+ * submit path. end_io sets this state and does nothing else. The submitter
+ * will see this state and handle the completion synchronously (bypassing
+ * ki_complete() and returning the actual result).
+ */
+enum {
+ IOMAP_DIO_SIMPLE_SUBMITTING = 0,
+ IOMAP_DIO_SIMPLE_QUEUED,
+ IOMAP_DIO_SIMPLE_DONE,
+};
+
+static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
+ struct bio *bio, ssize_t ret)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct iomap_dio_simple_read *sr = bio->bi_private;
+
+ if (likely(!ret)) {
+ ret = sr->size;
+ iocb->ki_pos += ret;
+ } else {
+ fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+ sr->size, ret, GFP_NOFS);
+ }
+
+ iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
+
+ return ret;
+}
+
+static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
+ struct bio *bio)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ WRITE_ONCE(iocb->private, NULL);
+
+ ret = iomap_dio_simple_read_finish(iocb, bio,
+ blk_status_to_errno(bio->bi_status));
+
+ inode_dio_end(inode);
+ trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+ return ret;
+}
+
+static void iomap_dio_simple_read_complete_work(struct work_struct *work)
+{
+ struct iomap_dio_simple_read *sr =
+ container_of(work, struct iomap_dio_simple_read, work);
+ struct kiocb *iocb = sr->iocb;
+ ssize_t ret;
+
+ ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
+ iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+{
+ struct kiocb *iocb = sr->iocb;
+
+ if (unlikely(sr->bio.bi_status)) {
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
+ queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
+ return;
+ }
+
+ iomap_dio_simple_read_complete_work(&sr->work);
+}
+
+static void iomap_dio_simple_read_end_io(struct bio *bio)
+{
+ struct iomap_dio_simple_read *sr = bio->bi_private;
+
+ if (sr->waiter) {
+ struct task_struct *waiter = sr->waiter;
+
+ WRITE_ONCE(sr->waiter, NULL);
+ blk_wake_io_task(waiter);
+ return;
+ }
+
+ if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
+ atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+ IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
+ iomap_dio_simple_read_async_done(sr);
+}
+
+static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
+ struct iov_iter *iter, unsigned int dio_flags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+
+ if (iov_iter_rw(iter) != READ)
+ return false;
+ /*
+ * Simple read is an optimization for small IO. Filter out large IO
+ * early as it's the most common case to fail for typical direct IO
+ * workloads.
+ */
+ if (count > inode->i_sb->s_blocksize)
+ return false;
+ if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+ return false;
+ if (iocb->ki_pos + count > i_size_read(inode))
+ return false;
+
+ return true;
+}
+
+static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
+ struct iov_iter *iter, const struct iomap_ops *ops,
+ void *private, unsigned int dio_flags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ size_t count = iov_iter_count(iter);
+ int nr_pages;
+ struct iomap_dio_simple_read *sr;
+ unsigned int alignment;
+ struct iomap_iter iomi = {
+ .inode = inode,
+ .pos = iocb->ki_pos,
+ .len = count,
+ .flags = IOMAP_DIRECT,
+ .private = private,
+ };
+ struct bio *bio;
+ bool wait_for_completion = is_sync_kiocb(iocb);
+ ssize_t ret;
+
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
+ else
+ nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iomi.flags |= IOMAP_NOWAIT;
+
+ ret = kiocb_write_and_wait(iocb, count);
+ if (ret)
+ return ret;
+
+ inode_dio_begin(inode);
+
+ ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+ &iomi.iomap, &iomi.srcmap);
+ if (ret) {
+ inode_dio_end(inode);
+ return ret;
+ }
+
+ if (iomi.iomap.type != IOMAP_MAPPED ||
+ iomi.iomap.offset > iomi.pos ||
+ iomi.iomap.offset + iomi.iomap.length < iomi.pos + count) {
+ ret = -ENOTBLK;
+ goto out_iomap_end;
+ }
+
+ alignment = iomap_dio_alignment(inode, iomi.iomap.bdev, dio_flags);
+ if ((iomi.pos | count) & (alignment - 1)) {
+ ret = -EINVAL;
+ goto out_iomap_end;
+ }
+
+ if (unlikely(!inode->i_sb->s_dio_done_wq)) {
+ ret = sb_init_dio_done_wq(inode->i_sb);
+ if (ret < 0)
+ goto out_iomap_end;
+ }
+
+ trace_iomap_dio_rw_begin(iocb, iter, dio_flags, 0);
+
+ if (user_backed_iter(iter))
+ dio_flags |= IOMAP_DIO_USER_BACKED;
+
+ bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+ REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+ GFP_KERNEL, &iomap_dio_simple_read_pool);
+ sr = container_of(bio, struct iomap_dio_simple_read, bio);
+
+ fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos >> inode->i_blkbits,
+ GFP_KERNEL);
+ sr->iocb = iocb;
+ sr->dio_flags = dio_flags;
+
+ bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+ bio->bi_ioprio = iocb->ki_ioprio;
+ bio->bi_private = sr;
+ bio->bi_end_io = iomap_dio_simple_read_end_io;
+
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ ret = bio_iov_iter_bounce(bio, iter);
+ else
+ ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+ if (unlikely(ret))
+ goto out_bio_put;
+
+ if (bio->bi_iter.bi_size != count) {
+ iov_iter_revert(iter, bio->bi_iter.bi_size);
+ ret = -ENOTBLK;
+ goto out_bio_release_pages;
+ }
+
+ sr->size = bio->bi_iter.bi_size;
+
+ if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
+ !(dio_flags & IOMAP_DIO_BOUNCE))
+ bio_set_pages_dirty(bio);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ bio->bi_opf |= REQ_NOWAIT;
+ if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
+ bio->bi_opf |= REQ_POLLED;
+ bio_set_polled(bio, iocb);
+ WRITE_ONCE(iocb->private, bio);
+ }
+
+ if (wait_for_completion) {
+ sr->waiter = current;
+ blk_crypto_submit_bio(bio);
+ } else {
+ atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
+ sr->waiter = NULL;
+ blk_crypto_submit_bio(bio);
+ ret = -EIOCBQUEUED;
+ }
+
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+ &iomi.iomap);
+
+ if (wait_for_completion) {
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!READ_ONCE(sr->waiter))
+ break;
+ blk_io_schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+
+ ret = iomap_dio_simple_read_finish(iocb, bio,
+ blk_status_to_errno(bio->bi_status));
+ inode_dio_end(inode);
+ trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
+ ret > 0 ? ret : 0);
+ } else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+ IOMAP_DIO_SIMPLE_QUEUED) ==
+ IOMAP_DIO_SIMPLE_DONE) {
+ ret = iomap_dio_simple_read_complete(iocb, bio);
+ } else {
+ trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+ }
+
+ return ret;
+
+out_bio_release_pages:
+ if (dio_flags & IOMAP_DIO_BOUNCE)
+ bio_iov_iter_unbounce(bio, true, false);
+ else
+ bio_release_pages(bio, false);
+out_bio_put:
+ bio_put(bio);
+out_iomap_end:
+ if (ops->iomap_end)
+ ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+ &iomi.iomap);
+ inode_dio_end(inode);
+ return ret;
+}
+
ssize_t
iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before)
{
struct iomap_dio *dio;
+ ssize_t ret;
+
+ /*
+ * Fast path for small, block-aligned reads that map to a single
+ * contiguous on-disk extent.
+ *
+ * @dops must be NULL: a non-NULL @dops means the caller wants its
+ * ->end_io / ->submit_io hooks invoked, and in particular wants its
+ * bios to be allocated from the filesystem-private @dops->bio_set
+ * (whose front_pad sizes a filesystem-private wrapper around the
+ * bio). The fast path instead allocates from the shared
+ * iomap_dio_simple_read_pool, whose front_pad matches
+ * struct iomap_dio_simple_read; the two wrappers are not
+ * interchangeable, so we must fall back to __iomap_dio_rw() in
+ * that case.
+ *
+ * @done_before must be zero: a non-zero caller-accumulated residual
+ * cannot be carried through a single-bio inline completion.
+ *
+ * -ENOTBLK is the private sentinel returned by iomap_dio_simple_read()
+ * when it decides the request does not fit the fast path.
+ * In that case we proceed to the generic __iomap_dio_rw() slow
+ * path. Any other errno is a real result and is propagated as-is,
+ * in particular -EAGAIN for IOCB_NOWAIT must reach the caller.
+ */
+ if (!dops && !done_before &&
+ iomap_dio_simple_read_supported(iocb, iter, dio_flags)) {
+ ret = iomap_dio_simple_read(iocb, iter, ops, private, dio_flags);
+ if (ret != -ENOTBLK)
+ return ret;
+ }
dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
done_before);
@@ -894,3 +1246,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
return iomap_dio_complete(dio);
}
EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+ return bioset_init(&iomap_dio_simple_read_pool, 4,
+ offsetof(struct iomap_dio_simple_read, bio),
+ BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+}
+fs_initcall(iomap_dio_init);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* [PATCH] dept: update documentation function names to match implementation
From: Yunseong Kim @ 2026-04-28 16:26 UTC (permalink / raw)
To: bagasdotme
Cc: 2407018371, Dai.Ngo, Liam.Howlett, a.hindborg, ada.coupriediaz,
adilger.kernel, akpm, alex.gaynor, alexander.shishkin, aliceryhl,
amir73il, andi.shyti, andrii, anna, arnd, ast, baolin.wang,
bigeasy, bjorn3_gh, boqun.feng, bp, brauner, broonie, bsegall,
byungchul, catalin.marinas, chenhuacai, chris.p.wilson,
christian.koenig, chuck.lever, cl, clrkwllms, corbet, da.gomez,
dakr, damien.lemoal, dan.j.williams, daniel.vetter, dave.hansen,
david, dennis, dietmar.eggemann, djwong, dri-devel, duyuyang,
dwmw, francesco, frederic, gary, geert+renesas, geert, gregkh,
guoweikang.kernel, gustavo, gwan-gyeong.mun, hamohammed.sa,
hannes, harry.yoo, hch, her0gyugyu, hpa, jack, jglisse,
jiangshanlai, jlayton, joel.granados, joel, joelagnelf,
johannes.berg, josef, josh, jpoimboe, juri.lelli, kees,
kernel-team, kernel_team, kevin.brodsky, kristina.martsenko,
lillian, linaro-mm-sig, link, linux-arch, linux-arm-kernel,
linux-block, linux-doc, linux-ext4, linux-fsdevel, linux-i2c,
linux-ide, linux-kernel, linux-media, linux-mm, linux-modules,
linux-nfs, linux-rt-devel, linux, longman, lorenzo.stoakes,
lossin, luto, mark.rutland, masahiroy, mathieu.desnoyers,
matthew.brost, max.byungchul.park, mcgrof, melissa.srw, mgorman,
mhocko, miguel.ojeda.sandonis, minchan, mingo, mjguzik,
neeraj.upadhyay, neil, neilb, netdev, ngupta, ojeda, okorniev,
oleg, paulmck, penberg, peterz, petr.pavlu, qiang.zhang, rcu,
richard.weiyang, rientjes, rodrigosiqueiramelo, rostedt, rppt,
rust-for-linux, samitolvanen, sashal, shakeel.butt, sj,
sumit.semwal, surenb, tglx, thomas.weissschuh, tim.c.chen, tj,
tmgross, tom, torvalds, trondmy, tytso, urezki, usamaarif642,
vbabka, vdavydov.dev, vincent.guittot, vschneid, wangfushuai,
wangkefeng.wang, will, willy, wsa+renesas, x86, yeoreum.yun, ysk,
yunseong.kim, yuzhao, ziy, Yunseong Kim
In-Reply-To: <aTN38kJjBftxnjm9@archie.me>
Synchronize function names in the documentation with the actual
implementation to fix naming inconsistencies.
Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>
---
Documentation/dev-tools/dept.rst | 2 +-
Documentation/dev-tools/dept_api.rst | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/dev-tools/dept.rst b/Documentation/dev-tools/dept.rst
index 333166464543..31b2fe629fab 100644
--- a/Documentation/dev-tools/dept.rst
+++ b/Documentation/dev-tools/dept.rst
@@ -97,7 +97,7 @@ No. What about the following?
mutex_lock A
mutex_lock A <- DEADLOCK
- wait_for_complete B <- DEADLOCK
+ wait_for_completion B <- DEADLOCK
complete B
mutex_unlock A
mutex_unlock A
diff --git a/Documentation/dev-tools/dept_api.rst b/Documentation/dev-tools/dept_api.rst
index 409116a62849..74e7b1424ad5 100644
--- a/Documentation/dev-tools/dept_api.rst
+++ b/Documentation/dev-tools/dept_api.rst
@@ -113,7 +113,7 @@ Do not use these APIs directly. The raw APIs of dept are:
dept_stage_wait(map, key, ip, wait_func, time);
dept_request_event_wait_commit();
dept_clean_stage();
- dept_stage_event(task, ip);
+ dept_ttwu_stage_wait(task, ip);
dept_ecxt_enter(map, evt_flags, ip, ecxt_func, evt_func, sub_local);
dept_ecxt_holding(map, evt_flags);
dept_request_event(map, ext_wgen);
--
2.53.0
^ permalink raw reply related
* Re: [BUG] ext4: BUG_ON in ext4_write_inline_data (fs/ext4/inline.c:240)
From: Demi Marie Obenour @ 2026-04-28 20:50 UTC (permalink / raw)
To: Theodore Tso
Cc: Zw Tang, Andreas Dilger, libaokun, jack, ojaswin, linux-ext4,
linux-kernel, yi.zhang, syzkaller-bugs
In-Reply-To: <20260426032211.GD22489@macsyma-wired.lan>
On 4/25/26 23:22, Theodore Tso wrote:
> On Sat, Apr 25, 2026 at 02:00:23PM -0400, Demi Marie Obenour wrote:
>>
>> Changing block devices that are mounted is also reachable via USB.
>> Yes, some distros may disable automount, but users who have stuff to
>> get done will mount USB devices anyway. Telling users "don't do this"
>> very rarely works in practice.
>
> How can an unprivileged user change the contents of a USB device while
> it is mounted?
>
> Are you positing evil USB devices that can return block contents A at
> time t, and block contents B at time t+1?
Correct.
> The threat model that we are using is that if the USB device is set to
> a particular state *before* the file system is mounted, and then the
> KGB scatters the USB device in the parking lot, and then someone picks
> up the USB device in the Raytheon parking lot, and says, "hey, free
> hardware", takes it into the classified machinem room, inserts it into
> the server, and mounts it. This might be considered likely or not
> likely, but speaking as someone who has been in a top secret machine
> room at a defense contractor, they were *way* less protected than what
> I've seen at a financial services company, or at a data center at a
> hyperscaler.
>
> But be that as it may, even *then* you're not modifying the block
> device while it is mounted.
>
>> 2. Harden the kernel filesystem drivers against malicious devices,
>> including TOCTOU.
>
> Malicious devices that have their own microcomputer and can change the
> block contents under the control of the attacker is *just* not
> something I care about. I also don't think it's a particularly
> realistic threat model.
This is an example of a BadUSB attack, which has been known since
at least 2014. USB sticks *do* have their own microcontrollers to
run their firmware. At least in the past this firmware has been
programmable and not been digitally signed. This means that the USB
stick *can* be reprogrammed to perform a TOCTOU attack on a filesystem,
or indeed to implement a completely different kind of USB device.
There are also attacks possible in confidential computing scenarios.
dm-verity protects against both tampering and replay attacks,
but it is read-only. dm-crypt and dm-integrity are writable, but
dm-crypt does not block tampering and dm-integrity does not block replay
of a previously stored sector.
Protecting against replay attacks requires a Merkle tree. The only
Linux filesystems that I know have one are ZFS, bcachefs, and BTRFS.
The first two are out of tree and the third is not shipped in RHEL
at least.
If dm-integrity is used, an attacker can return an old value that
was stored at a given block in the past. With only dm-crypt, an
attacker can replace any cipher block (typically 16 bytes) with
either its old contents or garbage the attacker doesn't control.
If this can be used to compromise a confidential VM, this violates
the confidential computing security boundary.
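To make the replay point concrete, here is a toy sketch. Everything in
it is an assumption made for illustration: struct device, dev_write(),
sector_tag_ok() and root_hash() are made-up names, and toy_hash() is
FNV-1a, which is NOT cryptographic and only stands in for a real MAC or
hash. Per-sector tags live on the untrusted device, so an old
(data, tag) pair still verifies. A root over all tags that is kept in
trusted memory changes on every write, so the replay shows up there.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSECT   2
#define SECT_SZ 8

/* FNV-1a: toy, NON-cryptographic stand-in for a real MAC or hash. */
static uint64_t toy_hash(const void *buf, size_t len, uint64_t seed)
{
	const unsigned char *p = buf;
	uint64_t h = 14695981039346656037ULL ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Everything in this struct is under the attacker's control. */
struct device {
	unsigned char data[NSECT][SECT_SZ];
	uint64_t tag[NSECT];	/* per-sector integrity tag */
};

/* Store a value in sector s and refresh that sector's tag. */
static void dev_write(struct device *d, int s, const char *val)
{
	size_t n = strlen(val);

	if (n > SECT_SZ)
		n = SECT_SZ;
	memset(d->data[s], 0, SECT_SZ);
	memcpy(d->data[s], val, n);
	d->tag[s] = toy_hash(d->data[s], SECT_SZ, (uint64_t)s);
}

/* Per-sector check: the tag matches the data stored next to it. */
static bool sector_tag_ok(const struct device *d, int s)
{
	return d->tag[s] == toy_hash(d->data[s], SECT_SZ, (uint64_t)s);
}

/* One-level tree root over all tags; the caller keeps it off-device. */
static uint64_t root_hash(const struct device *d)
{
	return toy_hash(d->tag, sizeof(d->tag), 0);
}
```

A caller would record root_hash() after each write and compare it on
read. A replayed sector passes sector_tag_ok() but no longer matches the
recorded root.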
Finally, this can be used to violate kernel lockdown, breaking the
guarantees UEFI secure boot is supposed to provide. Whether that is
worth caring about is a totally different question of course.
Of course, you are free to choose which (if any) of these attacks you
care about. One can argue that USB sticks should be mounted in userspace,
UEFI secure boot with Microsoft keys is irrelevant as long as
administrator -> kernel isn't a security boundary on Windows, and that
confidential computing only makes sense for stateless workloads (which
can use dm-verity) until there is a way to trust storage devices.
But it's always best to be aware that an attack vector exists,
whether or not one chooses to address it.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
^ permalink raw reply
* Re: [BUG] ext4: BUG_ON in ext4_write_inline_data (fs/ext4/inline.c:240)
From: Theodore Tso @ 2026-04-29 4:40 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Zw Tang, Andreas Dilger, libaokun, jack, ojaswin, linux-ext4,
linux-kernel, yi.zhang, syzkaller-bugs
In-Reply-To: <5981069b-8c54-4b5b-a808-5ebdc8cd7265@gmail.com>
On Tue, Apr 28, 2026 at 04:50:14PM -0400, Demi Marie Obenour wrote:
> This is an example of a BadUSB attack, which has been known since
> at least 2014. USB sticks *do* have their own microcontrollers to
> run their firmware. At least in the past this firmware has been
> programmable and not been digitally signed. This means that the USB
> stick *can* be reprogrammed to perform a TOCTOU attack on a filesystem,
> or indeed to implement a completely different kind of USB device.
Honestly, if that's what you are worried about, then the best solution is
to put epoxy in every single USB port. I've seen financial institutions
that have done precisely this, and both Android and Chrome OS support
enterprise security policies which do the equivalent in software.
> Protecting against replay attacks requires a Merkle tree. The only
> Linux filesystems that I know have one are ZFS, bcachefs, and BTRFS.
> The first two are out of tree and the third is not shipped in RHEL
> at least.
FYI, there was a patch for btrfs, but it was never landed. It's
unclear how many people would be willing to pay the performance tax of
using hmac-sha256 for every single data and metadata block write.
> Of course, you are free to choose which (if any) of these attacks you
> care about. One can that USB sticks should be mounted in userspace,
> UEFI secure boot with Microsoft keys is irrelevant as long as
> administrator -> kernel isn't a security boundary on Windows, and that
> confidential computing only makes sense for stateless workloads (which
> can use dm-verity) until there is a way to trust storage devices.
> But it's always best to be aware that an attack vector exists,
> whether or not one chooses to address it.
Sure, but a drive-by comment on a patch review advocating that we slow
down ext4 to protect a single instance where the attacker has
read/write access to a mounted block device, when the file system
doesn't have generalized protections against that whole class of
attacks.... isn't particularly helpful.
By the way, I'm not aware of *any* company that has been interested in
funding work to protect against this class of attacks. Given that
most file system developers prefer food with their meals, and have
enough *other* unfunded mandates from our user community, it doesn't
seem likely that we're going to see much forward progress towards your
desires/interests.
Cheers,
- Ted
^ permalink raw reply
* Re: [BUG] ext4: BUG_ON in ext4_write_inline_data (fs/ext4/inline.c:240)
From: Demi Marie Obenour @ 2026-04-29 5:32 UTC (permalink / raw)
To: Theodore Tso
Cc: Zw Tang, Andreas Dilger, libaokun, jack, ojaswin, linux-ext4,
linux-kernel, yi.zhang, syzkaller-bugs, qubes-users,
Spectrum OS Discussion
In-Reply-To: <20260429044030.GB16497@macsyma-wired.lan>
On 4/29/26 00:40, Theodore Tso wrote:
> On Tue, Apr 28, 2026 at 04:50:14PM -0400, Demi Marie Obenour wrote:
>> This is an example of a BadUSB attack, which has been known since
>> at least 2014. USB sticks *do* have their own microcontrollers to
>> run their firmware. At least in the past this firmware has been
>> programmable and not been digitally signed. This means that the USB
>> stick *can* be reprogrammed to perform a TOCTOU attack on a filesystem,
>> or indeed to implement a completely different kind of USB device.
>
> Honestly, if that's what you are worried about, then best solution is
> put epoxy in every single USB port. I've since financial institutions
> that have done precisely this, and both Android and Chrome OS supports
> enterprise security policies which does the equivalent in software.
That works well in enterprises using laptops, tablets, or mobile
devices. Enterprises can require that all devices have built-in,
non-USB input devices or touchscreens. Furthermore, corporate
environments use network-based backup and file sharing. So there
is hardly a need for USB except for security tokens, smart card
readers, OS installation media, and miscellaneous hardware devices
that generally will not be hotplugged. Only the third is a block
device, it is trusted, and it can be a USB stick with a physical
write-protect switch and signed firmware.
>> Protecting against replay attacks requires a Merkle tree. The only
>> Linux filesystems that I know have one are ZFS, bcachefs, and BTRFS.
>> The first two are out of tree and the third is not shipped in RHEL
>> at least.
>
> FYI, there was a patch for btrfs, but it was never landed. It's
> unclear how many people would be willing to pay the performance tax of
> using hmac-sha256 for every single data and metadata block write.
I thought BTRFS already had one. In any case corrupting an encrypted
disk will cause a checksum failure with fairly high probability,
as a 128-bit region is completely scrambled.
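To make the replay point concrete, here is a minimal bash sketch. It is
not taken from any of the filesystems named above. The helper name
merkle_root, the 4096-byte block size, and the choice to hash pairs of
hex digests with sha256sum are all assumptions made for illustration;
real implementations hash raw digests and keep the tree on disk. What it
tries to show: if an attacker puts an old block back, a checksum stored
next to that block comes back with it and still matches, but the root no
longer matches a copy kept somewhere the attacker cannot write.

```shell
#!/bin/bash
# Hypothetical helper: print the SHA-256 Merkle root of file $1,
# split into blocks of $2 bytes.
merkle_root()
{
	local file="$1"
	local blksz="$2"
	local size nblocks i
	local -a level next

	size=$(stat -c %s "$file")
	nblocks=$(( (size + blksz - 1) / blksz ))

	# Leaf level: one hash per data block.
	for ((i = 0; i < nblocks; i++)); do
		level[i]=$(dd if="$file" bs="$blksz" skip="$i" count=1 2>/dev/null |
			   sha256sum | cut -d' ' -f1)
	done

	# Hash adjacent pairs until one root remains; an odd node is
	# carried up to the next level unchanged.
	while ((${#level[@]} > 1)); do
		next=()
		for ((i = 0; i < ${#level[@]}; i += 2)); do
			if ((i + 1 < ${#level[@]})); then
				next+=("$(printf '%s%s' "${level[i]}" "${level[i+1]}" |
					  sha256sum | cut -d' ' -f1)")
			else
				next+=("${level[i]}")
			fi
		done
		level=("${next[@]}")
	done
	echo "${level[0]}"
}

disk=$(mktemp)
trap 'rm -f "$disk" "$disk.old"' EXIT

# Version 1 of a two-block "disk": block 0 is all 'A', block 1 all 'B'.
{ head -c 4096 /dev/zero | tr '\0' 'A'
  head -c 4096 /dev/zero | tr '\0' 'B'; } > "$disk"
old_root=$(merkle_root "$disk" 4096)
cp "$disk" "$disk.old"

# Legitimate update: block 1 becomes all 'C'. The new root is what a
# trusted party would remember.
head -c 4096 /dev/zero | tr '\0' 'C' |
	dd of="$disk" bs=4096 seek=1 conv=notrunc 2>/dev/null
trusted_root=$(merkle_root "$disk" 4096)

# Replay: the attacker restores the old contents of block 1.
dd if="$disk.old" of="$disk" bs=4096 skip=1 seek=1 count=1 \
	conv=notrunc 2>/dev/null

if [ "$(merkle_root "$disk" 4096)" != "$trusted_root" ]; then
	echo "replay detected"
fi
```

The sketch recomputes the whole tree on every call, which is fine for
two blocks; the reason to keep a tree rather than one flat hash is that
a real filesystem only has to rehash the path from a changed block up
to the root.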
>> Of course, you are free to choose which (if any) of these attacks you
>> care about. One can argue that USB sticks should be mounted in userspace,
>> UEFI secure boot with Microsoft keys is irrelevant as long as
>> administrator -> kernel isn't a security boundary on Windows, and that
>> confidential computing only makes sense for stateless workloads (which
>> can use dm-verity) until there is a way to trust storage devices.
>> But it's always best to be aware that an attack vector exists,
>> whether or not one chooses to address it.
>
> Sure, but a drive-by comment on a patch review advocating that we slow
> down ext4 to protect a single instance where the attacker has
> read/write access to a mounted block device, when the file system
> doesn't have generalized protections against that whole class of
> attacks.... isn't particularly helpful.
My main goal is to point out that the attack vector does exist.
If nothing else, hopefully this will persuade distro maintainers
to switch to using libguestfs + FUSE as the default way to mount
USB drives. That isolates the driver in a VM.
Qubes OS users can already mount the device in a disposable VM,
and presumably many of them already do that. Again, this provides
strong isolation and severely limits the impact of an exploit.
I do wonder if this could be used against confidential computing
workloads. That said, work there would more likely be put into
allowing them to attest their storage.
> By the way, I'm not aware of *any* company that has been interested in
> funding work to protect against this class of attacks. Given that
> most file system developers prefer food with their meals, and have
> enough *other* unfunded mandates from our user community, it doesn't
> seem likely that we're going to see much forward progress towards your
> desires/interests.
I'm not surprised at all.
The people who would benefit the most from this work are consumers
who are running desktop Linux on general purpose computers they own.
I have spent most of my career providing secure solutions for these
people. I worked on Qubes OS for several years, and I now work
on Spectrum.
Unfortunately, this group has very little budget and so very little
market power. Therefore, work that benefits them is perpetually
and massively underfunded. Grants do exist, and crowdfunding might
also be an option. However, unless one is very passionate about the
client space, it is hard to resist much better-paying jobs in the
server world.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [PATCH v3] generic/790: test post-EOF gap zeroing persistence
From: Brian Foster @ 2026-04-29 11:24 UTC (permalink / raw)
To: Zhang Yi
Cc: fstests, zlang, linux-ext4, linux-fsdevel, jack, yi.zhang,
yizhang089, yangerkun
In-Reply-To: <20260428085750.1072612-1-yi.zhang@huaweicloud.com>
On Tue, Apr 28, 2026 at 04:57:50PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Test that extending a file past a non-block-aligned EOF correctly
> zero-fills the gap [old_EOF, block_boundary), and that this zeroing
> persists through a filesystem shutdown+remount cycle.
>
> Stale data beyond EOF can persist on disk when append write data blocks
> are flushed before the on-disk file size update, or when concurrent
> append writeback and mmap writes persist non-zero data past EOF.
> Subsequent post-EOF operations (append write, fallocate, truncate up)
> must zero-fill and persist the gap to prevent exposing stale data.
>
> The test pollutes the file's last physical block (via FIEMAP + raw
> device write) with a sentinel pattern beyond i_size, then performs each
> extend operation and verifies the gap is zeroed both in memory and on
> disk.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> v2->v3:
> - Add error check for the raw device pwrite, a failed pwrite would
> silently leave the test continuing with an unpolluted block,
> producing false-positive passes.
> - Add sync_range -a to wait until the extending I/O completes and to
> ensure file size update is persisted before shutdown, preventing
> unexpected file size errors.
> v1->v2:
> - Add _require_no_realtime to prevent testing on XFS realtime devices,
> where file data may reside on $SCRATCH_RTDEV.
> - Add _exclude_fs btrfs since FIEMAP returns logical addresses, not
> physical device offsets, writing to these offsets on $SCRATCH_DEV
> would corrupt the filesystem in multi-device setups. Besides, since
> btrfs doesn't support shutdown right now, we can support it later.
> - Add -v flag to od in _check_gap_zero() to prevent line folding of
> identical consecutive lines.
> - Add expected_new_sz parameter to _test_eof_zeroing(), verify file
> size was not rolled back after shutdown+remount cycle, and also drop
> the unnecessary file size check before the shutdown as well.
> - Clarify the comment regarding when stale data beyond EOF can persist.
>
> tests/generic/790 | 168 ++++++++++++++++++++++++++++++++++++++++++
> tests/generic/790.out | 4 +
> 2 files changed, 172 insertions(+)
> create mode 100755 tests/generic/790
> create mode 100644 tests/generic/790.out
>
> diff --git a/tests/generic/790 b/tests/generic/790
> new file mode 100755
> index 00000000..6daf3793
> --- /dev/null
> +++ b/tests/generic/790
> @@ -0,0 +1,168 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2026 Huawei. All Rights Reserved.
> +#
> +# FS QA Test No. 790
> +#
> +# Test that extending a file past a non-block-aligned EOF correctly zero-fills
> +# the gap [old_EOF, block_boundary), and that this zeroing persists through a
> +# filesystem shutdown+remount cycle.
> +#
> +# Stale data beyond EOF can persist on disk when:
> +# 1) append write data blocks are flushed before the on-disk file size update,
> +# and the system crashes in this window.
> +# 2) concurrent append writeback and mmap writes persist non-zero data past EOF.
> +#
> +# Subsequent post-EOF operations (append write, fallocate, truncate up) must
> +# zero-fill and persist the gap to prevent exposing stale data.
> +#
> +# The test pollutes the file's last physical block (via FIEMAP + raw device
> +# write) with a sentinel pattern beyond i_size, then performs each extend
> +# operation and verifies the gap is zeroed both in memory and on disk.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick rw shutdown
> +
> +. ./common/filter
> +
> +_require_scratch
> +_require_block_device $SCRATCH_DEV
> +_require_no_realtime
> +_require_scratch_shutdown
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +# FIEMAP on Btrfs returns logical addresses within the filesystem's address
> +# space, not physical device offsets. Writing to these offsets on $SCRATCH_DEV
> +# would corrupt the filesystem in multi-device setups.
> +_exclude_fs btrfs
> +
> +_require_xfs_io_command "fiemap"
> +_require_xfs_io_command "falloc"
> +_require_xfs_io_command "pwrite"
> +_require_xfs_io_command "truncate"
> +_require_xfs_io_command "sync_range"
> +
> +# Check that gap region [offset, offset+nbytes) is entirely zero
> +_check_gap_zero()
> +{
> + local file="$1"
> + local offset="$2"
> + local nbytes="$3"
> + local label="$4"
> + local data
> + local stripped
> +
> + data=$(od -A n -t x1 -v -j $offset -N $nbytes "$file" 2>/dev/null)
> +
> + # Remove whitespace and check if any byte is non-zero
> + stripped=$(printf '%s' "$data" | tr -d ' \n\t')
> + if [ -n "$stripped" ] && ! echo "$stripped" | grep -qE "^0+$"; then
> + echo "FAIL: non-zero data in gap [$offset,$((offset + nbytes))) $label"
> + _hexdump -N $((offset + nbytes)) "$file"
> + return 1
> + fi
> + return 0
> +}
> +
> +# Get the physical block offset (in bytes) of the file's first block on device
> +_get_phys_offset()
> +{
> + local file="$1"
> + local fiemap_output
> + local phys_blk
> +
> + fiemap_output=$($XFS_IO_PROG -r -c "fiemap -v" "$file" 2>/dev/null)
> + phys_blk=$(echo "$fiemap_output" | _filter_xfs_io_fiemap | head -1 | awk '{print $3}')
> + if [ -z "$phys_blk" ]; then
> + echo ""
> + return
> + fi
> + # Convert 512-byte blocks to bytes
> + echo $((phys_blk * 512))
> +}
> +
> +_test_eof_zeroing()
> +{
> + local test_name="$1"
> + local extend_cmd="$2"
> + local expected_new_sz="$3"
> + local file=$SCRATCH_MNT/testfile_${test_name}
> +
> + echo "$test_name" | tee -a $seqres.full
> +
> + # Compute non-block-aligned EOF offset
> + local gap_bytes=16
> + local eof_offset=$((blksz - gap_bytes))
> +
> + # Step 1: Write one full block to ensure the filesystem allocates a
> + # physical block for the file instead of using inline data.
> + $XFS_IO_PROG -f -c "pwrite -S 0x5a 0 $blksz" -c fsync \
> + "$file" >> $seqres.full 2>&1
> +
> + # Step 2: Get physical block offset on device via FIEMAP
> + local phys_offset
> + phys_offset=$(_get_phys_offset "$file")
> + if [ -z "$phys_offset" ]; then
> + _fail "$test_name: failed to get physical block offset via fiemap"
> + fi
> +
> + # Step 3: Truncate file to non-block-aligned size and fsync.
> + # The on-disk region [eof_offset, blksz) may or may not be
> + # zeroed by the filesystem at this point.
> + $XFS_IO_PROG -c "truncate $eof_offset" -c fsync \
> + "$file" >> $seqres.full 2>&1
> +
> + # Step 4: Unmount and restore the physical block to all-0x5a on disk.
> + # This bypasses the kernel's pagecache EOF-zeroing to ensure
> + # the stale pattern is present on disk. Then remount.
> + _scratch_unmount
> + $XFS_IO_PROG -d -c "pwrite -S 0x5a $phys_offset $blksz" \
> + $SCRATCH_DEV >> $seqres.full 2>&1
> + if [ $? -ne 0 ]; then
> + _fail "$test_name: failed to inject stale data on disk"
> + fi
> + _scratch_mount >> $seqres.full 2>&1
> +
> + # Step 5: Execute the extend operation.
> + $XFS_IO_PROG -c "$extend_cmd" "$file" >> $seqres.full 2>&1
> +
> + # Step 6: Verify gap [eof_offset, blksz) is zeroed BEFORE shutdown
> + _check_gap_zero "$file" $eof_offset $gap_bytes "before shutdown" || return 1
> +
> + # Step 7: Sync the extended range and shutdown the filesystem with
> + # journal flush. This persists the file size extending, and
> + # the filesystem should persist the zeroed data in the gap
> + # range as well.
> + if [ "$extend_cmd" != "${extend_cmd#pwrite}" ]; then
> + $XFS_IO_PROG -c "sync_range -w $blksz $blksz" \
> + -c "sync_range -a $blksz $blksz" \
> + "$file" >> $seqres.full 2>&1
> + fi
> + _scratch_shutdown -f
> +
> + # Step 8: Remount and verify gap is still zeroed
> + _scratch_cycle_mount
> +
> + # Verify file size was not rolled back after shutdown+remount
> + local sz
> + sz=$(stat -c %s "$file")
> + if [ "$sz" -ne "$expected_new_sz" ]; then
> + _fail "$test_name: file size rolled back after shutdown+remount: $sz != $expected_new_sz"
> + fi
> +
> + _check_gap_zero "$file" $eof_offset $gap_bytes "after shutdown+remount" || return 1
> +}
> +
> +_scratch_mkfs >> $seqres.full 2>&1
> +_scratch_mount
> +
> +blksz=$(_get_block_size $SCRATCH_MNT)
> +
> +# Test three variants of EOF-extending operations
> +_test_eof_zeroing "append_write" "pwrite -S 0x42 $blksz $blksz" $((blksz * 2))
> +_test_eof_zeroing "truncate_up" "truncate $((blksz * 2))" $((blksz * 2))
> +_test_eof_zeroing "fallocate" "falloc $blksz $blksz" $((blksz * 2))
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/790.out b/tests/generic/790.out
> new file mode 100644
> index 00000000..e5e2cc09
> --- /dev/null
> +++ b/tests/generic/790.out
> @@ -0,0 +1,4 @@
> +QA output created by 790
> +append_write
> +truncate_up
> +fallocate
> --
> 2.52.0
>
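For anyone who wants to poke at the gap check outside fstests, the idea
behind _check_gap_zero can be sketched in plain shell. This is only a
sketch and not part of the patch: gap_is_zero is a hypothetical helper
name, the 4096-byte block size is assumed, and it counts non-NUL bytes
with tr and wc instead of parsing od output. It runs on an ordinary
temporary file, so it exercises only the read-back view after a
truncate up and none of the raw-device pollution or shutdown+remount
persistence that generic/790 verifies.

```shell
#!/bin/bash
# Hypothetical helper (not from the patch): succeed if every byte in
# [offset, offset+nbytes) of $file is zero. Like the od-based check,
# a range past EOF yields no data and therefore passes.
gap_is_zero()
{
	local file="$1"
	local offset="$2"
	local nbytes="$3"
	local nonzero

	# Count the bytes in the range that are not NUL.
	nonzero=$(dd if="$file" bs=1 skip="$offset" count="$nbytes" 2>/dev/null |
		  tr -d '\0' | wc -c)
	[ "$nonzero" -eq 0 ]
}

demo_file=$(mktemp)
trap 'rm -f "$demo_file"' EXIT

blksz=4096		# assumed block size, for illustration only
gap_bytes=16
eof_offset=$((blksz - gap_bytes))

# Fill one block with 0x5a ('Z', octal 132), then shrink the file to a
# non-block-aligned EOF.
head -c "$blksz" /dev/zero | tr '\0' '\132' > "$demo_file"
truncate -s "$eof_offset" "$demo_file"

# Extend the file; the gap [eof_offset, blksz) must read back as zeroes.
truncate -s $((blksz * 2)) "$demo_file"
gap_is_zero "$demo_file" "$eof_offset" "$gap_bytes" && echo "gap is zeroed"
```

Counting non-NUL bytes avoids the line-folding problem that the -v flag
to od works around in the test itself.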
* [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers
From: Darrick J. Wong @ 2026-04-29 14:11 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, fuse-devel
Cc: Miklos Szeredi, Bernd Schubert, Joanne Koong, Theodore Ts'o,
Neal Gompa, Amir Goldstein, Christian Brauner
In-Reply-To: <20260223224617.GA2390314@frogsfrogsfrogs>
Hi everyone,
This is the eighth public draft of a prototype to connect the Linux
fuse driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices. With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.
This effort is now separate from the one to run fuse servers in a
constrained environment via systemd. Putting fuse servers in a
container gets you all the blast radii reduction advantages and provides
a pathway to removing less popular filesystem drivers to reduce
maintenance work in the kernel; now we want to relax some of that
isolation in exchange for better performance.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server can upsert mappings
into the kernel for cached access (== zero upcalls for rereads and pure
overwrites!) and the iomap cache revalidation code works.
At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its
streaming buffered IO performance. Random buffered IO is about 85% as
fast as the kernel. Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details. Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
This series has been rebased to 7.1-rc1 since the seventh RFC, but it
has not otherwise changed much. Most changes happened in userspace:
1. I've written some example fuse-iomap servers, so I now have a vehicle
for testing that out of place writes work (they do) and that inline
data works.
2. Ted has started merging the very large quantity of fuse2fs
improvements into e2fsprogs.
3. I reordered the systemd service container patchset towards master
because the maintainer indicated that he wanted to merge it.
There are some questions remaining:
a. I would like to continue the discussion about how the design review
of this code should be structured, and how might I go about creating
new userspace filesystem servers -- lightweight new ones based off
the existing userspace tools? Or by merging lklfuse?
b. fuse2fs doesn't support the ext4 journal. Urk.
c. I've dropped the fstests and BPF parts of the patchbomb because v7
was just way too long. I'm also not including some extra
enhancements to fuse4fs, also for brevity.
I would like to get the main parts of this submission reviewed for 7.2
now that this has been collecting comments and tweaks in non-rfc status
for 5.5 months.
Kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container
libfuse:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache
e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim
fstests:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs
--Darrick
Unreviewed patches:
[PATCHSET v8 1/8] fuse: general bug fixes
[PATCH 3/4] fuse: update file mode when updating acls
[PATCH 4/4] fuse: propagate default and file acls on creation
[PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support
[PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
[PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support
[PATCH 1/2] fuse: move the passthrough-specific code back to
[PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO
[PATCH 01/33] fuse: implement the basic iomap mechanisms
[PATCH 02/33] fuse_trace: implement the basic iomap mechanisms
[PATCH 03/33] fuse: make debugging configurable at runtime
[PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add
[PATCH 05/33] fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to
[PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything
[PATCH 07/33] fuse: clean up per-file type inode initialization
[PATCH 08/33] fuse: create a per-inode flag for setting exclusive
[PATCH 10/33] fuse_trace: create a per-inode flag for toggling iomap
[PATCH 11/33] fuse: isolate the other regular file IO paths from
[PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP
[PATCH 13/33] fuse_trace: implement basic iomap reporting such as
[PATCH 14/33] fuse: implement direct IO with iomap
[PATCH 15/33] fuse_trace: implement direct IO with iomap
[PATCH 16/33] fuse: implement buffered IO with iomap
[PATCH 17/33] fuse_trace: implement buffered IO with iomap
[PATCH 18/33] fuse: use an unrestricted backing device with iomap
[PATCH 20/33] fuse: advertise support for iomap
[PATCH 21/33] fuse: query filesystem geometry when using iomap
[PATCH 22/33] fuse_trace: query filesystem geometry when using iomap
[PATCH 23/33] fuse: implement fadvise for iomap files
[PATCH 24/33] fuse: invalidate ranges of block devices being used for
[PATCH 25/33] fuse_trace: invalidate ranges of block devices being
[PATCH 26/33] fuse: implement inline data file IO via iomap
[PATCH 27/33] fuse_trace: implement inline data file IO via iomap
[PATCH 28/33] fuse: allow more statx fields
[PATCH 29/33] fuse: support atomic writes with iomap
[PATCH 30/33] fuse_trace: support atomic writes with iomap
[PATCH 31/33] fuse: disable direct fs reclaim for any fuse server
[PATCH 32/33] fuse: enable swapfile activation on iomap
[PATCH 33/33] fuse: implement freeze and shutdowns for iomap
[PATCHSET v8 5/8] fuse: allow servers to specify root node id
[PATCH 1/3] fuse: make the root nodeid dynamic
[PATCH 2/3] fuse_trace: make the root nodeid dynamic
[PATCH 3/3] fuse: allow setting of root nodeid
[PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when
[PATCH 1/9] fuse: enable caching of timestamps
[PATCH 2/9] fuse: force a ctime update after a fileattr_set call when
[PATCH 3/9] fuse: allow local filesystems to set some VFS iflags
[PATCH 4/9] fuse_trace: allow local filesystems to set some VFS
[PATCH 5/9] fuse: cache atime when in iomap mode
[PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap
[PATCH 7/9] fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for
[PATCH 8/9] fuse: update ctime when updating acls on an iomap inode
[PATCH 9/9] fuse: always cache ACLs when using iomap
[PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO
[PATCH 01/12] fuse: cache iomaps
[PATCH 02/12] fuse_trace: cache iomaps
[PATCH 03/12] fuse: use the iomap cache for iomap_begin
[PATCH 04/12] fuse_trace: use the iomap cache for iomap_begin
[PATCH 05/12] fuse: invalidate iomap cache after file updates
[PATCH 06/12] fuse_trace: invalidate iomap cache after file updates
[PATCH 07/12] fuse: enable iomap cache management
[PATCH 08/12] fuse_trace: enable iomap cache management
[PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode
[PATCH 10/12] fuse: constrain iomap mapping cache size
[PATCH 11/12] fuse_trace: constrain iomap mapping cache size
[PATCH 12/12] fuse: enable iomap
[PATCHSET v8 8/8] fuse: run fuse servers as a contained service
[PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap
[PATCH 2/2] fuse: set iomap backing device block size
[PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file
[PATCH 01/25] libfuse: bump kernel and library ABI versions
[PATCH 02/25] libfuse: wait in do_destroy until all open files are
[PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP
[PATCH 04/25] libfuse: add fuse commands for iomap_begin and end
[PATCH 05/25] libfuse: add upper level iomap commands
[PATCH 06/25] libfuse: add a lowlevel notification to add a new
[PATCH 07/25] libfuse: add upper-level iomap add device function
[PATCH 08/25] libfuse: add iomap ioend low level handler
[PATCH 09/25] libfuse: add upper level iomap ioend commands
[PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to
[PATCH 11/25] libfuse: connect high level fuse library to
[PATCH 12/25] libfuse: support enabling exclusive mode for files
[PATCH 13/25] libfuse: support direct I/O through iomap
[PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the
[PATCH 15/25] libfuse: allow discovery of the kernel's iomap
[PATCH 16/25] libfuse: add lower level iomap_config implementation
[PATCH 17/25] libfuse: add upper level iomap_config implementation
[PATCH 18/25] libfuse: add low level code to invalidate iomap block
[PATCH 19/25] libfuse: add upper-level API to invalidate parts of an
[PATCH 20/25] libfuse: add atomic write support
[PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write
[PATCH 22/25] libfuse: create a helper to transform an open regular
[PATCH 23/25] libfuse: add swapfile support for iomap files
[PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw,
[PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw,
[PATCHSET v8 2/6] libfuse: allow servers to specify root node id
[PATCH 1/1] libfuse: allow root_nodeid mount option
[PATCHSET v8 3/6] libfuse: implement syncfs
[PATCH 1/2] libfuse: add strictatime/lazytime mount options
[PATCH 2/2] libfuse: set sync, immutable,
[PATCHSET v8 4/6] libfuse: add some service helper commands for iomap
[PATCH 1/3] mount_service: delegate iomap privilege from
[PATCH 2/3] libfuse: enable setting iomap block device block size
[PATCH 3/3] mount_service: create loop devices for regular files
[PATCHSET v8 5/6] fuse: add sample iomap fuse servers
[PATCH 1/7] example/iomap_ll: create a simple iomap server
[PATCH 2/7] example/iomap_ll: track block state
[PATCH 3/7] example/iomap_ll: implement atomic writes
[PATCH 4/7] example/iomap_inline_ll: create a simple server to test
[PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place
[PATCH 6/7] example/iomap_ow_ll: implement atomic writes
[PATCH 7/7] example/iomap_service_ll: create a sample systemd service
[PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file
[PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse
[PATCH 2/9] libfuse: add upper-level iomap cache management
[PATCH 3/9] libfuse: allow constraining of iomap mapping cache size
[PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint
[PATCH 5/9] libfuse: enable iomap
[PATCH 6/9] example/iomap_ll: cache mappings for later
[PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel
[PATCH 8/9] example/iomap_ow_ll: cache iomappings in the kernel
[PATCH 9/9] example/iomap_service_ll: cache iomappings in the kernel
[PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support
[PATCH 1/5] libext2fs: invalidate cached blocks when freeing them
[PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte
[PATCH 3/5] libext2fs: allow unix_write_byte when the write would be
[PATCH 4/5] libext2fs: allow clients to ask to write full superblocks
[PATCH 5/5] libext2fs: allow callers to disallow I/O to file data
[PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file
[PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping
[PATCH 02/19] fuse2fs: add iomap= mount option
[PATCH 03/19] fuse2fs: implement iomap configuration
[PATCH 04/19] fuse2fs: register block devices for use with iomap
[PATCH 05/19] fuse2fs: implement directio file reads
[PATCH 06/19] fuse2fs: add extent dump function for debugging
[PATCH 07/19] fuse2fs: implement direct write support
[PATCH 08/19] fuse2fs: turn on iomap for pagecache IO
[PATCH 09/19] fuse2fs: don't zero bytes in punch hole
[PATCH 10/19] fuse2fs: don't do file data block IO when iomap is
[PATCH 11/19] fuse2fs: try to create loop device when ext4 device is
[PATCH 12/19] fuse2fs: enable file IO to inline data files
[PATCH 13/19] fuse2fs: set iomap-related inode flags
[PATCH 14/19] fuse2fs: configure block device block size
[PATCH 15/19] fuse4fs: separate invalidation
[PATCH 16/19] fuse2fs: implement statx
[PATCH 17/19] fuse2fs: enable atomic writes
[PATCH 18/19] fuse4fs: disable fs reclaim and write throttling
[PATCH 19/19] fuse2fs: implement freeze and shutdown requests
[PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services
[PATCH 1/3] fuse4fs: configure iomap when running as a service
[PATCH 2/3] fuse4fs: set iomap backing device blocksize
[PATCH 3/3] fuse4fs: ask for loop devices when opening via
[PATCHSET v8 4/6] fuse4fs: specify the root node id
[PATCH 1/1] fuse4fs: don't use inode number translation when possible
[PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when
[PATCH 01/10] fuse2fs: add strictatime/lazytime mount options
[PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap
[PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
[PATCH 04/10] fuse2fs: better debugging for file mode updates
[PATCH 05/10] fuse2fs: debug timestamp updates
[PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
[PATCH 07/10] fuse2fs: add tracing for retrieving timestamps
[PATCH 08/10] fuse2fs: enable syncfs
[PATCH 09/10] fuse2fs: set sync, immutable,
[PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode
[PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file
[PATCH 1/4] fuse2fs: enable caching of iomaps
[PATCH 2/4] fuse2fs: constrain iomap mapping cache size
[PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open
[PATCH 4/4] fuse2fs: enable iomap
* [PATCHBLIZZARD v8] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers
From: Darrick J. Wong @ 2026-04-29 14:12 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4, fuse-devel
Cc: Miklos Szeredi, Bernd Schubert, Joanne Koong, Theodore Ts'o,
Neal Gompa, Amir Goldstein, Christian Brauner
[let's send this as a separate thread]
Hi everyone,
This is the eighth public draft of a prototype to connect the Linux
fuse driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices. With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.
This effort is now separate from the one to run fuse servers in a
constrained environment via systemd. Putting fuse servers in a
container gets you all the blast radii reduction advantages and provides
a pathway to removing less popular filesystem drivers to reduce
maintenance work in the kernel; now we want to relax some of that
isolation in exchange for better performance.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server can upsert mappings
into the kernel for cached access (== zero upcalls for rereads and pure
overwrites!) and the iomap cache revalidation code works.
At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its
streaming buffered IO performance. Random buffered IO is about 85% as
fast as the kernel. Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details. Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
This series has been rebased to 7.1-rc1 since the seventh RFC, but it
has not otherwise changed much. Most changes happened in userspace
this time:
1. I've written some example fuse-iomap servers, so I now have a vehicle
for testing that out of place writes work (they do) and that inline
data works.
2. Ted has started merging the very large quantity of fuse2fs
improvements into e2fsprogs.
3. I reordered the systemd service container patchset towards master
because the maintainer indicated that he wanted to merge it.
There are some questions remaining:
a. I would like to continue the discussion about how the design review
of this code should be structured, and how might I go about creating
new userspace filesystem servers -- lightweight new ones based off
the existing userspace tools? Or by merging lklfuse?
b. fuse2fs doesn't support the ext4 journal. Urk.
c. I've dropped the fstests and BPF parts of the patchbomb because v7
was just way too long. I'm also not including some extra
enhancements to fuse4fs, also for brevity.
I would like to get the main parts of this submission reviewed for 7.2
now that this has been collecting comments and tweaks in non-rfc status
for 5.5 months.
Kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container
libfuse:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-cache
e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim
fstests:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs
--Darrick
Unreviewed patches:
[PATCHSET v8 1/8] fuse: general bug fixes
[PATCH 3/4] fuse: update file mode when updating acls
[PATCH 4/4] fuse: propagate default and file acls on creation
[PATCHSET v8 2/8] iomap: cleanups ahead of adding fuse support
[PATCH 2/2] iomap: allow NULL swap info bdev when activating swapfile
[PATCHSET v8 3/8] fuse: cleanups ahead of adding fuse support
[PATCH 1/2] fuse: move the passthrough-specific code back to
[PATCHSET v8 4/8] fuse: allow servers to use iomap for better file IO
[PATCH 01/33] fuse: implement the basic iomap mechanisms
[PATCH 02/33] fuse_trace: implement the basic iomap mechanisms
[PATCH 03/33] fuse: make debugging configurable at runtime
[PATCH 04/33] fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add
[PATCH 05/33] fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to
[PATCH 06/33] fuse: enable SYNCFS and ensure we flush everything
[PATCH 07/33] fuse: clean up per-file type inode initialization
[PATCH 08/33] fuse: create a per-inode flag for setting exclusive
[PATCH 10/33] fuse_trace: create a per-inode flag for toggling iomap
[PATCH 11/33] fuse: isolate the other regular file IO paths from
[PATCH 12/33] fuse: implement basic iomap reporting such as FIEMAP
[PATCH 13/33] fuse_trace: implement basic iomap reporting such as
[PATCH 14/33] fuse: implement direct IO with iomap
[PATCH 15/33] fuse_trace: implement direct IO with iomap
[PATCH 16/33] fuse: implement buffered IO with iomap
[PATCH 17/33] fuse_trace: implement buffered IO with iomap
[PATCH 18/33] fuse: use an unrestricted backing device with iomap
[PATCH 20/33] fuse: advertise support for iomap
[PATCH 21/33] fuse: query filesystem geometry when using iomap
[PATCH 22/33] fuse_trace: query filesystem geometry when using iomap
[PATCH 23/33] fuse: implement fadvise for iomap files
[PATCH 24/33] fuse: invalidate ranges of block devices being used for
[PATCH 25/33] fuse_trace: invalidate ranges of block devices being
[PATCH 26/33] fuse: implement inline data file IO via iomap
[PATCH 27/33] fuse_trace: implement inline data file IO via iomap
[PATCH 28/33] fuse: allow more statx fields
[PATCH 29/33] fuse: support atomic writes with iomap
[PATCH 30/33] fuse_trace: support atomic writes with iomap
[PATCH 31/33] fuse: disable direct fs reclaim for any fuse server
[PATCH 32/33] fuse: enable swapfile activation on iomap
[PATCH 33/33] fuse: implement freeze and shutdowns for iomap
[PATCHSET v8 5/8] fuse: allow servers to specify root node id
[PATCH 1/3] fuse: make the root nodeid dynamic
[PATCH 2/3] fuse_trace: make the root nodeid dynamic
[PATCH 3/3] fuse: allow setting of root nodeid
[PATCHSET v8 6/8] fuse: handle timestamps and ACLs correctly when
[PATCH 1/9] fuse: enable caching of timestamps
[PATCH 2/9] fuse: force a ctime update after a fileattr_set call when
[PATCH 3/9] fuse: allow local filesystems to set some VFS iflags
[PATCH 4/9] fuse_trace: allow local filesystems to set some VFS
[PATCH 5/9] fuse: cache atime when in iomap mode
[PATCH 6/9] fuse: let the kernel handle KILL_SUID/KILL_SGID for iomap
[PATCH 7/9] fuse_trace: let the kernel handle KILL_SUID/KILL_SGID for
[PATCH 8/9] fuse: update ctime when updating acls on an iomap inode
[PATCH 9/9] fuse: always cache ACLs when using iomap
[PATCHSET v8 7/8] fuse: cache iomap mappings for even better file IO
[PATCH 01/12] fuse: cache iomaps
[PATCH 02/12] fuse_trace: cache iomaps
[PATCH 03/12] fuse: use the iomap cache for iomap_begin
[PATCH 04/12] fuse_trace: use the iomap cache for iomap_begin
[PATCH 05/12] fuse: invalidate iomap cache after file updates
[PATCH 06/12] fuse_trace: invalidate iomap cache after file updates
[PATCH 07/12] fuse: enable iomap cache management
[PATCH 08/12] fuse_trace: enable iomap cache management
[PATCH 09/12] fuse: overlay iomap inode info in struct fuse_inode
[PATCH 10/12] fuse: constrain iomap mapping cache size
[PATCH 11/12] fuse_trace: constrain iomap mapping cache size
[PATCH 12/12] fuse: enable iomap
[PATCHSET v8 8/8] fuse: run fuse servers as a contained service
[PATCH 1/2] fuse: allow privileged mount helpers to pre-approve iomap
[PATCH 2/2] fuse: set iomap backing device block size
[PATCHSET v8 1/6] libfuse: allow servers to use iomap for better file
[PATCH 01/25] libfuse: bump kernel and library ABI versions
[PATCH 02/25] libfuse: wait in do_destroy until all open files are
[PATCH 03/25] libfuse: add kernel gates for FUSE_IOMAP
[PATCH 04/25] libfuse: add fuse commands for iomap_begin and end
[PATCH 05/25] libfuse: add upper level iomap commands
[PATCH 06/25] libfuse: add a lowlevel notification to add a new
[PATCH 07/25] libfuse: add upper-level iomap add device function
[PATCH 08/25] libfuse: add iomap ioend low level handler
[PATCH 09/25] libfuse: add upper level iomap ioend commands
[PATCH 10/25] libfuse: add a reply function to send FUSE_ATTR_* to
[PATCH 11/25] libfuse: connect high level fuse library to
[PATCH 12/25] libfuse: support enabling exclusive mode for files
[PATCH 13/25] libfuse: support direct I/O through iomap
[PATCH 14/25] libfuse: don't allow hardlinking of iomap files in the
[PATCH 15/25] libfuse: allow discovery of the kernel's iomap
[PATCH 16/25] libfuse: add lower level iomap_config implementation
[PATCH 17/25] libfuse: add upper level iomap_config implementation
[PATCH 18/25] libfuse: add low level code to invalidate iomap block
[PATCH 19/25] libfuse: add upper-level API to invalidate parts of an
[PATCH 20/25] libfuse: add atomic write support
[PATCH 21/25] libfuse: allow disabling of fs memory reclaim and write
[PATCH 22/25] libfuse: create a helper to transform an open regular
[PATCH 23/25] libfuse: add swapfile support for iomap files
[PATCH 24/25] libfuse: add lower-level filesystem freeze, thaw,
[PATCH 25/25] libfuse: add upper-level filesystem freeze, thaw,
[PATCHSET v8 2/6] libfuse: allow servers to specify root node id
[PATCH 1/1] libfuse: allow root_nodeid mount option
[PATCHSET v8 3/6] libfuse: implement syncfs
[PATCH 1/2] libfuse: add strictatime/lazytime mount options
[PATCH 2/2] libfuse: set sync, immutable,
[PATCHSET v8 4/6] libfuse: add some service helper commands for iomap
[PATCH 1/3] mount_service: delegate iomap privilege from
[PATCH 2/3] libfuse: enable setting iomap block device block size
[PATCH 3/3] mount_service: create loop devices for regular files
[PATCHSET v8 5/6] fuse: add sample iomap fuse servers
[PATCH 1/7] example/iomap_ll: create a simple iomap server
[PATCH 2/7] example/iomap_ll: track block state
[PATCH 3/7] example/iomap_ll: implement atomic writes
[PATCH 4/7] example/iomap_inline_ll: create a simple server to test
[PATCH 5/7] example/iomap_ow_ll: create a simple iomap out of place
[PATCH 6/7] example/iomap_ow_ll: implement atomic writes
[PATCH 7/7] example/iomap_service_ll: create a sample systemd service
[PATCHSET v8 6/6] libfuse: cache iomap mappings for even better file
[PATCH 1/9] libfuse: enable iomap cache management for lowlevel fuse
[PATCH 2/9] libfuse: add upper-level iomap cache management
[PATCH 3/9] libfuse: allow constraining of iomap mapping cache size
[PATCH 4/9] libfuse: add upper-level iomap mapping cache constraint
[PATCH 5/9] libfuse: enable iomap
[PATCH 6/9] example/iomap_ll: cache mappings for later
[PATCH 7/9] example/iomap_inline_ll: cache iomappings in the kernel
[PATCH 8/9] example/iomap_ow_ll: cache iomappings in the kernel
[PATCH 9/9] example/iomap_service_ll: cache iomappings in the kernel
[PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support
[PATCH 1/5] libext2fs: invalidate cached blocks when freeing them
[PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte
[PATCH 3/5] libext2fs: allow unix_write_byte when the write would be
[PATCH 4/5] libext2fs: allow clients to ask to write full superblocks
[PATCH 5/5] libext2fs: allow callers to disallow I/O to file data
[PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file
[PATCH 01/19] fuse2fs: implement bare minimum iomap for file mapping
[PATCH 02/19] fuse2fs: add iomap= mount option
[PATCH 03/19] fuse2fs: implement iomap configuration
[PATCH 04/19] fuse2fs: register block devices for use with iomap
[PATCH 05/19] fuse2fs: implement directio file reads
[PATCH 06/19] fuse2fs: add extent dump function for debugging
[PATCH 07/19] fuse2fs: implement direct write support
[PATCH 08/19] fuse2fs: turn on iomap for pagecache IO
[PATCH 09/19] fuse2fs: don't zero bytes in punch hole
[PATCH 10/19] fuse2fs: don't do file data block IO when iomap is
[PATCH 11/19] fuse2fs: try to create loop device when ext4 device is
[PATCH 12/19] fuse2fs: enable file IO to inline data files
[PATCH 13/19] fuse2fs: set iomap-related inode flags
[PATCH 14/19] fuse2fs: configure block device block size
[PATCH 15/19] fuse4fs: separate invalidation
[PATCH 16/19] fuse2fs: implement statx
[PATCH 17/19] fuse2fs: enable atomic writes
[PATCH 18/19] fuse4fs: disable fs reclaim and write throttling
[PATCH 19/19] fuse2fs: implement freeze and shutdown requests
[PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services
[PATCH 1/3] fuse4fs: configure iomap when running as a service
[PATCH 2/3] fuse4fs: set iomap backing device blocksize
[PATCH 3/3] fuse4fs: ask for loop devices when opening via
[PATCHSET v8 4/6] fuse4fs: specify the root node id
[PATCH 1/1] fuse4fs: don't use inode number translation when possible
[PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when
[PATCH 01/10] fuse2fs: add strictatime/lazytime mount options
[PATCH 02/10] fuse2fs: skip permission checking on utimens when iomap
[PATCH 03/10] fuse2fs: let the kernel tell us about acl/mode updates
[PATCH 04/10] fuse2fs: better debugging for file mode updates
[PATCH 05/10] fuse2fs: debug timestamp updates
[PATCH 06/10] fuse2fs: use coarse timestamps for iomap mode
[PATCH 07/10] fuse2fs: add tracing for retrieving timestamps
[PATCH 08/10] fuse2fs: enable syncfs
[PATCH 09/10] fuse2fs: set sync, immutable,
[PATCH 10/10] fuse4fs: increase attribute timeout in iomap mode
[PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file
[PATCH 1/4] fuse2fs: enable caching of iomaps
[PATCH 2/4] fuse2fs: constrain iomap mapping cache size
[PATCH 3/4] fuse4fs: upsert first file mapping to kernel on open
[PATCH 4/4] fuse2fs: enable iomap
* [PATCHSET v8 1/6] libext2fs: refactoring for fuse2fs iomap support
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
In preparation for connecting fuse, iomap, and fuse2fs for a much more
performant file IO path, make some changes to the Unix IO manager in
libext2fs so that we can have better IO. First we start by making
filesystem flushes a lot more efficient by eliding fsyncs when they're
not necessary, and allowing library clients to turn off the racy code
that writes the superblock byte by byte but exposes stale checksums.
XXX: The second part of this series adds IO tagging so that we could tag
IOs by inode number to distinguish file data blocks in cache from
everything else. This is temporary scaffolding whilst we're in the
middle of adding directio and later buffered writes. Once we can use the
pagecache for all file IO activity I think we could drop the back half
of this series.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-iomap-prep
---
Commits in this patchset:
* libext2fs: invalidate cached blocks when freeing them
* libext2fs: only flush affected blocks in unix_write_byte
* libext2fs: allow unix_write_byte when the write would be aligned
* libext2fs: allow clients to ask to write full superblocks
* libext2fs: allow callers to disallow I/O to file data blocks
---
lib/ext2fs/ext2_io.h | 8 ++++++-
lib/ext2fs/ext2fs.h | 4 +++
debian/libext2fs2t64.symbols | 1 +
lib/ext2fs/alloc_stats.c | 6 +++++
lib/ext2fs/closefs.c | 7 ++++++
lib/ext2fs/fileio.c | 12 +++++++++-
lib/ext2fs/io_manager.c | 9 +++++++
lib/ext2fs/unix_io.c | 51 ++++++++++++++++++++++++++++++++++++++++--
8 files changed, 93 insertions(+), 5 deletions(-)
* [PATCHSET v8 2/6] fuse2fs: use fuse iomap data paths for better file I/O performance
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection. For local
filesystems, all we have to do is respond to requests for file to device
mappings; the rest of the IO hot path stays within the kernel. This
means that we can get rid of all file data block processing within
fuse2fs.
Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous. Yes, mapping lookups for writeback can
stall, but mappings are small as compared to data and this situation
exists for all kernel filesystems as well.
The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work
too. The kernel ext4 driver can manage about 1.6GB/s for pagecache IO
and about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about as
fast as the kernel for streaming file IO.
Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s. The kernel
can do 900-1300MB/s. Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s. I suspect that metadata heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes. We also probably
need iomap caching really badly.
These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance. It contains a single
Big Filesystem Lock which nukes multi-threaded scalability. There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS. Sad!
Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance. We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.
iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.
However, there are some major warts remaining:
1. The iomap cookie validation is not present, which can lead to subtle
races between pagecache zeroing and writeback on filesystems that
support unwritten and delalloc mappings.
2. Mappings ought to be cached in the kernel for more speed.
3. iomap doesn't support things like fscrypt or fsverity, and I haven't
yet figured out how inline data is supposed to work.
4. I would like to be able to turn on fuse+iomap on a per-inode basis,
which currently isn't possible because the kernel fuse driver will iget
inodes prior to calling FUSE_GETATTR to discover the properties of the
inode it just read.
5. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
6. iomap is an inode-based service, not a file-based service. This
means that we /must/ push ext2's inode numbers into the kernel via
FUSE_GETATTR so that it can report those same numbers back out through
the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid
to index its incore inode, so we have to pass those too so that
notifications work properly.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-fileio
---
Commits in this patchset:
* fuse2fs: implement bare minimum iomap for file mapping reporting
* fuse2fs: add iomap= mount option
* fuse2fs: implement iomap configuration
* fuse2fs: register block devices for use with iomap
* fuse2fs: implement directio file reads
* fuse2fs: add extent dump function for debugging
* fuse2fs: implement direct write support
* fuse2fs: turn on iomap for pagecache IO
* fuse2fs: don't zero bytes in punch hole
* fuse2fs: don't do file data block IO when iomap is enabled
* fuse2fs: try to create loop device when ext4 device is a regular file
* fuse2fs: enable file IO to inline data files
* fuse2fs: set iomap-related inode flags
* fuse2fs: configure block device block size
* fuse4fs: separate invalidation
* fuse2fs: implement statx
* fuse2fs: enable atomic writes
* fuse4fs: disable fs reclaim and write throttling
* fuse2fs: implement freeze and shutdown requests
---
configure | 90 ++
configure.ac | 54 +
fuse4fs/fuse4fs.1.in | 6
fuse4fs/fuse4fs.c | 1934 +++++++++++++++++++++++++++++++++++++++++++++++++-
lib/config.h.in | 6
misc/fuse2fs.1.in | 6
misc/fuse2fs.c | 1947 ++++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 4016 insertions(+), 27 deletions(-)
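
The cover letter above says that in iomap mode the server only has to answer
requests for file to device mappings. A minimal sketch of what such a lookup
might look like, assuming a sorted, non-overlapping in-memory extent list. All
names here (toy_extent, toy_mapping, toy_map_lookup) are hypothetical and are
not part of fuse2fs, libfuse, or the kernel interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a run of file bytes backed by device bytes. */
struct toy_extent {
	uint64_t file_off;	/* byte offset within the file */
	uint64_t dev_off;	/* byte offset on the backing device */
	uint64_t len;		/* length of the run in bytes */
};

/* Hypothetical reply: either a mapped range or a hole starting at file_off. */
struct toy_mapping {
	uint64_t file_off;
	uint64_t dev_off;	/* only meaningful when hole is false */
	uint64_t len;
	bool hole;
};

/*
 * Look up the mapping that covers file position pos.  The extent array is
 * assumed to be sorted by file_off with no overlaps.
 */
static struct toy_mapping toy_map_lookup(const struct toy_extent *ext,
					 size_t nr, uint64_t pos)
{
	struct toy_mapping m = { pos, 0, 0, true };
	size_t i;

	assert(ext != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		const struct toy_extent *e = &ext[i];

		if (pos < e->file_off) {
			/* pos sits in a hole that ends where this extent starts */
			m.len = e->file_off - pos;
			return m;
		}
		if (pos - e->file_off < e->len) {
			/* pos is inside this extent; map to the extent's end */
			m.hole = false;
			m.dev_off = e->dev_off + (pos - e->file_off);
			m.len = e->len - (pos - e->file_off);
			return m;
		}
	}

	/* Past the last extent: report a hole running to the maximum offset. */
	m.len = UINT64_MAX - pos;
	return m;
}
```

The point of the sketch is that the reply is small and contains no file data,
so the data itself never has to cross the /dev/fuse connection.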
* [PATCHSET v8 3/6] fuse4fs: adapt iomap for fuse services
From: Darrick J. Wong @ 2026-04-29 14:20 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
This series adapts the iomap code to work in systemd service mode.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-iomap-service
---
Commits in this patchset:
* fuse4fs: configure iomap when running as a service
* fuse4fs: set iomap backing device blocksize
* fuse4fs: ask for loop devices when opening via fuservicemount
---
fuse4fs/fuse4fs.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 83 insertions(+), 11 deletions(-)
* [PATCHSET v8 4/6] fuse4fs: specify the root node id
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
This series adapts fuse4fs to have a 1:1 mapping of ext2_ino_t to fuse_ino_t
for slightly better performance and less confusing code interpretation.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-root-nodeid
---
Commits in this patchset:
* fuse4fs: don't use inode number translation when possible
---
fuse4fs/fuse4fs.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
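
For context on the 1:1 mapping described above: the default FUSE root nodeid
is 1, while the ext2 root directory is inode 2, so a server that cannot choose
its root nodeid has to translate between the two on every request. The
following is only a sketch of that idea; the helper names and the exact
translation rule are assumptions for illustration and are not taken from
fuse4fs:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EXT2_ROOT_INO	2	/* ext2 root directory inode */

/*
 * Convert an ext2 inode number to a fuse nodeid.  When the server was able
 * to set root_nodeid to the ext2 root inode, this is the identity function.
 */
static uint64_t toy_ino_to_nodeid(uint32_t ino, uint64_t root_nodeid)
{
	assert(ino != 0);

	if (root_nodeid != TOY_EXT2_ROOT_INO && ino == TOY_EXT2_ROOT_INO)
		return root_nodeid;
	return ino;
}

/* Convert a fuse nodeid back to an ext2 inode number. */
static uint32_t toy_nodeid_to_ino(uint64_t nodeid, uint64_t root_nodeid)
{
	assert(nodeid != 0);

	if (root_nodeid != TOY_EXT2_ROOT_INO && nodeid == root_nodeid)
		return TOY_EXT2_ROOT_INO;
	return (uint32_t)nodeid;
}
```

With root_nodeid set to 2, both helpers pass every value through unchanged,
which is the "no translation" case the series is after.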
* [PATCHSET v8 5/6] fuse2fs: handle timestamps and ACLs correctly when iomap is enabled
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
When iomap is enabled for a fuse file, we try to keep as much of the
file IO path in the kernel as we possibly can. That means no calling
out to the fuse server in the IO path when we can avoid it. However,
the existing FUSE architecture defers all file attributes to the fuse
server -- [cm]time updates, ACL metadata management, set[ug]id removal,
and permissions checking thereof, etc.
We'd really rather do all these attribute updates in the kernel, and
only push them to the fuse server when it's actually necessary (e.g.
fsync). Furthermore, the POSIX ACL code has the weird behavior that if
the access ACL can be represented entirely by i_mode bits, it will
change the mode and delete the ACL, which fuse servers generally don't
seem to implement.
IOWs, we want consistent and correct (as defined by fstests) behavior
of file attributes in iomap mode. Let's make the kernel manage all that
and push the results to userspace as needed. This improves performance
even further, since it's sort of like writeback_cache mode but more
aggressive.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-attrs
---
Commits in this patchset:
* fuse2fs: add strictatime/lazytime mount options
* fuse2fs: skip permission checking on utimens when iomap is enabled
* fuse2fs: let the kernel tell us about acl/mode updates
* fuse2fs: better debugging for file mode updates
* fuse2fs: debug timestamp updates
* fuse2fs: use coarse timestamps for iomap mode
* fuse2fs: add tracing for retrieving timestamps
* fuse2fs: enable syncfs
* fuse2fs: set sync, immutable, and append at file load time
* fuse4fs: increase attribute timeout in iomap mode
---
fuse4fs/fuse4fs.1.in | 6 +
fuse4fs/fuse4fs.c | 226 ++++++++++++++++++++++++++++++----------
misc/fuse2fs.1.in | 6 +
misc/fuse2fs.c | 282 +++++++++++++++++++++++++++++++++++++-------------
4 files changed, 389 insertions(+), 131 deletions(-)
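
The ACL behavior mentioned above (an access ACL that can be represented
entirely by mode bits is folded into the mode and then deleted) can be
illustrated with a toy model. This is a simplified sketch under assumed types;
the kernel's real logic handles more cases (mask entries in particular), and
none of the names below come from the kernel or from fuse2fs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified ACL entry tags. */
enum toy_acl_tag {
	TOY_USER_OBJ,	/* owning user */
	TOY_USER,	/* named user */
	TOY_GROUP_OBJ,	/* owning group */
	TOY_GROUP,	/* named group */
	TOY_MASK,
	TOY_OTHER,
};

struct toy_acl_entry {
	enum toy_acl_tag tag;
	unsigned int perm;	/* rwx bits, 0-7 */
};

/*
 * Return true if the ACL carries nothing beyond owner/group/other
 * permissions, in which case *mode receives the equivalent mode bits and
 * the caller could drop the ACL.  *mode is left untouched otherwise.
 */
static bool toy_acl_equiv_mode(const struct toy_acl_entry *acl, size_t nr,
			       unsigned int *mode)
{
	unsigned int m = 0;
	size_t i;

	assert(mode != NULL);

	for (i = 0; i < nr; i++) {
		switch (acl[i].tag) {
		case TOY_USER_OBJ:
			m |= (acl[i].perm & 7) << 6;
			break;
		case TOY_GROUP_OBJ:
			m |= (acl[i].perm & 7) << 3;
			break;
		case TOY_OTHER:
			m |= acl[i].perm & 7;
			break;
		default:
			/* named users/groups or a mask need a real ACL */
			return false;
		}
	}

	*mode = m;
	return true;
}
```

An ACL of user::rw-, group::r--, other::r-- collapses to mode 0644 in this
model; adding a single named-user entry makes it non-representable.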
* [PATCHSET v8 6/6] fuse2fs: cache iomap mappings for even better file IO performance
From: Darrick J. Wong @ 2026-04-29 14:21 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <20260429141253.GQ7739@frogsfrogsfrogs>
Hi all,
This series improves the performance (and correctness for some
filesystems) by adding the ability to cache iomap mappings in the
kernel. For filesystems that can change mapping states during pagecache
writeback (e.g. unwritten extent conversion) this is absolutely
necessary to deal with races with writes to the pagecache because
writeback does not take i_rwsem. For everyone else, it simply
eliminates roundtrips to userspace.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
Comments and questions are, as always, welcome.
e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap-cache
---
Commits in this patchset:
* fuse2fs: enable caching of iomaps
* fuse2fs: constrain iomap mapping cache size
* fuse4fs: upsert first file mapping to kernel on open
* fuse2fs: enable iomap
---
fuse4fs/fuse4fs.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++-----
misc/fuse2fs.c | 38 ++++++++++++++++++++++-----
2 files changed, 101 insertions(+), 13 deletions(-)
* [PATCH 1/5] libext2fs: invalidate cached blocks when freeing them
From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <177747214224.4107228.16300103064218258692.stgit@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
When we're freeing blocks, we should tell the IO manager to drop them
from any cache it might be maintaining to improve performance.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2_io.h | 8 +++++++-
debian/libext2fs2t64.symbols | 1 +
lib/ext2fs/alloc_stats.c | 6 ++++++
lib/ext2fs/io_manager.c | 9 +++++++++
lib/ext2fs/unix_io.c | 35 +++++++++++++++++++++++++++++++++++
5 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/ext2_io.h b/lib/ext2fs/ext2_io.h
index c880ea2524f248..0148492caf63b6 100644
--- a/lib/ext2fs/ext2_io.h
+++ b/lib/ext2fs/ext2_io.h
@@ -104,7 +104,10 @@ struct struct_io_manager {
unsigned long long count);
errcode_t (*flock)(io_channel channel, unsigned int flock_flags);
errcode_t (*get_fd)(io_channel channel, int *fd);
- long reserved[12];
+ errcode_t (*invalidate_blocks)(io_channel channel,
+ unsigned long long block,
+ unsigned long long count);
+ long reserved[11];
};
#define IO_FLAG_RW 0x0001
@@ -157,6 +160,9 @@ extern errcode_t io_channel_cache_readahead(io_channel io,
extern errcode_t io_channel_flock(io_channel io, unsigned int flock_flags);
extern errcode_t io_channel_funlock(io_channel io);
extern errcode_t io_channel_get_fd(io_channel io, int *fd);
+extern errcode_t io_channel_invalidate_blocks(io_channel io,
+ unsigned long long block,
+ unsigned long long count);
#ifdef _WIN32
/* windows_io.c */
diff --git a/debian/libext2fs2t64.symbols b/debian/libext2fs2t64.symbols
index 555fbbb0c98878..b19a362967f00e 100644
--- a/debian/libext2fs2t64.symbols
+++ b/debian/libext2fs2t64.symbols
@@ -702,6 +702,7 @@ libext2fs.so.2 libext2fs2t64 #MINVER#
io_channel_flock@Base 1.47.99
io_channel_funlock@Base 1.47.99
io_channel_get_fd@Base 1.47.99
+ io_channel_invalidate_blocks@Base 1.47.99
io_channel_read_blk64@Base 1.41.1
io_channel_set_options@Base 1.37
io_channel_write_blk64@Base 1.41.1
diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
index 95a6438f252e0f..68bbe6807a8ed3 100644
--- a/lib/ext2fs/alloc_stats.c
+++ b/lib/ext2fs/alloc_stats.c
@@ -82,6 +82,9 @@ void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse)
-inuse * (blk64_t) EXT2FS_CLUSTER_RATIO(fs));
ext2fs_mark_super_dirty(fs);
ext2fs_mark_bb_dirty(fs);
+ if (inuse < 0)
+ io_channel_invalidate_blocks(fs->io, blk,
+ EXT2FS_CLUSTER_RATIO(fs));
if (fs->block_alloc_stats)
(fs->block_alloc_stats)(fs, (blk64_t) blk, inuse);
}
@@ -144,11 +147,14 @@ void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
ext2fs_group_desc_csum_set(fs, group);
ext2fs_free_blocks_count_add(fs->super, -inuse * (blk64_t) n);
+
blk += n;
num -= n;
}
ext2fs_mark_super_dirty(fs);
ext2fs_mark_bb_dirty(fs);
+ if (inuse < 0)
+ io_channel_invalidate_blocks(fs->io, orig_blk, orig_num);
if (fs->block_alloc_stats_range)
(fs->block_alloc_stats_range)(fs, orig_blk, orig_num, inuse);
}
diff --git a/lib/ext2fs/io_manager.c b/lib/ext2fs/io_manager.c
index dff3d73552827f..a92dba7b9dc880 100644
--- a/lib/ext2fs/io_manager.c
+++ b/lib/ext2fs/io_manager.c
@@ -174,3 +174,12 @@ errcode_t io_channel_get_fd(io_channel io, int *fd)
return io->manager->get_fd(io, fd);
}
+
+errcode_t io_channel_invalidate_blocks(io_channel io, unsigned long long block,
+ unsigned long long count)
+{
+ if (!io->manager->invalidate_blocks)
+ return EXT2_ET_OP_NOT_SUPPORTED;
+
+ return io->manager->invalidate_blocks(io, block, count);
+}
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 567bbd9493f7f1..54bd4b5597ea9e 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -672,6 +672,25 @@ static errcode_t reuse_cache(io_channel channel,
#define FLUSH_INVALIDATE 0x01
#define FLUSH_NOLOCK 0x02
+/* Remove blocks from the cache. Dirty contents are discarded. */
+static void invalidate_cached_blocks(io_channel channel,
+ struct unix_private_data *data,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct unix_cache *cache;
+ int i;
+
+ mutex_lock(data, CACHE_MTX);
+ for (i = 0, cache = data->cache; i < data->cache_size; i++, cache++) {
+ if (!cache->in_use || cache->block < block ||
+ cache->block >= block + count)
+ continue;
+ cache->in_use = 0;
+ }
+ mutex_unlock(data, CACHE_MTX);
+}
+
/*
* Flush all of the blocks in the cache
*/
@@ -1832,6 +1851,20 @@ static errcode_t unix_get_fd(io_channel channel, int *fd)
return 0;
}
+static errcode_t unix_invalidate_blocks(io_channel channel,
+ unsigned long long block,
+ unsigned long long count)
+{
+ struct unix_private_data *data;
+
+ EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+ data = (struct unix_private_data *) channel->private_data;
+ EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+ invalidate_cached_blocks(channel, data, block, count);
+ return 0;
+}
+
#if __GNUC_PREREQ (4, 6)
#pragma GCC diagnostic pop
#endif
@@ -1855,6 +1888,7 @@ static struct struct_io_manager struct_unix_manager = {
.zeroout = unix_zeroout,
.flock = unix_flock,
.get_fd = unix_get_fd,
+ .invalidate_blocks = unix_invalidate_blocks,
};
io_manager unix_io_manager = &struct_unix_manager;
@@ -1878,6 +1912,7 @@ static struct struct_io_manager struct_unixfd_manager = {
.zeroout = unix_zeroout,
.flock = unix_flock,
.get_fd = unix_get_fd,
+ .invalidate_blocks = unix_invalidate_blocks,
};
io_manager unixfd_io_manager = &struct_unixfd_manager;
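
The range check used by the invalidation helper in this patch can be tried in
isolation. Below is a standalone sketch with a toy cache array; the struct and
function names are hypothetical stand-ins rather than the libext2fs types, and
the locking that the real code does around the loop is left out:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one slot of an IO manager's block cache. */
struct toy_cache_slot {
	unsigned long long block;	/* block number held by this slot */
	int in_use;			/* nonzero if the slot holds valid data */
};

/*
 * Drop every cached block in [block, block + count).  Dirty contents are
 * simply discarded, which is acceptable for blocks that were just freed.
 * Returns the number of slots that were invalidated.
 */
static size_t toy_invalidate_range(struct toy_cache_slot *slots, size_t nr,
				   unsigned long long block,
				   unsigned long long count)
{
	size_t i, dropped = 0;

	assert(slots != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!slots[i].in_use)
			continue;
		if (slots[i].block < block || slots[i].block >= block + count)
			continue;
		slots[i].in_use = 0;
		dropped++;
	}
	return dropped;
}
```

The returned count exists only to make the sketch easy to check; the helper
in the patch returns nothing.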
* [PATCH 2/5] libext2fs: only flush affected blocks in unix_write_byte
From: Darrick J. Wong @ 2026-04-29 14:51 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <177747214224.4107228.16300103064218258692.stgit@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
There's no need to invalidate the entire cache when writing a range of
bytes to the device. The only ones we need to invalidate are the ones
that we're writing separately.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 54bd4b5597ea9e..35c42c35f735a3 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1588,6 +1588,7 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
{
struct unix_private_data *data;
errcode_t retval = 0;
+ unsigned long long bno, nbno;
ssize_t actual;
EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
@@ -1603,10 +1604,17 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
#ifndef NO_IO_CACHE
/*
- * Flush out the cache completely
+ * Flush all the dirty blocks, then invalidate the blocks we're about
+ * to write.
*/
- if ((retval = flush_cached_blocks(channel, data, FLUSH_INVALIDATE)))
+ retval = flush_cached_blocks(channel, data, 0);
+ if (retval)
return retval;
+
+ bno = offset / channel->block_size;
+ nbno = (offset + size + channel->block_size - 1) / channel->block_size;
+
+ invalidate_cached_blocks(channel, data, bno, nbno - bno);
#endif
if (lseek(data->dev, offset + data->offset, SEEK_SET) < 0)
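
The block range arithmetic in this patch is worth a worked example. The sketch
below repeats the same computation (round the start down, round the end up) in
a standalone helper; the function and struct names are hypothetical and exist
only for this illustration:

```c
#include <assert.h>

struct toy_block_range {
	unsigned long long first;	/* first block touched by the write */
	unsigned long long count;	/* number of blocks touched */
};

/*
 * Convert a byte range into the range of blocks it overlaps.  The start is
 * rounded down to a block boundary and the end is rounded up.
 */
static struct toy_block_range toy_bytes_to_blocks(unsigned long long offset,
						  unsigned long long size,
						  unsigned int block_size)
{
	struct toy_block_range r;
	unsigned long long end;

	assert(block_size > 0);

	r.first = offset / block_size;
	end = (offset + size + block_size - 1) / block_size;
	r.count = end - r.first;
	return r;
}
```

With 4096-byte blocks, a 1024-byte write at offset 1024 touches only block 0,
while a 200-byte write at offset 4000 straddles a boundary and touches blocks
0 and 1. Only those blocks need to leave the cache.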
* [PATCH 3/5] libext2fs: allow unix_write_byte when the write would be aligned
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <177747214224.4107228.16300103064218258692.stgit@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
If someone calls write_byte on an IO channel with an alignment
requirement and the range to be written is aligned correctly, go ahead
and do the write. This will be needed later when we try to speed up
superblock writes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/unix_io.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/lib/ext2fs/unix_io.c b/lib/ext2fs/unix_io.c
index 35c42c35f735a3..ea8ee56b7d5163 100644
--- a/lib/ext2fs/unix_io.c
+++ b/lib/ext2fs/unix_io.c
@@ -1599,7 +1599,9 @@ static errcode_t unix_write_byte(io_channel channel, unsigned long offset,
#ifdef ALIGN_DEBUG
printf("unix_write_byte: O_DIRECT fallback\n");
#endif
- return EXT2_ET_UNIMPLEMENTED;
+ if (!IS_ALIGNED(data->offset + offset, channel->align) ||
+ !IS_ALIGNED(data->offset + offset + size, channel->align))
+ return EXT2_ET_UNIMPLEMENTED;
}
#ifndef NO_IO_CACHE
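
The alignment test added by this patch checks both ends of the byte range
against the channel's alignment requirement. A standalone sketch of that
check follows, using a plain modulo instead of the library's IS_ALIGNED macro;
the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* True if value is a multiple of align. */
static bool toy_is_aligned(unsigned long long value, unsigned int align)
{
	assert(align > 0);
	return value % align == 0;
}

/*
 * True if the device byte range [dev_offset + offset,
 * dev_offset + offset + size) starts and ends on an alignment boundary.
 */
static bool toy_range_is_aligned(unsigned long long dev_offset,
				 unsigned long long offset,
				 unsigned long long size,
				 unsigned int align)
{
	return toy_is_aligned(dev_offset + offset, align) &&
	       toy_is_aligned(dev_offset + offset + size, align);
}
```

For example, with a 512-byte alignment a 1024-byte write at offset 1024
passes, whereas a 100-byte write at the same offset fails because its end is
not aligned.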
* [PATCH 4/5] libext2fs: allow clients to ask to write full superblocks
From: Darrick J. Wong @ 2026-04-29 14:52 UTC (permalink / raw)
To: tytso
Cc: bernd, miklos, linux-ext4, neal, linux-fsdevel, fuse-devel,
joannelkoong
In-Reply-To: <177747214224.4107228.16300103064218258692.stgit@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
write_primary_superblock currently does this weird dance where it will
try to write only the dirty bytes of the primary superblock to disk. In
theory, this is done so that tune2fs can incrementally update superblock
bytes when the filesystem is mounted; ext2 was famous for allowing this
dance to set new fs parameters and have them take effect in real
time.
The ability to do this safely was obliterated back in 2001 when ext3 was
introduced with journalling, because tune2fs has no way to know if the
journal has already logged an updated primary superblock but not yet
written it to disk, which means that they can race to write, and changes
can be lost.
This (non-)safety was further obliterated back in 2012 when I added
checksums to all the metadata blocks in ext4 because anyone else with
the block device open can see the primary superblock in an intermediate
state where the checksum does not match the superblock contents.
At this point in 2025 it's kind of stupid for fuse2fs to be doing this
because you can't have the kernel and fuse2fs mount the same filesystem
at the same time. It also makes fuse2fs op_fsync slow because libext2fs
performs a bunch of small writes and introduces extra fsyncs.
So, add a new flag to ask for full superblock writes, which fuse2fs will
use later.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
lib/ext2fs/ext2fs.h | 1 +
lib/ext2fs/closefs.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 02c3cbcea92482..8fad4c4011dd5a 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -220,6 +220,7 @@ typedef struct ext2_file *ext2_file_t;
#define EXT2_FLAG_IBITMAP_TAIL_PROBLEM 0x2000000
#define EXT2_FLAG_THREADS 0x4000000
#define EXT2_FLAG_IGNORE_SWAP_DIRENT 0x8000000
+#define EXT2_FLAG_WRITE_FULL_SUPER 0x10000000
/*
* Internal flags for use by the ext2fs library only
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 8e5bec03a050de..9a67db76e7b326 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -196,6 +196,13 @@ static errcode_t write_primary_superblock(ext2_filsys fs,
int check_idx, write_idx, size;
errcode_t retval;
+ if (fs->flags & EXT2_FLAG_WRITE_FULL_SUPER) {
+ retval = io_channel_write_byte(fs->io, SUPERBLOCK_OFFSET,
+ SUPERBLOCK_SIZE, super);
+ if (!retval)
+ return 0;
+ }
+
if (!fs->io->manager->write_byte || !fs->orig_super) {
fallback:
io_channel_set_blksize(fs->io, SUPERBLOCK_OFFSET);