* Re: [PATCH v4] ext2: Remove deprecated DAX support
From: kernel test robot @ 2026-05-28 4:16 UTC
To: Ashwin Gundarapu, jack; +Cc: oe-kbuild-all, linux-ext4, linux-kernel
In-Reply-To: <19e5aa07c9b.3a2e576d130187.5289857983023045470@zohomail.in>
Hi Ashwin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on jack-fs/for_next]
[also build test WARNING on linus/master v7.1-rc5 next-20260527]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ashwin-Gundarapu/ext2-Remove-deprecated-DAX-support/20260524-233631
base: https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_next
patch link: https://lore.kernel.org/r/19e5aa07c9b.3a2e576d130187.5289857983023045470%40zohomail.in
patch subject: [PATCH v4] ext2: Remove deprecated DAX support
config: arm-randconfig-r071-20260528 (https://download.01.org/0day-ci/archive/20260528/202605281203.e91xvDyr-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
smatch: v0.5.0-9185-gbcc58b9c
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605281203.e91xvDyr-lkp@intel.com/
smatch warnings:
fs/ext2/inode.c:1251 ext2_setsize() warn: inconsistent indenting
vim +1251 fs/ext2/inode.c
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1236
2c27c65ed0696f0 Christoph Hellwig 2010-06-04 1237 static int ext2_setsize(struct inode *inode, loff_t newsize)
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1238 {
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1239 int error;
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1240
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1241 if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1242 S_ISLNK(inode->i_mode)))
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1243 return -EINVAL;
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1244 if (ext2_inode_is_fast_symlink(inode))
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1245 return -EINVAL;
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1246 if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1247 return -EPERM;
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1248
562c72aa57c36b1 Christoph Hellwig 2011-06-24 1249 inode_dio_wait(inode);
562c72aa57c36b1 Christoph Hellwig 2011-06-24 1250
737f2e93b9724a3 Nicholas Piggin 2010-05-27 @1251 error = block_truncate_page(inode->i_mapping,
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1252 newsize, ext2_get_block);
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1253 if (error)
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1254 return error;
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1255
70f3bad8c3154ba Jan Kara 2021-04-12 1256 filemap_invalidate_lock(inode->i_mapping);
2c27c65ed0696f0 Christoph Hellwig 2010-06-04 1257 truncate_setsize(inode, newsize);
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1258 __ext2_truncate_blocks(inode, newsize);
70f3bad8c3154ba Jan Kara 2021-04-12 1259 filemap_invalidate_unlock(inode->i_mapping);
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1260
5cdc59fce617a2e Jeff Layton 2023-10-04 1261 inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
^1da177e4c3f415 Linus Torvalds 2005-04-16 1262 if (inode_needs_sync(inode)) {
b0439bbc29f0201 Jan Kara 2026-03-26 1263 mmb_sync(&EXT2_I(inode)->i_metadata_bhs);
c37650161a53c01 Christoph Hellwig 2010-10-06 1264 sync_inode_metadata(inode, 1);
^1da177e4c3f415 Linus Torvalds 2005-04-16 1265 } else {
^1da177e4c3f415 Linus Torvalds 2005-04-16 1266 mark_inode_dirty(inode);
^1da177e4c3f415 Linus Torvalds 2005-04-16 1267 }
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1268
737f2e93b9724a3 Nicholas Piggin 2010-05-27 1269 return 0;
^1da177e4c3f415 Linus Torvalds 2005-04-16 1270 }
^1da177e4c3f415 Linus Torvalds 2005-04-16 1271
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH v6 11/11] fstests: test UUID consistency for clones with metadata_uuid
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Btrfs and XFS use the metadata_uuid superblock feature to change the
on-disk UUID without rewriting every block header. This patch adds a
sanity check to ensure UUID consistency when a filesystem with
metadata_uuid enabled is cloned.
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/806 | 84 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/806.out | 19 ++++++++++
2 files changed, 103 insertions(+)
create mode 100644 tests/generic/806
create mode 100644 tests/generic/806.out
diff --git a/tests/generic/806 b/tests/generic/806
new file mode 100644
index 000000000000..801671fb9ce9
--- /dev/null
+++ b/tests/generic/806
@@ -0,0 +1,84 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test 806
+#
+# Verify that the cloned filesystem UUID remains consistent, even when the
+# `metadata_uuid` feature is enabled.
+#
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+ umount $mnt1 $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+filter_pool()
+{
+ sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+ -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Collect and print device resolution properties across user-space tools
+print_info()
+{
+ local mntpt=$1
+ local tgt=$(findmnt -no SOURCE $mntpt)
+ local fsuuid=$(blkid -s UUID -o value $tgt)
+
+ echo "mntpt=$mntpt tgt=$tgt fsuuid=$fsuuid" >> $seqres.full
+ echo
+ findmnt -o SOURCE,TARGET,UUID "$tgt" | tail -n +2 | \
+ sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+ awk -v dev="$tgt" '$1 == dev { print $1, $2 }' /proc/self/mounts | \
+ filter_pool
+ df --all --output=source,target "$tgt" | tail -n +2 | filter_pool
+}
+
+# Create base loop device and its clone, applying the metadata_uuid tuning
+# callback to the base filesystem before the copy occurs.
+devs=()
+_loop_image_create_clone devs _change_metadata_uuid
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both clone and baseline
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+print_info $mnt1
+print_info $mnt2
+
+# Cycle mounts and reverse the initialization order to ensure UUID tracking
+# doesn't mismatch or flip when metadata_uuid optimization is active.
+echo
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+
+print_info $mnt1
+print_info $mnt2
+
+status=0
+exit
diff --git a/tests/generic/806.out b/tests/generic/806.out
new file mode 100644
index 000000000000..7315e791ba51
--- /dev/null
+++ b/tests/generic/806.out
@@ -0,0 +1,19 @@
+QA output created by 806
+
+DEV1 MNT1 FSUUID
+DEV1 MNT1
+DEV1 MNT1
+
+DEV2 MNT2 FSUUID
+DEV2 MNT2
+DEV2 MNT2
+
+**** mount cycle ****
+
+DEV1 MNT1 FSUUID
+DEV1 MNT1
+DEV1 MNT1
+
+DEV2 MNT2 FSUUID
+DEV2 MNT2
+DEV2 MNT2
--
2.43.0
* [PATCH v6 10/11] fstests: add _change_metadata_uuid helper
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
_change_metadata_uuid changes the UUID of the golden filesystem before it
is cloned.
Signed-off-by: Anand Jain <asj@kernel.org>
---
common/rc | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/common/rc b/common/rc
index 5446552aed92..79be51e4da31 100644
--- a/common/rc
+++ b/common/rc
@@ -1537,6 +1537,29 @@ _scratch_resvblks()
esac
}
+# Change the metadata UUID of the given device to a newly generated one.
+# Args:
+# $1: Block device path to modify.
+_change_metadata_uuid()
+{
+ local temp_mnt=$TEST_DIR/${seq}_mnt
+ local dev=$1
+
+ case $FSTYP in
+ xfs)
+ _require_command "$XFS_ADMIN_PROG" "xfs_admin"
+ $XFS_ADMIN_PROG -U generate $dev >> $seqres.full
+ ;;
+ btrfs)
+ _require_command "$BTRFS_TUNE_PROG" "btrfstune"
+ $BTRFS_TUNE_PROG -m $dev
+ ;;
+ *)
+ _notrun "Require filesystem with metadata_uuid feature"
+ ;;
+ esac
+}
+
# Create a small loop image, run an optional tuning function ($2) on it,
# clone it, and attach both to loop devices, returned in ($1).
# Args:
--
2.43.0
* [PATCH v6 09/11] fstests: verify exportfs file handles on cloned filesystems
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Ensure that exportfs can correctly decode file handles on a cloned
filesystem across a mount cycle, i.e. that file handles generated on a
cloned device remain valid after the mount cycle.
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/805 | 80 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/805.out | 2 ++
2 files changed, 82 insertions(+)
create mode 100644 tests/generic/805
create mode 100644 tests/generic/805.out
diff --git a/tests/generic/805 b/tests/generic/805
new file mode 100644
index 000000000000..5827eee039df
--- /dev/null
+++ b/tests/generic/805
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test No. 805
+# Verify that file handles encoded on a cloned filesystem remain valid and
+# resolvable via open_by_handle across a mount cycle and mount order swap.
+
+. ./common/preamble
+
+_begin_fstest auto quick exportfs clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_exportfs
+_require_loop
+_require_test_program "open_by_handle"
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+ _unmount $mnt1 2>/dev/null
+ _unmount $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Create test dir and test files, encode file handles and store to tmp file
+create_test_files()
+{
+ rm -rf $testdir
+ mkdir -p $testdir
+ $here/src/open_by_handle -cwp -o $tmp.handles_file $testdir $NUMFILES
+}
+
+# Attempt to read and decode the saved file handles on the targeted mount point.
+test_file_handles()
+{
+ local opt=$1
+ local when=$2
+
+ echo test_file_handles after $when
+ $here/src/open_by_handle $opt -i $tmp.handles_file $mnt2 $NUMFILES
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+NUMFILES=1
+testdir=$mnt2/testdir
+
+# Create test files and encode their file handles before the mount cycle
+create_test_files
+
+# Cycle mounts and reverse initialization sequence to check if
+# file handle lookups are okay
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+
+# Verify file handles can still be resolved post-mount-cycle
+test_file_handles -rp "cycle mount"
+
+status=0
+exit
diff --git a/tests/generic/805.out b/tests/generic/805.out
new file mode 100644
index 000000000000..29b11ec77ffb
--- /dev/null
+++ b/tests/generic/805.out
@@ -0,0 +1,2 @@
+QA output created by 805
+test_file_handles after cycle mount
--
2.43.0
* [PATCH v6 08/11] fstests: verify IMA isolation on cloned filesystems
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Add a test case to verify IMA measurement isolation when multiple devices
share the same FSUUID.
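The log check in this test relies on the file path being the fifth whitespace-separated field of each ascii_runtime_measurements entry. Below is a minimal sketch of that extraction on a made-up ima-ng style line. The hashes, the path, and the helper name ima_paths_for are invented for illustration and are not part of the patch.

```shell
# Made-up sample entry: PCR, template hash, template name, file hash, path.
sample='10 0123456789abcdef0123456789abcdef01234567 ima-ng sha256:aa11bb22 /mnt/test/804/mnt1/foobar_Ab3dE'

# Hypothetical helper: read measurement entries on stdin and print the
# path field (field 5) of every entry that matches the name given in $1.
ima_paths_for()
{
	grep -- "$1" | awk '{ print $5 }'
}

echo "$sample" | ima_paths_for foobar_Ab3dE
```

Because the in-memory log only grows until reboot, matching on a unique file name, as the test does with mktemp, keeps entries from earlier runs out of the result.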
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/804 | 108 ++++++++++++++++++++++++++++++++++++++++++
tests/generic/804.out | 10 ++++
2 files changed, 118 insertions(+)
create mode 100644 tests/generic/804
create mode 100644 tests/generic/804.out
diff --git a/tests/generic/804 b/tests/generic/804
new file mode 100644
index 000000000000..31ae77a2f461
--- /dev/null
+++ b/tests/generic/804
@@ -0,0 +1,108 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test 804
+# Verify IMA isolation on cloned filesystems:
+# . Mount two devices sharing the same FSUUID (cloned).
+# . Apply an IMA policy to measure files based on that FSUUID.
+# . Create unique files on each mount point to trigger measurements.
+# . Confirm the IMA log correctly attributes events to the respective mounts.
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+ "btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+ "btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+ _unmount $mnt1 2>/dev/null
+ _unmount $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize device names and mount points
+filter_pool()
+{
+ sed -e "s|${devs[0]}|DEV1|g" -e "s|$mnt1|MNT1|g" \
+ -e "s|${devs[1]}|DEV2|g" -e "s|$mnt2|MNT2|g" | _filter_spaces
+}
+
+# Core helper to set IMA policy and check measurement logs
+do_ima()
+{
+ local ima_policy="/sys/kernel/security/ima/policy"
+ local ima_log="/sys/kernel/security/ima/ascii_runtime_measurements"
+ local fsuuid
+ local mnt=$1
+ local enable=$2
+
+ # Since the in-memory IMA audit log is only cleared upon reboot,
+ # use unique random filenames to avoid log collisions.
+ local foofile=$(mktemp --dry-run foobar_XXXXX)
+
+ echo $mnt $enable | filter_pool
+
+ [ -w "$ima_policy" ] || _notrun "IMA policy not writable"
+
+ fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+ # Load IMA policy to measure file access specifically for this
+ # filesystem UUID.
+ if [[ $enable -eq 1 ]]; then
+ echo "measure func=FILE_CHECK fsuuid=$fsuuid" > "$ima_policy" || \
+ _notrun "Policy rejected"
+ fi
+
+ # Create a file to trigger measurement and verify its entry in
+ # the IMA log.
+ echo "test_data" > $mnt/$foofile
+
+ # IMA log extract
+ grep $foofile "$ima_log" | awk '{ print $5 }' | filter_pool | \
+ sed "s/$foofile/FOOBAR_FILE/"
+
+ echo "dbg: $mnt $fsuuid $foofile" >> $seqres.full
+ cat $ima_log | tail -1 >> $seqres.full
+ echo >> $seqres.full
+}
+
+# Initialize loop base and cloned instances
+devs=()
+_loop_image_create_clone devs
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Concurrently mount both clones
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+# IMA response on baseline and clone configuration
+do_ima $mnt1 1
+do_ima $mnt2 0
+
+# Cycle mount on the second device.
+echo mount cycle
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || _fail "Failed to mount dev2"
+
+do_ima $mnt1 0
+do_ima $mnt2 0
+
+status=0
+exit
diff --git a/tests/generic/804.out b/tests/generic/804.out
new file mode 100644
index 000000000000..9804181d6c17
--- /dev/null
+++ b/tests/generic/804.out
@@ -0,0 +1,10 @@
+QA output created by 804
+MNT1 1
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
+mount cycle
+MNT1 0
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
--
2.43.0
* [PATCH v6 07/11] fstests: verify libblkid resolution of duplicate UUIDs
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Verify how findmnt and df (via libblkid) resolve device paths when
multiple block devices share the same FSUUID.
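The golden output of this test depends on replacing run-specific device and mount point paths with fixed tokens. Below is a minimal sketch of that normalization, using made-up paths and a hypothetical normalize_names function in place of the test's filter_pool, which additionally pipes its output through the fstests _filter_spaces helper.

```shell
# Made-up values; in the test these come from the loop setup and TEST_DIR.
dev1=/dev/loop0
mnt1=/mnt/test/803/mnt1

# Hypothetical stand-in for filter_pool: replace the device and mount
# point paths with stable tokens so the output can be compared verbatim.
normalize_names()
{
	sed -e "s|${dev1}|DEV1|g" -e "s|${mnt1}|MNT1|g"
}

echo "$dev1 $mnt1" | normalize_names    # prints DEV1 MNT1
```

Using "|" as the sed delimiter avoids having to escape the slashes in the paths.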
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/803 | 84 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/803.out | 19 ++++++++++
2 files changed, 103 insertions(+)
create mode 100644 tests/generic/803
create mode 100644 tests/generic/803.out
diff --git a/tests/generic/803 b/tests/generic/803
new file mode 100644
index 000000000000..b304a2743604
--- /dev/null
+++ b/tests/generic/803
@@ -0,0 +1,84 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test 803
+# Verify how libblkid resolves devices when multiple devices share the
+# same FSUUID.
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+ umount $mnt1 $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize pool device and mount point names
+filter_pool()
+{
+ sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+ -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Collect and print device tracking info from findmnt, /proc/mounts, and df.
+# This checks whether user-space tools get confused or remain accurate when
+# resolving a duplicate/cloned filesystem UUID.
+print_info()
+{
+ local mntpt=$1
+ local tgt=$(findmnt -no SOURCE $mntpt)
+ local fsuuid=$(blkid -s UUID -o value $tgt)
+
+ echo "mntpt=$mntpt tgt=$tgt fsuuid=$fsuuid" >> $seqres.full
+ echo
+ findmnt -o SOURCE,TARGET,UUID "$tgt" | tail -n +2 | \
+ sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+ awk -v dev="$tgt" '$1 == dev { print $1, $2 }' /proc/self/mounts | \
+ filter_pool
+ df --all --output=source,target "$tgt" | tail -n +2 | filter_pool
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+print_info $mnt1
+print_info $mnt2
+
+# Cycle mounts and reverse the initialization order to see if libblkid / findmnt
+# resolution changes based on mount order.
+echo
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+
+print_info $mnt1
+print_info $mnt2
+
+status=0
+exit
diff --git a/tests/generic/803.out b/tests/generic/803.out
new file mode 100644
index 000000000000..20a1cb36a213
--- /dev/null
+++ b/tests/generic/803.out
@@ -0,0 +1,19 @@
+QA output created by 803
+
+DEV1 MNT1 FSUUID
+DEV1 MNT1
+DEV1 MNT1
+
+DEV2 MNT2 FSUUID
+DEV2 MNT2
+DEV2 MNT2
+
+**** mount cycle ****
+
+DEV1 MNT1 FSUUID
+DEV1 MNT1
+DEV1 MNT1
+
+DEV2 MNT2 FSUUID
+DEV2 MNT2
+DEV2 MNT2
--
2.43.0
* [PATCH v6 06/11] fstests: verify f_fsid for cloned filesystems
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Verify that the cloned filesystem provides an f_fsid that is persistent
across mount cycles, yet unique from the original filesystem's f_fsid.
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/802 | 67 +++++++++++++++++++++++++++++++++++++++++++
tests/generic/802.out | 7 +++++
2 files changed, 74 insertions(+)
create mode 100644 tests/generic/802
create mode 100644 tests/generic/802.out
diff --git a/tests/generic/802 b/tests/generic/802
new file mode 100644
index 000000000000..653e74e11b53
--- /dev/null
+++ b/tests/generic/802
@@ -0,0 +1,67 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test 802
+# Verify f_fsid and s_uuid of cloned filesystems across a mount cycle.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+ "btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+ "btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+ cd /
+ rm -r -f $tmp.*
+ umount $mnt1 $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both filesystems simultaneously using mandatory clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+# Capture baseline filesystem IDs for comparison
+fsid_scratch=$(stat -f -c "%i" $mnt1)
+fsid_clone=$(stat -f -c "%i" $mnt2)
+
+echo "**** fsid initially ****"
+echo $fsid_scratch | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
+echo $fsid_clone | sed -e "s/$fsid_clone/FSID_CLONE/g"
+
+# Verify that the fsids remain stable after a mount cycle, even when the
+# mount order is reversed.
+echo "**** fsid after mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+
+# Compare post mount-cycle values against the baseline
+stat -f -c "%i" $mnt1 | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
+stat -f -c "%i" $mnt2 | sed -e "s/$fsid_clone/FSID_CLONE/g"
+
+status=0
+exit
diff --git a/tests/generic/802.out b/tests/generic/802.out
new file mode 100644
index 000000000000..d1e008f122bb
--- /dev/null
+++ b/tests/generic/802.out
@@ -0,0 +1,7 @@
+QA output created by 802
+**** fsid initially ****
+FSID_SCRATCH
+FSID_CLONE
+**** fsid after mount cycle ****
+FSID_SCRATCH
+FSID_CLONE
--
2.43.0
* [PATCH v6 05/11] fstests: verify fanotify isolation on cloned filesystems
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Verify that fanotify events are correctly routed to the appropriate
watcher when cloned filesystems are mounted.
This helps verify that the kernel's event notification distinguishes
between devices sharing the same FSID/UUID.
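As a worked example of the f_fsid conversion the test performs, the sketch below restates fsid_to_fid_parts using cut instead of bash substring expansion, so it should also run under a plain POSIX shell. The sample f_fsid value is made up, and the hi.lo layout is taken from the comment in the test rather than verified independently against fsnotifywait output.

```shell
# Sketch of the test's fsid_to_fid_parts(): pad the hex f_fsid printed by
# "stat -f -c %i" to 16 digits, split it into two 32-bit halves, and strip
# the leading zeros from each half.
fsid_to_fid_parts()
{
	local fsid=$1
	local padded hi lo

	padded=$(printf '%016x' "0x${fsid}")
	hi=$(printf '%x' "0x$(echo "$padded" | cut -c1-8)")
	lo=$(printf '%x' "0x$(echo "$padded" | cut -c9-16)")
	echo "${hi}.${lo}"
}

# Made-up sample value; it pads to 000000fd0000000a.
fsid_to_fid_parts fd0000000a    # prints fd.a
```

The padding step matters because stat may print fewer than 16 hex digits, in which case a naive split in the middle of the string would cut the value at the wrong boundary.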
Signed-off-by: Anand Jain <asj@kernel.org>
---
tests/generic/801 | 135 ++++++++++++++++++++++++++++++++++++++++++
tests/generic/801.out | 7 +++
2 files changed, 142 insertions(+)
create mode 100644 tests/generic/801
create mode 100644 tests/generic/801.out
diff --git a/tests/generic/801 b/tests/generic/801
new file mode 100644
index 000000000000..3bfb87d41922
--- /dev/null
+++ b/tests/generic/801
@@ -0,0 +1,135 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>. All Rights Reserved.
+#
+# FS QA Test 801
+# Verify fanotify FID functionality on cloned filesystems by setting up
+# watchers and making sure notifications are in the correct log files.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+_require_command "$FSNOTIFYWAIT_PROG" fsnotifywait
+_require_unique_f_fsid
+
+_cleanup()
+{
+ cd /
+ [[ -n $pid1 ]] && { kill -TERM "$pid1" 2> /dev/null; wait $pid1; }
+ [[ -n $pid2 ]] && { kill -TERM "$pid2" 2> /dev/null; wait $pid2; }
+
+ if [ "$semanage_added" = "yes" ]; then
+ semanage permissive -d unconfined_t >/dev/null 2>&1 || true
+ fi
+
+ umount $mnt1 $mnt2 2>/dev/null
+ _loop_image_destroy "${devs[@]}" 2> /dev/null
+ rm -r -f $tmp.*
+}
+
+# Run fsnotifywait line-buffered (stdbuf -oL) to watch filesystem-wide create events
+monitor_fanotify()
+{
+ local mmnt=$1
+ exec stdbuf -oL $FSNOTIFYWAIT_PROG -m -F -S -e create "$mmnt" 2>&1
+}
+
+# Transform f_fsid into the hi.lo format used in fanotify FID logs
+fsid_to_fid_parts()
+{
+ local fsid=$1
+ # Pad to 16 hex chars (64-bit), then split into two 32-bit halves
+ local padded=$(printf '%016x' "0x${fsid}")
+ local hi=$(printf '%x' "0x${padded:0:8}") # strips leading zeros
+ local lo=$(printf '%x' "0x${padded:8:8}") # strips leading zeros
+ echo "${hi}.${lo}"
+}
+
+# Create base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both base and clone filesystems using required clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+ _fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+ _fail "Failed to mount dev2"
+
+# Fetch filesystem IDs to verify the kernel can differentiate between them
+fsid1=$(stat -f -c "%i" $mnt1)
+fsid2=$(stat -f -c "%i" $mnt2)
+
+log1=$tmp.fanotify1
+log2=$tmp.fanotify2
+
+pid1=""
+pid2=""
+echo "Setup FID fanotify watchers on both mnt1 and mnt2"
+
+# Permit unconfined_t domains when SELinux is enforcing to prevent fanotify
+# blockages
+semanage_added="no"
+if [ "$(getenforce 2>/dev/null)" = "Enforcing" ]; then
+ if ! semanage permissive -l | grep -q "unconfined_t"; then
+ semanage permissive -a unconfined_t >/dev/null 2>&1 && semanage_added="yes"
+ fi
+fi
+
+# Start asynchronous fanotify monitors
+( monitor_fanotify "$mnt1" > "$log1" ) &
+pid1=$!
+( monitor_fanotify "$mnt2" > "$log2" ) &
+pid2=$!
+sleep 2
+
+echo "Trigger file creation on mnt1"
+touch $mnt1/file_on_mnt1
+sync
+sleep 1
+
+echo "Trigger file creation on mnt2"
+touch $mnt2/file_on_mnt2
+sync
+sleep 1
+
+echo "Verify fsid in the fanotify"
+kill $pid1 $pid2
+wait $pid1 $pid2 2>/dev/null
+pid1=""
+pid2=""
+
+e_fsid1=$(fsid_to_fid_parts "$fsid1")
+e_fsid2=$(fsid_to_fid_parts "$fsid2")
+
+# Dump debug details to the full log
+echo $fsid1 $e_fsid1 $fsid2 $e_fsid2 >> $seqres.full
+cat $log1 >> $seqres.full
+cat $log2 >> $seqres.full
+
+# Ensure monitor 1 only captured events belonging to mnt 1 and fsid 1
+if grep -qF "$e_fsid1" "$log1" && ! grep -qF "$e_fsid2" "$log1"; then
+ echo "SUCCESS: mnt1 events found"
+else
+ [ ! -s "$log1" ] && echo " - mnt1 received no events."
+ grep -qF "$e_fsid2" "$log1" && echo " - mnt1 received event from mnt2."
+fi
+
+# Ensure monitor 2 only captured events belonging to mnt 2 and fsid 2
+if grep -qF "$e_fsid2" "$log2" && ! grep -qF "$e_fsid1" "$log2"; then
+ echo "SUCCESS: mnt2 events found"
+else
+ [ ! -s "$log2" ] && echo " - mnt2 received no events."
+ grep -qF "$e_fsid1" "$log2" && echo " - mnt2 received event from mnt1."
+fi
+
+status=0
+exit
diff --git a/tests/generic/801.out b/tests/generic/801.out
new file mode 100644
index 000000000000..d7b318d9f27c
--- /dev/null
+++ b/tests/generic/801.out
@@ -0,0 +1,7 @@
+QA output created by 801
+Setup FID fanotify watchers on both mnt1 and mnt2
+Trigger file creation on mnt1
+Trigger file creation on mnt2
+Verify fsid in the fanotify
+SUCCESS: mnt1 events found
+SUCCESS: mnt2 events found
--
2.43.0
* [PATCH v6 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Add a helper to check if the target filesystem supports unique f_fsid
tracking across cloned or snapshot instances.
Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
identifiers per filesystem instance. However, Ext4 derives its f_fsid
directly from its superblock UUID, which leads to identical f_fsid
values on cloned images until the UUID is manually modified by userspace.
Introduce _require_unique_f_fsid() to allow test cases requiring strict
f_fsid uniqueness to skip gracefully on unsupported filesystems.
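To illustrate the property this helper keys on, here is a minimal sketch (not part of the patch) that compares the f_fsid reported by two paths. The names f_fsid_of and mounts_have_unique_fsid are hypothetical; only the use of stat -f -c '%i' is taken from the tests in this series.

```shell
# Hypothetical helper: print the f_fsid of the filesystem holding path $1.
f_fsid_of()
{
	stat -f -c '%i' "$1"
}

# Hypothetical helper: return 0 if the two paths report different f_fsid
# values, 1 if the values are identical, 2 if either stat call fails.
mounts_have_unique_fsid()
{
	local a b

	a=$(f_fsid_of "$1") || return 2
	b=$(f_fsid_of "$2") || return 2
	[ "$a" != "$b" ]
}
```

Per the description above, ext4 clones with an unchanged UUID would report identical values here, which is the case _require_unique_f_fsid() skips. Running such a check dynamically needs both clones mounted, which is why the helper in this patch uses a static FSTYP lookup instead.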
Signed-off-by: Anand Jain <asj@kernel.org>
---
common/rc | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/common/rc b/common/rc
index 937f478963b4..5446552aed92 100644
--- a/common/rc
+++ b/common/rc
@@ -6314,6 +6314,27 @@ _require_fanotify_ioerrors()
_notrun "$FSTYP does not support fanotify ioerrors"
}
+# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
+# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
+# and F2FS ensure f_fsid remains unique per filesystem instance (often by
+# deriving it from the UUID and underlying block device).
+#
+# Across all filesystems, a UUID collision causes libblkid tools to return
+# non-deterministic device mappings. It is ultimately the responsibility
+# of the userspace utility or use-case to enforce uniqueness when a clone
+# diverges. For details, see mailing list thread discussions titled:
+# "ext4: derive f_fsid from block device to avoid collisions".
+_require_unique_f_fsid()
+{
+ # Skip the test if the filesystem does not enforce unique f_fsids
+ # natively. Checking this dynamically requires recreating a clone
+ # layout, so we use a static lookup based on FSTYP.
+ if [ "$FSTYP" == "ext4" ]; then
+ _notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
+ fi
+}
+
+
# Computes a percentage of the available space in a filesystem and
# returns that quantity in MB. The percentage must not contain a percent
# sign ("%").
--
2.43.0
* [PATCH v6 03/11] fstests: add FSNOTIFYWAIT_PROG
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Define `FSNOTIFYWAIT_PROG` for an upcoming test case that uses `fsnotifywait`.
Signed-off-by: Anand Jain <asj@kernel.org>
---
common/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/common/config b/common/config
index d5299d5b926f..5661fa0ec310 100644
--- a/common/config
+++ b/common/config
@@ -242,6 +242,7 @@ export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
export PARTED_PROG="$(type -P parted)"
export XFS_PROPERTY_PROG="$(type -P xfs_property)"
export FSCRYPTCTL_PROG="$(type -P fscryptctl)"
+export FSNOTIFYWAIT_PROG="$(type -P fsnotifywait)"
# udev wait functions.
#
--
2.43.0
* [PATCH v6 02/11] fstests: add _clone_mount_option() helper
From: Anand Jain @ 2026-05-28 4:05 UTC
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Add a _clone_mount_option() helper that returns the filesystem-specific
options needed to mount a cloned device. For now this abstracts the need
for -o nouuid on XFS.
Signed-off-by: Anand Jain <asj@kernel.org>
---
common/rc | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/common/rc b/common/rc
index d7e3e0bdfb1e..937f478963b4 100644
--- a/common/rc
+++ b/common/rc
@@ -414,6 +414,23 @@ _scratch_mount_options()
$SCRATCH_DEV $SCRATCH_MNT
}
+# Return filesystem-specific mount options required for mounting clone/snapshot
+# devices.
+_clone_mount_option()
+{
+ local mount_opts=""
+
+ case "$FSTYP" in
+ xfs)
+ # Allow mounting a duplicate filesystem on the same host
+ mount_opts="-o nouuid"
+ ;;
+ *)
+ esac
+
+ echo $mount_opts
+}
+
_supports_filetype()
{
local dir=$1
--
2.43.0
* [PATCH v6 01/11] fstests: add _loop_image_create_clone() helper
From: Anand Jain @ 2026-05-28 4:05 UTC (permalink / raw)
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
In-Reply-To: <cover.1779939330.git.asj@kernel.org>
Introduce _loop_image_create_clone(), which runs mkfs on an image file,
copies it to a second image file, and attaches a loop device to each.
Also add its teardown counterpart, _loop_image_destroy().
Signed-off-by: Anand Jain <asj@kernel.org>
---
common/rc | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/common/rc b/common/rc
index 79189e7e6e94..d7e3e0bdfb1e 100644
--- a/common/rc
+++ b/common/rc
@@ -1520,6 +1520,69 @@ _scratch_resvblks()
esac
}
+# Create a small loop image, run an optional tuning function ($2) on it,
+# clone it, and attach both to loop devices, returned in ($1).
+# Args:
+# $1: Nameref to return the array of allocated loop devices [base, clone].
+# $2: Optional callback function to tune the base filesystem before cloning.
+_loop_image_create_clone()
+{
+ local -n _ret=$1
+ local pre_clone_tune_func="$2"
+ local img_file=$TEST_DIR/${seq}.img
+ local img_file_clone=$TEST_DIR/${seq}_clone.img
+ local size=$(_small_fs_size_mb 128) # Smallest possible
+ local loop_devs
+
+ # Since we copy the block device image, we keep its size small.
+ _require_fs_space $TEST_DIR $((size * 1024))
+
+ _create_file_sized $((size * 1024 * 1024)) $img_file ||
+ _fail "Failed: Create $img_file $size"
+
+ loop_devs=$(_create_loop_device $img_file)
+ _ret=($loop_devs)
+
+ case $FSTYP in
+ xfs)
+ _mkfs_dev "-s size=4096" ${loop_devs[0]}
+ ;;
+ btrfs)
+ _mkfs_dev ${loop_devs[0]}
+ ;;
+ *)
+ _mkfs_dev ${loop_devs[0]}
+ ;;
+ esac
+
+ # Only execute if the function argument is not empty
+ if [ -n "$pre_clone_tune_func" ]; then
+ $pre_clone_tune_func ${loop_devs[0]}
+ fi
+
+ sync ${loop_devs[0]}
+ cp $img_file $img_file_clone
+
+ loop_devs="$loop_devs $(_create_loop_device $img_file_clone)"
+
+ _ret=($loop_devs)
+}
+
+# Teardown loop devices and delete their underlying backing image files.
+# Accepts a list of loop device paths (e.g., /dev/loop0 /dev/loop1).
+_loop_image_destroy()
+{
+ for d in "$@"; do
+ # Retrieve the path of the backing file
+ local f=$(losetup --noheadings --output BACK-FILE $d)
+
+ # Detach the loop device from the backing file
+ _destroy_loop_device "$d"
+
+ # Clean up the backing disk image file
+ [ -n "$f" ] && rm -f "$f"
+ done
+}
# Repair scratch filesystem. Returns 0 if the FS is good to go (either no
# errors found or errors were fixed) and nonzero otherwise; also spits out
--
2.43.0
* [PATCH v6 0/11] fstests: add test coverage for cloned filesystem ids
From: Anand Jain @ 2026-05-28 4:05 UTC (permalink / raw)
To: fstests; +Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch
v6:
. Renamed `pre_clone_tune_uuid()` to `_change_metadata_uuid()`.
. Created the `_require_unique_f_fsid()` helper instead of handling it inside the test case (patch 5/11).
. Separated `FSNOTIFYWAIT_PROG` into its own patch.
. Dropped the `inotify` test case in favor of `fsnotify`.
. Added comments throughout, especially for helper functions.
v5:
https://lore.kernel.org/fstests/cover.1779367627.git.asj@kernel.org
v4:
https://lore.kernel.org/fstests/cover.1777357320.git.asj@kernel.org
v3:
https://lore.kernel.org/fstests/cover.1777281778.git.asj@kernel.org
v2:
https://lore.kernel.org/fstests/cover.1774090817.git.asj@kernel.org
v1:
https://lore.kernel.org/fstests/cover.1772095513.git.asj@kernel.org
This series adds fstests infrastructure and test cases to verify correct
filesystem identity when a filesystem is cloned (block-level copy).
The tests cover fanotify, f_fsid, libblkid, IMA, and exportfs file handles,
and verify libblkid tools with metadata_uuid.
New helpers:
- _loop_image_create_clone() and _loop_image_destroy() create a filesystem
image and its clone, and tear them down
- _clone_mount_option() returns the per-filesystem clone mount options
- _change_metadata_uuid() changes the UUID before the clone
New tests:
- fanotify events are isolated between cloned filesystems
- f_fsid is unique across cloned filesystem instances
- libblkid correctly resolves duplicate UUIDs to distinct devices
with and without metadata_uuid
- IMA distinct identity for each cloned filesystem
- exportfs file handles resolve correctly on cloned filesystems
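For readers unfamiliar with the f_fsid check, here is a minimal userspace
sketch of the idea, not the code used by the tests in this series. It
assumes Linux statfs(2); the helper name fsid_differs() is hypothetical.

```c
#include <string.h>
#include <sys/vfs.h>

/*
 * Hypothetical helper, for illustration only.
 * Return 1 if the two mount points report different f_fsid values,
 * 0 if they report the same value, and -1 if statfs() fails.
 */
static int fsid_differs(const char *mnt_a, const char *mnt_b)
{
	struct statfs a, b;

	if (statfs(mnt_a, &a) != 0 || statfs(mnt_b, &b) != 0)
		return -1;

	/* fsid_t is opaque; compare it bytewise instead of by field name. */
	return memcmp(&a.f_fsid, &b.f_fsid, sizeof(a.f_fsid)) != 0;
}
```

Called with the mount points of the base and clone loop devices, a return
value of 1 would indicate that the two instances are distinguishable by
f_fsid.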
Kernel Patches:
Btrfs requires the kernel patches in [1] for all tests to pass.
[1] https://lore.kernel.org/linux-btrfs/cover.1777281686.git.asj@kernel.org
Anand Jain (11):
fstests: add _loop_image_create_clone() helper
fstests: add _clone_mount_option() helper
fstests: add FSNOTIFYWAIT_PROG
fstests: add _require_unique_f_fsid() helper
fstests: verify fanotify isolation on cloned filesystems
fstests: verify f_fsid for cloned filesystems
fstests: verify libblkid resolution of duplicate UUIDs
fstests: verify IMA isolation on cloned filesystems
fstests: verify exportfs file handles on cloned filesystems
fstests: add _change_metadata_uuid helper
fstests: test UUID consistency for clones with metadata_uuid
common/config | 1 +
common/rc | 124 ++++++++++++++++++++++++++++++++++++++
tests/generic/801 | 135 ++++++++++++++++++++++++++++++++++++++++++
tests/generic/801.out | 7 +++
tests/generic/802 | 67 +++++++++++++++++++++
tests/generic/802.out | 7 +++
tests/generic/803 | 84 ++++++++++++++++++++++++++
tests/generic/803.out | 19 ++++++
tests/generic/804 | 108 +++++++++++++++++++++++++++++++++
tests/generic/804.out | 10 ++++
tests/generic/805 | 80 +++++++++++++++++++++++++
tests/generic/805.out | 2 +
tests/generic/806 | 84 ++++++++++++++++++++++++++
tests/generic/806.out | 19 ++++++
14 files changed, 747 insertions(+)
create mode 100644 tests/generic/801
create mode 100644 tests/generic/801.out
create mode 100644 tests/generic/802
create mode 100644 tests/generic/802.out
create mode 100644 tests/generic/803
create mode 100644 tests/generic/803.out
create mode 100644 tests/generic/804
create mode 100644 tests/generic/804.out
create mode 100644 tests/generic/805
create mode 100644 tests/generic/805.out
create mode 100644 tests/generic/806
create mode 100644 tests/generic/806.out
--
2.43.0
* Re: [PATCH v5 05/10] fstests: verify f_fsid for cloned filesystems
From: Anand Jain @ 2026-05-27 22:41 UTC (permalink / raw)
To: Christoph Hellwig, Anand Jain
Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
amir73il, zlang
In-Reply-To: <ahP2hjv9zl_WL1kg@infradead.org>
On 25/5/26 15:13, Christoph Hellwig wrote:
> On Thu, May 21, 2026 at 08:54:55PM +0800, Anand Jain wrote:
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> + "btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> + "btrfs: derive f_fsid from on-disk fsuuid and dev_t"
>
> Seems like these are stuck on the btrfs list. Any progress on that?
They are in the `for-next` branch of the Btrfs Linux repository:
https://github.com/btrfs/linux.git
* Re: [PATCH] ext4: replace ext4_dir_entry with ext4_dir_entry_2
From: Andreas Dilger @ 2026-05-27 22:28 UTC (permalink / raw)
To: Artem Blagodarenko; +Cc: Theodore Ts'o, linux-ext4
In-Reply-To: <20260526233608.7600-1-ablagodarenko@thelustrecollective.com>
On May 26, 2026, at 17:36, Artem Blagodarenko <artem.blagodarenko@gmail.com> wrote:
>
> From: Artem Blagodarenko <artem.blagodarenko@gmail.com>
>
> Replace remaining uses of struct ext4_dir_entry in namei.c
> with struct ext4_dir_entry_2.
>
> The code paths affected by this change already depend on the
> filetype feature, so using struct ext4_dir_entry_2 is
> appropriate and avoids mixing the two directory entry types
> unnecessarily.
>
> This change does not affect support for 16-bit rec_len.
>
> Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> ---
> fs/ext4/namei.c | 36 ++++++++++++++++++------------------
> 1 file changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4a47fbd8dd30..a316fc2ac41b 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -102,7 +102,7 @@ static struct buffer_head *ext4_append(handle_t *handle,
> }
>
> static int ext4_dx_csum_verify(struct inode *inode,
> - struct ext4_dir_entry *dirent);
> + struct ext4_dir_entry_2 *dirent);
>
> /*
> * Hints to ext4_read_dirblock regarding whether we expect a directory
> @@ -128,7 +128,7 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode,
> unsigned int line)
> {
> struct buffer_head *bh;
> - struct ext4_dir_entry *dirent;
> + struct ext4_dir_entry_2 *dirent;
> int is_dx_block = 0;
>
> if (block >= inode->i_size >> inode->i_blkbits) {
> @@ -160,7 +160,7 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode,
> }
> if (!bh)
> return NULL;
> - dirent = (struct ext4_dir_entry *) bh->b_data;
> + dirent = (struct ext4_dir_entry_2 *) bh->b_data;
> /* Determine whether or not we have an index block */
> if (is_dx(inode)) {
> if (block == 0)
> @@ -317,13 +317,13 @@ static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
> int blocksize = EXT4_BLOCK_SIZE(inode->i_sb);
>
> #ifdef PARANOID
> - struct ext4_dir_entry *d, *top;
> + struct ext4_dir_entry_2 *d, *top;
>
> - d = (struct ext4_dir_entry *)bh->b_data;
> - top = (struct ext4_dir_entry *)(bh->b_data +
> + d = (struct ext4_dir_entry_2 *)bh->b_data;
> + top = (struct ext4_dir_entry_2 *)(bh->b_data +
> (blocksize - sizeof(struct ext4_dir_entry_tail)));
> while (d < top && ext4_rec_len_from_disk(d->rec_len, blocksize))
> - d = (struct ext4_dir_entry *)(((void *)d) +
> + d = (struct ext4_dir_entry_2 *)(((void *)d) +
> ext4_rec_len_from_disk(d->rec_len, blocksize));
>
> if (d != top)
> @@ -410,10 +410,10 @@ int ext4_handle_dirty_dirblock(handle_t *handle,
> }
>
> static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> - struct ext4_dir_entry *dirent,
> + struct ext4_dir_entry_2 *dirent,
> int *offset)
> {
> - struct ext4_dir_entry *dp;
> + struct ext4_dir_entry_2 *de;
> struct dx_root_info *root;
> int count_offset;
> int blocksize = EXT4_BLOCK_SIZE(inode->i_sb);
> @@ -422,10 +422,10 @@ static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> if (rlen == blocksize)
> count_offset = 8;
> else if (rlen == 12) {
> - dp = (struct ext4_dir_entry *)(((void *)dirent) + 12);
> - if (ext4_rec_len_from_disk(dp->rec_len, blocksize) != blocksize - 12)
> + de = (struct ext4_dir_entry_2 *)(((void *)dirent) + 12);
> + if (ext4_rec_len_from_disk(de->rec_len, blocksize) != blocksize - 12)
> return NULL;
> - root = (struct dx_root_info *)(((void *)dp + 12));
> + root = (struct dx_root_info *)(((void *)de + 12));
> if (root->reserved_zero ||
> root->info_length != sizeof(struct dx_root_info))
> return NULL;
> @@ -438,7 +438,7 @@ static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> return (struct dx_countlimit *)(((void *)dirent) + count_offset);
> }
>
> -static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry *dirent,
> +static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry_2 *dirent,
> int count_offset, int count, struct dx_tail *t)
> {
> struct ext4_inode_info *ei = EXT4_I(inode);
> @@ -456,7 +456,7 @@ static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry *dirent,
> }
>
> static int ext4_dx_csum_verify(struct inode *inode,
> - struct ext4_dir_entry *dirent)
> + struct ext4_dir_entry_2 *dirent)
> {
> struct dx_countlimit *c;
> struct dx_tail *t;
> @@ -485,7 +485,7 @@ static int ext4_dx_csum_verify(struct inode *inode,
> return 1;
> }
>
> -static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry *dirent)
> +static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry_2 *dirent)
> {
> struct dx_countlimit *c;
> struct dx_tail *t;
> @@ -515,7 +515,7 @@ static inline int ext4_handle_dirty_dx_node(handle_t *handle,
> struct inode *inode,
> struct buffer_head *bh)
> {
> - ext4_dx_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
> + ext4_dx_csum_set(inode, (struct ext4_dir_entry_2 *)bh->b_data);
> return ext4_handle_dirty_metadata(handle, inode, bh);
> }
>
> @@ -1488,7 +1488,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
> }
>
> static int is_dx_internal_node(struct inode *dir, ext4_lblk_t block,
> - struct ext4_dir_entry *de)
> + struct ext4_dir_entry_2 *de)
> {
> struct super_block *sb = dir->i_sb;
>
> @@ -1619,7 +1619,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
> }
> if (!buffer_verified(bh) &&
> !is_dx_internal_node(dir, block,
> - (struct ext4_dir_entry *)bh->b_data) &&
> + (struct ext4_dir_entry_2 *)bh->b_data) &&
> !ext4_dirblock_csum_verify(dir, bh)) {
> EXT4_ERROR_INODE_ERR(dir, EFSBADCRC,
> "checksumming directory "
> --
> 2.43.7
>
Cheers, Andreas
* Re: [PATCH] ext4: add ext4_dir_entry_is_tail()
From: Andreas Dilger @ 2026-05-27 22:26 UTC (permalink / raw)
To: Artem Blagodarenko; +Cc: linux-ext4
In-Reply-To: <20260526233816.7654-1-ablagodarenko@thelustrecollective.com>
On May 26, 2026, at 17:38, Artem Blagodarenko <artem.blagodarenko@gmail.com> wrote:
>
> From: Artem Blagodarenko <artem.blagodarenko@gmail.com>
>
> Replace open-coded checks for directory tail entries with a call
> to ext4_dir_entry_is_tail(). This helper will also be used by
> upcoming changes.
>
> Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> ---
> fs/ext4/ext4.h | 16 ++++++++++++++++
> fs/ext4/namei.c | 7 +------
> 2 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..01b1222b1454 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3917,6 +3917,22 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> io_end->flag &= ~EXT4_IO_END_UNWRITTEN;
> }
>
> +/*
> + * ext4_dir_entry_is_tail() - Check if a directory entry is a tail entry.
> + * @de: directory entry to check
> + *
> + * Returns true if @de is a directory block tail entry (checksum record).
> + */
> +static inline bool ext4_dir_entry_is_tail(struct ext4_dir_entry_2 *de)
> +{
> + struct ext4_dir_entry_tail *t = (struct ext4_dir_entry_tail *)de;
> +
> + return !t->det_reserved_zero1 &&
> + le16_to_cpu(t->det_rec_len) == sizeof(*t) &&
> + !t->det_reserved_zero2 &&
> + t->det_reserved_ft == EXT4_FT_DIR_CSUM;
> +}
> +
> extern const struct iomap_ops ext4_iomap_ops;
> extern const struct iomap_ops ext4_iomap_report_ops;
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4a47fbd8dd30..accf63fbbc79 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -314,7 +314,6 @@ static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
> struct buffer_head *bh)
> {
> struct ext4_dir_entry_tail *t;
> - int blocksize = EXT4_BLOCK_SIZE(inode->i_sb);
>
> #ifdef PARANOID
> struct ext4_dir_entry *d, *top;
> @@ -334,11 +333,7 @@ static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
> t = EXT4_DIRENT_TAIL(bh->b_data, EXT4_BLOCK_SIZE(inode->i_sb));
> #endif
>
> - if (t->det_reserved_zero1 ||
> - (ext4_rec_len_from_disk(t->det_rec_len, blocksize) !=
> - sizeof(struct ext4_dir_entry_tail)) ||
> - t->det_reserved_zero2 ||
> - t->det_reserved_ft != EXT4_FT_DIR_CSUM)
> + if (!ext4_dir_entry_is_tail((struct ext4_dir_entry_2 *)t))
> return NULL;
>
> return t;
> --
> 2.43.7
>
Cheers, Andreas
* Re: [PATCH] jbd2: Remove special jbd2 slabs
From: Tal Zussman @ 2026-05-27 20:33 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), Theodore Ts'o
Cc: Jan Kara, linux-ext4, linux-fsdevel, Mike Rapoport (Microsoft),
Vlastimil Babka
In-Reply-To: <20260525201321.21717-1-willy@infradead.org>
On 5/25/26 4:13 PM, Matthew Wilcox (Oracle) wrote:
Hi,
One small comment below.
> When jbd2 was originally written, kmalloc() would not guarantee alignment
> for the requested memory. Since commit 59bb47985c1d in 2019, kmalloc
> has guaranteed natural alignment for power-of-two allocations. We can
> now remove the jbd2 special slabs and just use kmalloc() directly.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> fs/jbd2/commit.c | 8 ++-
> fs/jbd2/journal.c | 121 ++----------------------------------------
> fs/jbd2/transaction.c | 8 +--
> include/linux/jbd2.h | 3 --
> 4 files changed, 11 insertions(+), 129 deletions(-)
>
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 38f318bb4279..2e8dbc4547bb 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -514,10 +514,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> * leave undo-committed data.
> */
> if (jh->b_committed_data) {
> - struct buffer_head *bh = jh2bh(jh);
> -
> spin_lock(&jh->b_state_lock);
> - jbd2_free(jh->b_committed_data, bh->b_size);
> + kfree(jh->b_committed_data);
> jh->b_committed_data = NULL;
> spin_unlock(&jh->b_state_lock);
> }
> @@ -978,7 +976,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> * its triggers if they exist, so we can clear that too.
> */
> if (jh->b_committed_data) {
> - jbd2_free(jh->b_committed_data, bh->b_size);
> + kfree(jh->b_committed_data);
> jh->b_committed_data = NULL;
> if (jh->b_frozen_data) {
> jh->b_committed_data = jh->b_frozen_data;
> @@ -986,7 +984,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> jh->b_frozen_triggers = NULL;
> }
> } else if (jh->b_frozen_data) {
> - jbd2_free(jh->b_frozen_data, bh->b_size);
> + kfree(jh->b_frozen_data);
> jh->b_frozen_data = NULL;
> jh->b_frozen_triggers = NULL;
> }
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index a6616380ce38..ad10c8a92fa0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -95,8 +95,6 @@ EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> EXPORT_SYMBOL(jbd2_inode_cache);
>
> -static int jbd2_journal_create_slab(size_t slab_size);
> -
> #ifdef CONFIG_JBD2_DEBUG
> void __jbd2_debug(int level, const char *file, const char *func,
> unsigned int line, const char *fmt, ...)
> @@ -385,10 +383,10 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
> goto escape_done;
>
> spin_unlock(&jh_in->b_state_lock);
> - tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS | __GFP_NOFAIL);
> + tmp = kmalloc(bh_in->b_size, GFP_NOFS | __GFP_NOFAIL);
> spin_lock(&jh_in->b_state_lock);
> if (jh_in->b_frozen_data) {
> - jbd2_free(tmp, bh_in->b_size);
> + kfree(tmp);
> goto copy_done;
> }
>
> @@ -2063,14 +2061,6 @@ EXPORT_SYMBOL(jbd2_journal_update_sb_errno);
> int jbd2_journal_load(journal_t *journal)
> {
> int err;
> - journal_superblock_t *sb = journal->j_superblock;
> -
> - /*
> - * Create a slab for this blocksize
> - */
> - err = jbd2_journal_create_slab(be32_to_cpu(sb->s_blocksize));
> - if (err)
> - return err;
>
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> @@ -2698,108 +2688,6 @@ size_t journal_tag_bytes(journal_t *journal)
> return sz - sizeof(__u32);
> }
>
> -/*
> - * JBD memory management
> - *
> - * These functions are used to allocate block-sized chunks of memory
> - * used for making copies of buffer_head data. Very often it will be
> - * page-sized chunks of data, but sometimes it will be in
> - * sub-page-size chunks. (For example, 16k pages on Power systems
> - * with a 4k block file system.) For blocks smaller than a page, we
> - * use a SLAB allocator. There are slab caches for each block size,
> - * which are allocated at mount time, if necessary, and we only free
> - * (all of) the slab caches when/if the jbd2 module is unloaded. For
> - * this reason we don't need to a mutex to protect access to
> - * jbd2_slab[] allocating or releasing memory; only in
> - * jbd2_journal_create_slab().
> - */
> -#define JBD2_MAX_SLABS 8
> -static struct kmem_cache *jbd2_slab[JBD2_MAX_SLABS];
> -
> -static const char *jbd2_slab_names[JBD2_MAX_SLABS] = {
> - "jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
> - "jbd2_16k", "jbd2_32k", "jbd2_64k", "jbd2_128k"
> -};
> -
> -
> -static void jbd2_journal_destroy_slabs(void)
> -{
> - int i;
> -
> - for (i = 0; i < JBD2_MAX_SLABS; i++) {
> - kmem_cache_destroy(jbd2_slab[i]);
> - jbd2_slab[i] = NULL;
> - }
> -}
> -
> -static int jbd2_journal_create_slab(size_t size)
> -{
> - static DEFINE_MUTEX(jbd2_slab_create_mutex);
> - int i = order_base_2(size) - 10;
> - size_t slab_size;
> -
> - if (size == PAGE_SIZE)
> - return 0;
> -
> - if (i >= JBD2_MAX_SLABS)
> - return -EINVAL;
> -
> - if (unlikely(i < 0))
> - i = 0;
> - mutex_lock(&jbd2_slab_create_mutex);
> - if (jbd2_slab[i]) {
> - mutex_unlock(&jbd2_slab_create_mutex);
> - return 0; /* Already created */
> - }
> -
> - slab_size = 1 << (i+10);
> - jbd2_slab[i] = kmem_cache_create(jbd2_slab_names[i], slab_size,
> - slab_size, 0, NULL);
> - mutex_unlock(&jbd2_slab_create_mutex);
> - if (!jbd2_slab[i]) {
> - printk(KERN_EMERG "JBD2: no memory for jbd2_slab cache\n");
> - return -ENOMEM;
> - }
> - return 0;
> -}
> -
> -static struct kmem_cache *get_slab(size_t size)
> -{
> - int i = order_base_2(size) - 10;
> -
> - BUG_ON(i >= JBD2_MAX_SLABS);
> - if (unlikely(i < 0))
> - i = 0;
> - BUG_ON(jbd2_slab[i] == NULL);
> - return jbd2_slab[i];
> -}
> -
> -void *jbd2_alloc(size_t size, gfp_t flags)
> -{
> - void *ptr;
> -
> - BUG_ON(size & (size-1)); /* Must be a power of 2 */
> -
> - if (size < PAGE_SIZE)
> - ptr = kmem_cache_alloc(get_slab(size), flags);
> - else
> - ptr = (void *)__get_free_pages(flags, get_order(size));
> -
> - /* Check alignment; SLUB has gotten this wrong in the past,
> - * and this can lead to user data corruption! */
> - BUG_ON(((unsigned long) ptr) & (size-1));
> -
> - return ptr;
> -}
> -
> -void jbd2_free(void *ptr, size_t size)
> -{
> - if (size < PAGE_SIZE)
> - kmem_cache_free(get_slab(size), ptr);
> - else
> - free_pages((unsigned long)ptr, get_order(size));
> -};
> -
> /*
> * Journal_head storage management
> */
> @@ -2977,11 +2865,11 @@ static void journal_release_journal_head(struct journal_head *jh, size_t b_size)
I think the b_size parameter can be removed from journal_release_journal_head()
and its single caller now.
> {
> if (jh->b_frozen_data) {
> printk(KERN_WARNING "%s: freeing b_frozen_data\n", __func__);
> - jbd2_free(jh->b_frozen_data, b_size);
> + kfree(jh->b_frozen_data);
> }
> if (jh->b_committed_data) {
> printk(KERN_WARNING "%s: freeing b_committed_data\n", __func__);
> - jbd2_free(jh->b_committed_data, b_size);
> + kfree(jh->b_committed_data);
> }
> journal_free_journal_head(jh);
> }
> @@ -3142,7 +3030,6 @@ static void jbd2_journal_destroy_caches(void)
> jbd2_journal_destroy_handle_cache();
> jbd2_journal_destroy_inode_cache();
> jbd2_journal_destroy_transaction_cache();
> - jbd2_journal_destroy_slabs();
> }
>
> static int __init journal_init(void)
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..48ddb566d12d 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1131,7 +1131,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> if (!frozen_buffer) {
> JBUFFER_TRACE(jh, "allocate memory for buffer");
> spin_unlock(&jh->b_state_lock);
> - frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
> + frozen_buffer = kmalloc(jh2bh(jh)->b_size,
> GFP_NOFS | __GFP_NOFAIL);
> goto repeat;
> }
> @@ -1159,7 +1159,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>
> out:
> if (unlikely(frozen_buffer)) /* It's usually NULL */
> - jbd2_free(frozen_buffer, bh->b_size);
> + kfree(frozen_buffer);
>
> JBUFFER_TRACE(jh, "exit");
> return error;
> @@ -1424,7 +1424,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>
> repeat:
> if (!jh->b_committed_data)
> - committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> + committed_data = kmalloc(jh2bh(jh)->b_size,
> GFP_NOFS|__GFP_NOFAIL);
>
> spin_lock(&jh->b_state_lock);
> @@ -1445,7 +1445,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
> out:
> jbd2_journal_put_journal_head(jh);
> if (unlikely(committed_data))
> - jbd2_free(committed_data, bh->b_size);
> + kfree(committed_data);
> return err;
> }
>
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 7e785aa6d35d..b68561187e90 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -63,9 +63,6 @@ void __jbd2_debug(int level, const char *file, const char *func,
> #define jbd2_debug(n, fmt, a...) no_printk(fmt, ##a)
> #endif
>
> -extern void *jbd2_alloc(size_t size, gfp_t flags);
> -extern void jbd2_free(void *ptr, size_t size);
> -
> #define JBD2_MIN_JOURNAL_BLOCKS 1024
> #define JBD2_DEFAULT_FAST_COMMIT_BLOCKS 256
>
* Re: [PATCH v2 2/2] ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary
From: Andreas Dilger @ 2026-05-27 19:49 UTC (permalink / raw)
To: Bohdan Trach
Cc: Theodore Ts'o, Baokun Li, Jan Kara, Ojaswin Mujoo,
Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei, bohdan.trach,
lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-3-bohdan.trach@huaweicloud.com>
On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@huaweicloud.com> wrote:
>
> Getting ext4_group_desc structure can contribute to the cost of
> ext4_mb_prefetch() without any need, as most groups fail the
> !EXT4_MB_GRP_TEST_AND_SET_READ check.
>
> Optimize ext4_mb_prefetch by getting the group description only when
> necessary.
>
> The result is a further increase in performance of the fallocate() system
> call path that triggers ext4_mb_prefetch() via a linear group scan.
>
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
This looks reasonable, and is independent of the EXT4_MB_GRP_TEST_AND_SET_READ()
micro-optimization in the 1/2 patch.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cheers, Andreas
* Re: [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
From: Andreas Dilger @ 2026-05-27 19:46 UTC (permalink / raw)
To: Bohdan Trach
Cc: Theodore Ts'o, Baokun Li, Jan Kara, Ojaswin Mujoo,
Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei, bohdan.trach,
lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-2-bohdan.trach@huaweicloud.com>
On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@huaweicloud.com> wrote:
>
> EXT4_MB_GRP_TEST_AND_SET_READ uses test_and_set_bit function which
> issues an atomic write. This can cause high overhead due to cache
> contention when multiple threads iterate over groups in a tight loop,
> as is the case for ext4_mb_prefetch(). We have seen this to be a
> problem for Kunpeng 920b CPUs, which use a single ARM LSE instruction
> for this purpose.
>
> Avoid this unconditional atomic write by testing the bit first without
> changing its value. This is OK for this use case as this bit is never
> unset.
>
> This change significantly reduces costs of fallocate() operations which
> trigger linear group scans on large multicore machines where
> test_and_set_bit issues an atomic write operation unconditionally.
>
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
Thanks for the patch. Definitely the benchmarks in the 0/2 email show
significant gains for the Kunpeng system, and reducing contention makes sense
as core counts increase and the likely case is that the bit is already set.
That said, I wonder if this should (also/instead) be done in test_and_set_bit()
itself, or in a new test_and_unlikely_set_bit() or test_and_rarely_set_bit()
(or similar) helper optimized for the case where the bit is likely to be set
already.
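To make the idea concrete, here is a minimal userspace sketch of such a
test-before-set helper using C11 atomics. It is an illustration only, not
the kernel bitops API: struct group_state, READ_BIT, and
test_then_test_and_set() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a per-group state word. */
struct group_state {
	atomic_ulong bits;
};

#define READ_BIT 0UL	/* hypothetical bit index, not the kernel's value */

/*
 * Return true if the bit was already set, false if this call set it.
 * The plain load lets the common "already set" case skip the
 * read-modify-write, so the cache line can stay shared between CPUs.
 * This is only safe because the bit is never cleared once set.
 */
static bool test_then_test_and_set(struct group_state *g)
{
	unsigned long mask = 1UL << READ_BIT;

	/* Fast path: relaxed read only, no store and no ordering implied. */
	if (atomic_load_explicit(&g->bits, memory_order_relaxed) & mask)
		return true;

	/* Slow path: atomic read-modify-write, taken at most a few times. */
	return atomic_fetch_or(&g->bits, mask) & mask;
}
```

Note that the fast path gives up the ordering that a successful atomic
read-modify-write would provide, which is one reason a separately named
helper may be preferable to changing test_and_set_bit() for every caller.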
I see that your benchmarks do not give an "apples-to-apples" comparison of
ARM (Kunpeng) vs. AMD on the same storage. The storage hardware and space usage
differ for each test run, and the AMD numbers show only marginal gains, with
more negative than positive results across the thread counts:
> Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
> used), kernel 7.1-rc3. This shows the performance impact on a mostly
> free file system.
> | thr. | base | patched | improv. |
> | | perf | perf | |
> |------|-------|---------|------------|
> | 1 | 30901 | 31191 | +0.9384810 |
> | 2 | 50874 | 50504 | -0.7272870 |
> | 4 | 66068 | 64108 | -2.9666404 |
> | 8 | 63963 | 61927 | -3.1830902 |
> | 16 | 47809 | 47044 | -1.6001171 |
> | 32 | 42441 | 42326 | -0.2709644 |
> | 64 | 39773 | 39929 | +0.3922259 |
> | 128 | 37065 | 36413 | -1.7590719 |
Might the performance reduction be caused by the memory now being accessed
twice, which only adds overhead on the AMD CPU implementation? It would be
useful to see testing on Kunpeng vs. AMD/Intel on the same storage device and
space usage.
That would tell us if it is more appropriate to optimize this in the aarch64
test_and_set_bit() rather than in ext4.
Cheers, Andreas
> ---
> fs/ext4/ext4.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 56b82d4a15d7..f8eacf1375f8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3551,7 +3551,13 @@ struct ext4_group_info {
> #define EXT4_MB_GRP_CLEAR_TRIMMED(grp) \
> (clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
> - (test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> + (ext4_mb_grp_test_and_set_read((grp)))
> +
> +static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
> +{
> + return (test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) ||
> + test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state));
> +}
>
> #define EXT4_MAX_CONTENTION 8
> #define EXT4_CONTENTION_THRESHOLD 2
> --
> 2.43.0
>
Cheers, Andreas
* Re: [PATCH 04/17] nilfs2: replace get_zeroed_page() with kzalloc()
From: Ryusuke Konishi @ 2026-05-27 16:02 UTC (permalink / raw)
To: Mike Rapoport (Microsoft)
Cc: Viacheslav Dubeyko, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
linux-mm
In-Reply-To: <1bb537f6dc36b00788b613fb8f71579478418457.camel@redhat.com>
On Tue, May 26, 2026 at 2:07 AM Viacheslav Dubeyko wrote:
>
> On Sat, 2026-05-23 at 20:54 +0300, Mike Rapoport (Microsoft) wrote:
> > nilfs_ioctl_wrap_copy() allocates a temporary buffer with
> > get_zeroed_page().
> >
> > kzalloc() is a better API for such use and it also provides better
> > scalability and more debugging possibilities.
> >
> > Replace use of get_zeroed_page() with kzalloc().
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > fs/nilfs2/ioctl.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> > index e0a606643e87..b73f2c5d10f0 100644
> > --- a/fs/nilfs2/ioctl.c
> > +++ b/fs/nilfs2/ioctl.c
> > @@ -69,7 +69,7 @@ static int nilfs_ioctl_wrap_copy(struct the_nilfs *nilfs,
> > if (argv->v_index > ~(__u64)0 - argv->v_nmembs)
> > return -EINVAL;
> >
> > - buf = (void *)get_zeroed_page(GFP_NOFS);
> > + buf = kzalloc(PAGE_SIZE, GFP_NOFS);
> > if (unlikely(!buf))
> > return -ENOMEM;
> > maxmembs = PAGE_SIZE / argv->v_size;
> > @@ -107,7 +107,7 @@ static int nilfs_ioctl_wrap_copy(struct the_nilfs *nilfs,
> > }
> > argv->v_nmembs = total;
> >
> > - free_pages((unsigned long)buf, 0);
> > + kfree(buf);
> > return ret;
> > }
> >
>
> Makes sense to me.
>
> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
>
> Thanks,
> Slava.
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
This conversion looks reasonable and won't affect the behavior of the
ioctls that use the modified function.
Thanks,
Ryusuke Konishi
* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 15:58 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-19-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> For append writes, wait for ordered I/O to complete before updating
> i_disksize. This ensures that zeroed data is flushed to disk before the
> metadata update, preventing stale data from being exposed during
> unaligned post-EOF append writes.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/ext4.h | 11 +++++++
> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> fs/ext4/super.c | 23 ++++++++++----
> 4 files changed, 161 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 078feda47e36..9ce2128eea3e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> #ifdef CONFIG_FS_ENCRYPTION
> struct fscrypt_inode_info *i_crypt_info;
> #endif
> +
> + /*
> + * Track ordered zeroed data during post-EOF append writes, fallocate,
> + * and truncate-up operations. These parameters are used only in the
> + * iomap buffered I/O path.
> + */
> + ext4_lblk_t i_ordered_lblk;
> + ext4_lblk_t i_ordered_len;
> + wait_queue_head_t i_ordered_wq;
> };
>
> /*
> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> __u64 len, __u64 *moved_len);
>
> /* page-io.c */
> +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
> +
> extern int __init ext4_init_pageio(void);
> extern void ext4_exit_pageio(void);
> extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e013aeb03d7b..11fb369efeb1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> {
> struct iomap_ioend *ioend = wpc->wb_ctx;
> struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> + ext4_lblk_t start, end, order_lblk, order_len;
>
> /*
> * After I/O completion, a worker needs to be scheduled when:
> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>
> + /*
> + * Mark the I/O as ordered. Ordered I/O requires separate endio
> + * handling and must not be merged with regular I/O operations.
> + */
> + order_len = READ_ONCE(ei->i_ordered_len);
> + if (order_len) {
> + /*
> + * Pair with smp_store_release() in ext4_block_zero_eof().
> + * Ensure we see the updated i_ordered_lblk that was written
> + * before the release store to i_ordered_len.
> + */
> + smp_rmb();
> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> + end = EXT4_B_TO_LBLK(ioend->io_inode,
> + ioend->io_offset + ioend->io_size);
> +
> + if (start <= order_lblk && end >= order_lblk + order_len) {
Hi Zhang,
I guess this check is enough because i_ordered_lblk and i_ordered_len
will always be contained in a single block.
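For illustration only, the containment test in the hunk above can be modelled in userspace C. This is a sketch under assumptions: ioend_covers_ordered() and bytes_to_lblk_up() are hypothetical names, and the round-up conversion is an assumption about how EXT4_B_TO_LBLK() behaves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for EXT4_B_TO_LBLK(): bytes to blocks, rounded up. */
static uint32_t bytes_to_lblk_up(uint64_t bytes, unsigned int blkbits)
{
	uint64_t blocksize = 1ULL << blkbits;

	return (uint32_t)((bytes + blocksize - 1) >> blkbits);
}

/*
 * Block-granular containment check: the ioend start is rounded down to a
 * block and its end is rounded up, then the ordered range
 * [order_lblk, order_lblk + order_len) must lie inside [start, end).
 */
static bool ioend_covers_ordered(uint64_t io_offset, uint64_t io_size,
				 unsigned int blkbits,
				 uint32_t order_lblk, uint32_t order_len)
{
	uint32_t start = (uint32_t)(io_offset >> blkbits);
	uint32_t end = bytes_to_lblk_up(io_offset + io_size, blkbits);

	return start <= order_lblk && end >= order_lblk + order_len;
}
```

With 4k blocks and an ordered range of one block at lblk 1, an ioend covering bytes [4096, 8192) matches, while ioends covering only block 0 or only block 2 do not.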
> + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
FWIU, we want the ordered IO to not be merged and to be submitted asap,
since we want to wake up the waiters. Is there any other reason?
Adding the boundary flag in ->writeback_submit() only affects
iomap_ioend_can_merge(), which happens after we have woken up the waiters
and deferred the IO to the wq. Ideally we want it to affect
iomap_can_add_to_ioend(), i.e. we need to add IOMAP_F_BOUNDARY in
->writeback_range().
Secondly, I don't think boundary is the right flag here. It ensures
that everything before the ordered iomap gets submitted and the ordered
iomap starts a new ioend. That ioend can still keep getting merged with
newer ioends until we decide to submit the IO, which can delay waking
up the waiters. If we really want the "no merge" behavior, we'll have to
do something like [1] (check the 2 NOMERGE flag patches).
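The difference between a "boundary" and a "no merge" flag described above can be shown with a toy model. This is not the iomap implementation: struct toy_ioend, TOY_BOUNDARY, TOY_NOMERGE and both helpers are hypothetical and only mimic the semantics as described in this mail.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BOUNDARY (1u << 0) /* must not be merged into its predecessor */
#define TOY_NOMERGE  (1u << 1) /* hypothetical: never merged in either direction */

struct toy_ioend {
	unsigned long long offset;
	unsigned long long size;
	unsigned int flags;
};

/* May "next" be appended to "prev"? */
static bool toy_can_merge(const struct toy_ioend *prev,
			  const struct toy_ioend *next)
{
	if (next->flags & TOY_BOUNDARY)
		return false;
	if ((prev->flags | next->flags) & TOY_NOMERGE)
		return false;
	return prev->offset + prev->size == next->offset;
}

/* Greedily merge adjacent entries in place; return how many remain. */
static size_t toy_merge_all(struct toy_ioend *v, size_t n)
{
	size_t out = 0;

	for (size_t i = 1; i < n; i++) {
		if (toy_can_merge(&v[out], &v[i]))
			v[out].size += v[i].size;
		else
			v[++out] = v[i];
	}
	return n ? out + 1 : 0;
}
```

In this model, three contiguous ioends with TOY_BOUNDARY on the middle one collapse to two (the middle one still absorbs its successor), whereas TOY_NOMERGE on the middle one keeps all three separate.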
> + }
> + }
> +
> return iomap_ioend_writeback_submit(wpc, error);
> }
>
> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> loff_t from, loff_t end)
> {
> struct address_space *mapping = inode->i_mapping;
> + struct ext4_inode_info *ei = EXT4_I(inode);
> struct folio *folio;
> bool do_submit = false;
> + int ret;
>
> folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> if (IS_ERR(folio))
> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> folio_wait_writeback(folio);
> WARN_ON_ONCE(folio_test_writeback(folio));
>
> - if (likely(folio_test_dirty(folio)))
> + /*
> + * Mark the ordered range. It will be cleared upon I/O completion
> + * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> + * (including append write end io past the zeroed boundary,
> + * truncate up and append fallocate) must wait for this I/O to
> + * complete before updating i_disksize.
> + *
> + * When multiple overlapping unaligned EOF writes are in flight, we
> + * only need to track and wait for the first one. Subsequent writes
> + * will zero the gap in memory and ensure that the zeroed data is
> + * written out along with the valid data in the same block before
> + * i_disksize is updated.
> + */
> + if (likely(folio_test_dirty(folio) &&
> + READ_ONCE(ei->i_ordered_len) == 0)) {
> + WRITE_ONCE(ei->i_ordered_lblk,
> + from >> inode->i_blkbits);
> + /*
> + * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> + * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> + * i_ordered_lblk is visible when i_ordered_len becomes
> + * non-zero.
> + */
> + smp_store_release(&ei->i_ordered_len, 1);
> do_submit = true;
> + }
> folio_unlock(folio);
> folio_put(folio);
>
> /* Submit zeroed block. */
> - if (do_submit)
> - return filemap_fdatawrite_range(mapping, from, end - 1);
> + if (do_submit) {
> + ret = filemap_fdatawrite_range(mapping, from, end - 1);
> + if (ret) {
> + /*
> + * Pairs with wait_event() in
> + * ext4_iomap_wb_ordered_wait(). Ensure
> + * i_ordered_len = 0 is visible before waking up
> + * waiters.
> + */
> + smp_store_release(&ei->i_ordered_len, 0);
> + wake_up_all(&ei->i_ordered_wq);
> + return ret;
> + }
> + }
> return 0;
> }
>
> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> * data=ordered mode. We submit zeroed range directly here.
> * Do not wait for I/O completion for performance.
> *
> - * TODO: Any operation that extends i_disksize (including
> - * append write end io past the zeroed boundary, truncate up,
> - * and append fallocate) must wait for the relevant I/O to
> - * complete before updating i_disksize.
> + * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> + * for I/O completion before updating i_disksize if the write
> + * extends beyond the zeroed boundary.
> + *
> + * TODO: Any other operation that extends i_disksize
> + * (including truncate up and append fallocate) must wait for
> + * the relevant I/O to complete before updating i_disksize.
> */
> } else if (ext4_inode_buffered_iomap(inode)) {
> err = ext4_iomap_submit_zero_block(inode, from, end);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 3050c887329f..ad05ebb49bf6 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
> return 0;
> }
>
> +/*
> + * If the old disk size is not block size aligned and the current
> + * writeback range is entirely beyond the old EOF block, we should
> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> + * written out, otherwise, it may expose stale data in that block.
> + */
> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> + loff_t pos, loff_t end)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + unsigned int blocksize = i_blocksize(inode);
> + loff_t disksize = READ_ONCE(ei->i_disksize);
> + ext4_lblk_t order_lblk, order_len;
> +
> + /*
> + * Waiting for ordered I/O is unnecessary when:
> + * - The on-disk size is block-aligned (no stale data exists).
> + * - The write start is within the block of the old EOF
> + * (overwriting, or appending to a block that already contains
> + * valid data).
> + */
> + if (!(disksize & (blocksize - 1)) ||
> + pos < round_up(disksize, blocksize))
> + return;
> +
> + order_len = READ_ONCE(ei->i_ordered_len);
> + if (!order_len)
> + return;
> +
> + /*
> + * Pair with smp_store_release() in ext4_iomap_end_bio() and
> + * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
> + * that was written before the release store to i_ordered_len.
> + */
> + smp_rmb();
> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> + if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
> + wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
> +}
> +
> static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
> loff_t end)
> {
> @@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
> goto out;
> }
>
> + /* Wait ordered zero data to be written out. */
> + ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
> +
> /* We may need to convert one extent and dirty the inode. */
> credits = ext4_chunk_trans_blocks(inode,
> EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
> @@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
> struct inode *inode = ioend->io_inode;
> struct ext4_inode_info *ei = EXT4_I(inode);
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + unsigned long io_mode = (unsigned long)ioend->io_private;
> unsigned long flags;
>
> + /*
> + * This is an ordered I/O, clear the ordered range set in
> + * ext4_block_zero_eof() and wake up all waiters that will update
> + * the inode i_disksize.
> + */
> + if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
> + /*
> + * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
> + * Ensure i_ordered_len = 0 is visible before waking up
> + * waiters.
> + */
> + smp_store_release(&ei->i_ordered_len, 0);
> + wake_up_all(&ei->i_ordered_wq);
> + goto defer;
> + }
> +
> /* Needs to convert unwritten extents or update the i_disksize. */
> if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
> ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 62bfe05a64bc..9c0a00e716f3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> ext4_fc_init_inode(&ei->vfs_inode);
> spin_lock_init(&ei->i_fc_lock);
> mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
> + ei->i_ordered_lblk = 0;
> + ei->i_ordered_len = 0;
> + init_waitqueue_head(&ei->i_ordered_wq);
> return &ei->vfs_inode;
> }
>
> @@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
> dump_stack();
> }
>
> - if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
> - WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> - ext4_msg(inode->i_sb, KERN_ERR,
> - "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> - inode->i_ino, EXT4_I(inode),
> - EXT4_I(inode)->i_reserved_data_blocks);
> + if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
> + if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> + ext4_msg(inode->i_sb, KERN_ERR,
> + "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> + inode->i_ino, EXT4_I(inode),
> + EXT4_I(inode)->i_reserved_data_blocks);
> +
> + if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
> + ext4_msg(inode->i_sb, KERN_ERR,
> + "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
> + inode->i_ino, EXT4_I(inode),
> + EXT4_I(inode)->i_ordered_lblk,
> + EXT4_I(inode)->i_ordered_len);
> + }
> }
>
> static void ext4_shutdown(struct super_block *sb)
> --
> 2.52.0
>
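As a side note on the i_ordered_lblk / i_ordered_len pairing in the quoted patch: the writer stores the payload first and then publishes it with a release store to the length, and readers check the length before reading the payload. A rough userspace analogue with C11 atomics is sketched below. It uses an acquire load where the patch uses READ_ONCE() plus smp_rmb(), and struct ordered_range and the three helpers are hypothetical names, not ext4 code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ordered_range {
	_Atomic uint32_t lblk; /* payload: first block of the ordered range */
	_Atomic uint32_t len;  /* guard: non-zero means lblk is valid */
};

/* Writer: store the payload, then publish it via the guard. */
static void ordered_publish(struct ordered_range *r, uint32_t lblk, uint32_t len)
{
	atomic_store_explicit(&r->lblk, lblk, memory_order_relaxed);
	atomic_store_explicit(&r->len, len, memory_order_release);
}

/* Reader: only trust lblk after observing a non-zero guard. */
static bool ordered_read(struct ordered_range *r, uint32_t *lblk, uint32_t *len)
{
	uint32_t l = atomic_load_explicit(&r->len, memory_order_acquire);

	if (!l)
		return false;
	*lblk = atomic_load_explicit(&r->lblk, memory_order_relaxed);
	*len = l;
	return true;
}

/* Completion side: clear the guard before waking any waiters. */
static void ordered_clear(struct ordered_range *r)
{
	atomic_store_explicit(&r->len, 0, memory_order_release);
}
```

The sketch leaves out the wait queue; in the patch, clearing the guard is followed by wake_up_all() on i_ordered_wq.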
* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:41 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-18-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> In the generic buffer_head I/O path, we rely on the data=ordered mode to
> ensure that the zeroed EOF block data is written before updating
> i_disksize, thus preventing stale data from being exposed.
>
> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> we issue the I/O immediately after performing the zero operation
> (without synchronous waiting for performance). This can reduce the risk
> of exposing stale data, but it does not guarantee that the zero data
> will be flushed to disk before the metadata of i_disksize is updated.
> The subsequent patches will wait for this I/O to complete before
> updating i_disksize.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
I think we discussed that we may not need to do this [1], but I guess
you've decided to make the tradeoff of issuing the IO to avoid having to
wait for bg flush to complete the tail page zeroing.
However, I think one side effect might be many threads calling the
writeback mechanism to issue zero IOs, which might not scale well. I
don't know if it'll be a huge problem though; I guess it's the sort of
thing we will have to deal with if we see it in real-world
workloads.
[1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> ---
> fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 55 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 239d387ffaf2..e013aeb03d7b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> zero_written);
> }
>
> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> + loff_t from, loff_t end)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct folio *folio;
> + bool do_submit = false;
> +
> + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> + if (IS_ERR(folio))
> + /* Already writeback and clear? */
> + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> +
> + folio_wait_writeback(folio);
> + WARN_ON_ONCE(folio_test_writeback(folio));
> +
> + if (likely(folio_test_dirty(folio)))
> + do_submit = true;
> + folio_unlock(folio);
> + folio_put(folio);
> +
> + /* Submit zeroed block. */
> + if (do_submit)
> + return filemap_fdatawrite_range(mapping, from, end - 1);
> + return 0;
> +}
> +
> /*
> * Zero out a mapping from file offset 'from' up to the end of the block
> * which corresponds to 'from' or to the given 'end' inside this block.
> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> return 0;
>
> - if (length > blocksize - offset)
> + if (length > blocksize - offset) {
> length = blocksize - offset;
> + end = from + length;
> + }
>
> err = ext4_block_zero_range(inode, from, length,
> &did_zero, &zero_written);
> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> * TODO: In the iomap path, handle this by updating i_disksize to
> * i_size after the zeroed data has been written back.
> */
> - if (ext4_should_order_data(inode) &&
> - did_zero && zero_written && !IS_DAX(inode)) {
> - handle_t *handle;
> + if (did_zero && zero_written && !IS_DAX(inode)) {
> + if (ext4_should_order_data(inode)) {
> + handle_t *handle;
>
> - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
>
> - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> - ext4_journal_stop(handle);
> - if (err)
> - return err;
> + err = ext4_jbd2_inode_add_write(handle, inode, from,
> + length);
> + ext4_journal_stop(handle);
> + if (err)
> + return err;
> + /*
> + * inodes using the iomap buffered I/O path do not use the
> + * data=ordered mode. We submit zeroed range directly here.
> + * Do not wait for I/O completion for performance.
> + *
> + * TODO: Any operation that extends i_disksize (including
> + * append write end io past the zeroed boundary, truncate up,
> + * and append fallocate) must wait for the relevant I/O to
> + * complete before updating i_disksize.
> + */
> + } else if (ext4_inode_buffered_iomap(inode)) {
> + err = ext4_iomap_submit_zero_block(inode, from, end);
> + if (err)
> + return err;
> + }
> }
>
> return 0;
> --
> 2.52.0
>
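For reference, the clamping that the quoted ext4_block_zero_eof() hunk performs (limit the zeroed length to the end of the block containing 'from', and now also adjust 'end' to match) can be sketched standalone. struct zero_span and clamp_to_block_end() are hypothetical names, and the sketch assumes the length starts out as end - from.

```c
#include <stdint.h>

struct zero_span {
	int64_t from;   /* first byte to zero */
	int64_t end;    /* one past the last byte to zero */
	int64_t length; /* end - from after clamping */
};

/*
 * Clamp [from, end) so it does not extend past the end of the block
 * that contains "from". blkbits is log2 of the block size.
 */
static struct zero_span clamp_to_block_end(int64_t from, int64_t end,
					   unsigned int blkbits)
{
	int64_t blocksize = (int64_t)1 << blkbits;
	int64_t offset = from & (blocksize - 1);
	struct zero_span s = { from, end, end - from };

	if (s.length > blocksize - offset) {
		s.length = blocksize - offset;
		s.end = from + s.length;
	}
	return s;
}
```

With 4k blocks, from = 5000 and end = 20000 clamp to a length of 3192 and an end of 8192, so the range later passed to the submit helper stays within the zeroed block.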
* Re: [PATCH v4 16/23] ext4: disable online defrag when inode using iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-17-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Online defragmentation does not currently support inodes using the
> iomap buffered I/O path. The existing implementation relies on
> buffer_head for sub-folio block management and data=ordered mode for
> data consistency, both of which are incompatible with the iomap path.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Regards,
Ojaswin
> ---
> fs/ext4/move_extent.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 3329b7ad5dbd..f707a1096544 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
> return -EOPNOTSUPP;
> }
>
> + /*
> + * TODO: support online defrag for inodes that using the buffered
> + * I/O iomap path.
> + */
> + if (ext4_inode_buffered_iomap(orig_inode) ||
> + ext4_inode_buffered_iomap(donor_inode)) {
> + ext4_msg(sb, KERN_ERR,
> + "Online defrag not supported for inode with iomap buffered IO path");
> + return -EOPNOTSUPP;
> + }
> +
> if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
> ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %llu, donor %llu]\n",
> orig_inode->i_ino, donor_inode->i_ino);
> --
> 2.52.0
>
* Re: [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-16-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:35PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Add tracepoints for iomap buffered read, write, partial block zeroing,
> and writeback operations to help debug the iomap buffered I/O path.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Regards,
Ojaswin
> ---
> fs/ext4/inode.c | 6 +++++
> include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e0dae2501292..239d387ffaf2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3961,6 +3961,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> if (ret < 0)
> return ret;
>
> + trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
> + flags);
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> return 0;
> }
> @@ -4034,6 +4036,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> if (ret < 0)
> return ret;
>
> + trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
> + flags);
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> return 0;
> }
> @@ -4136,6 +4140,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
> map.m_len = (start >> blkbits) - map.m_lblk;
> }
>
> + trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> iomap->flags |= iomap_flags;
>
> @@ -4308,6 +4313,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> return ret;
> }
> out:
> + trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
> ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> return 0;
> }
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f493642cf121..ebafa06cd191 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
> __entry->ret)
> );
>
> +DECLARE_EVENT_CLASS(ext4_set_iomap_class,
> + TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
> + loff_t offset, loff_t length, unsigned int flags),
> + TP_ARGS(inode, map, offset, length, flags),
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(u64, ino)
> + __field(ext4_lblk_t, m_lblk)
> + __field(unsigned int, m_len)
> + __field(unsigned int, m_flags)
> + __field(u64, m_seq)
> + __field(loff_t, offset)
> + __field(loff_t, length)
> + __field(unsigned int, iomap_flags)
> + ),
> + TP_fast_assign(
> + __entry->dev = inode->i_sb->s_dev;
> + __entry->ino = inode->i_ino;
> + __entry->m_lblk = map->m_lblk;
> + __entry->m_len = map->m_len;
> + __entry->m_flags = map->m_flags;
> + __entry->m_seq = map->m_seq;
> + __entry->offset = offset;
> + __entry->length = length;
> + __entry->iomap_flags = flags;
> +
> + ),
> + TP_printk("dev %d:%d ino %llu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->ino, __entry->m_lblk, __entry->m_len,
> + show_mflags(__entry->m_flags), __entry->m_seq,
> + __entry->offset, __entry->length, __entry->iomap_flags)
> +)
> +
> +#define DEFINE_SET_IOMAP_EVENT(name) \
> +DEFINE_EVENT(ext4_set_iomap_class, name, \
> + TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
> + loff_t offset, loff_t length, unsigned int flags), \
> + TP_ARGS(inode, map, offset, length, flags))
> +
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
> +
> #endif /* _TRACE_EXT4_H */
>
> /* This part must be outside protection */
> --
> 2.52.0
>
* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Ojaswin Mujoo @ 2026-05-27 13:13 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <20260511072344.191271-15-yi.zhang@huaweicloud.com>
On Mon, May 11, 2026 at 03:23:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> infrastructure for ext4.
>
> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> out either a mapped partial block or a dirty, unwritten partial block.
>
> Important constraints:
>
> Zeroing out under an active journal handle can cause deadlock, because
> the order of acquiring the folio lock and starting a handle is
> inconsistent with the iomap writeback path.
>
> Therefore, ext4_iomap_block_zero_range():
> - Must NOT be called under an active handle.
> - Cannot rely on data=ordered mode to ensure zeroed data persistence
> before updating i_disksize (for the cases of post-EOF append write,
> post-EOF fallocate, and truncate up). In subsequent patches, we will
> address this by issuing the I/O immediately without waiting for its
> completion, and updating i_disksize to i_size only after the zeroed
> data has been written back.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good in itself. Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Regards,
Ojaswin
> ---
> fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 92 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6fe42d012fc..e0dae2501292 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> return 0;
> }
>
> +static int ext4_iomap_zero_begin(struct inode *inode,
> + loff_t offset, loff_t length, unsigned int flags,
> + struct iomap *iomap, struct iomap *srcmap)
> +{
> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> + unsigned int iomap_flags = 0;
> + int ret;
> +
> + ret = ext4_emergency_state(inode->i_sb);
> + if (unlikely(ret))
> + return ret;
> +
> + if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> + return -EINVAL;
> +
> + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> + if (ret < 0)
> + return ret;
> +
> + /*
> + * Look up dirty folios for unwritten mappings within EOF. Providing
> + * this bypasses the flush iomap uses to trigger extent conversion
> + * when unwritten mappings have dirty pagecache in need of zeroing.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + loff_t start = ((loff_t)map.m_lblk) << blkbits;
> + loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> +
> + iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> + if ((start >> blkbits) < map.m_lblk + map.m_len)
> + map.m_len = (start >> blkbits) - map.m_lblk;
> + }
> +
> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> + iomap->flags |= iomap_flags;
> +
> + return 0;
> +}
> +
> +static const struct iomap_ops ext4_iomap_zero_ops = {
> + .iomap_begin = ext4_iomap_zero_begin,
> +};
> +
> /*
> * Since we always allocate unwritten extents, there is no need for
> * iomap_end to clean up allocated blocks on a short write.
> @@ -4616,6 +4661,47 @@ static int ext4_block_journalled_zero_range(struct inode *inode, loff_t from,
> return err;
> }
>
> +static int ext4_block_iomap_zero_range(struct inode *inode, loff_t from,
> + loff_t length, bool *did_zero,
> + bool *zero_written)
> +{
> + int ret;
> +
> + /*
> + * Zeroing out under an active handle can cause deadlock since
> + * the order of acquiring the folio lock and starting a handle is
> + * inconsistent with the iomap writeback procedure.
> + */
> + if (WARN_ON_ONCE(ext4_handle_valid(journal_current_handle())))
> + return -EINVAL;
> +
> + /* The zeroing scope should not extend across a block. */
> + if (WARN_ON_ONCE((from >> inode->i_blkbits) !=
> + ((from + length - 1) >> inode->i_blkbits)))
> + return -EINVAL;
> +
> + if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS) &&
> + !(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
> + WARN_ON_ONCE(!inode_is_locked(inode) &&
> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> +
> + ret = iomap_zero_range(inode, from, length, did_zero,
> + &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> + NULL);
> + if (ret)
> + return ret;
> +
> + /*
> + * TODO: The iomap does not distinguish between different types of
> + * zeroing and always sets zero_written if a zeroing operation is
> + * performed, which may result in unnecessary order operations.
> + */
> + if (did_zero && zero_written)
> + *zero_written = *did_zero;
> +
> + return 0;
> +}
> +
> /*
> * Zeros out a mapping of length 'length' starting from file offset
> * 'from'. The range to be zero'd must be contained with in one block.
> @@ -4642,6 +4728,9 @@ static int ext4_block_zero_range(struct inode *inode,
> } else if (ext4_should_journal_data(inode)) {
> return ext4_block_journalled_zero_range(inode, from, length,
> did_zero);
> + } else if (ext4_inode_buffered_iomap(inode)) {
> + return ext4_block_iomap_zero_range(inode, from, length,
> + did_zero, zero_written);
> }
> return ext4_block_do_zero_range(inode, from, length, did_zero,
> zero_written);
> @@ -4682,6 +4771,9 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> * truncating up or performing an append write, because there might be
> * exposing stale on-disk data which may caused by concurrent post-EOF
> * mmap write during folio writeback.
> + *
> + * TODO: In the iomap path, handle this by updating i_disksize to
> + * i_size after the zeroed data has been written back.
> */
> if (ext4_should_order_data(inode) &&
> did_zero && zero_written && !IS_DAX(inode)) {
> --
> 2.52.0
>