Linux EXT4 FS development
* [PATCH v2 3/8] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-18 12:57 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.
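
The false negative can be illustrated with a toy userspace model. This is
only a sketch: the extent table, the toy_* function names, and the enum are
hypothetical stand-ins, not the kernel's ext4_map_blocks() or
ext4_overwrite_io(). The single-lookup check asks for one mapping and
compares its length with the request, while the iterating check walks
extent by extent and fails only when it reaches a hole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy extent states; illustrative names, not the kernel's EXT4_MAP_* flags. */
enum toy_state { TOY_HOLE, TOY_WRITTEN, TOY_UNWRITTEN };

struct toy_extent {
	unsigned long start;	/* first logical block */
	unsigned long len;	/* length in blocks */
	enum toy_state state;
};

/*
 * Model of one mapping lookup: returns how many of the 'want' blocks
 * starting at 'blk' lie inside the single extent that covers 'blk'.
 * Returns 0 and reports TOY_HOLE if no extent covers 'blk'.
 */
unsigned long toy_map_blocks(const struct toy_extent *ext, size_t n,
			     unsigned long blk, unsigned long want,
			     enum toy_state *state)
{
	for (size_t i = 0; i < n; i++) {
		if (blk >= ext[i].start && blk < ext[i].start + ext[i].len) {
			unsigned long avail = ext[i].start + ext[i].len - blk;

			*state = ext[i].state;
			return avail < want ? avail : want;
		}
	}
	*state = TOY_HOLE;
	return 0;
}

/* Single-lookup check: true only if one extent covers the whole range. */
bool toy_overwrite_single(const struct toy_extent *ext, size_t n,
			  unsigned long blk, unsigned long count)
{
	enum toy_state st;

	return toy_map_blocks(ext, n, blk, count, &st) == count;
}

/* Iterating check: true if the range is fully allocated, in any extents. */
bool toy_overwrite_iter(const struct toy_extent *ext, size_t n,
			unsigned long blk, unsigned long count)
{
	while (count > 0) {
		enum toy_state st;
		unsigned long got = toy_map_blocks(ext, n, blk, count, &st);

		if (got == 0)
			return false;	/* hit a hole */
		blk += got;
		count -= got;
	}
	return true;
}
```

For a two-block write over a written block followed by an unwritten block,
the single-lookup check reports "not an overwrite" although every block is
already allocated, which is the case described above.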

After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.

Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:

 - truncate/punch_hole are kept out: they require exclusive i_rwsem
   (blocked by our shared lock during allocation), and inode_dio_begin()
   keeps their inode_dio_wait() blocked until in-flight bios complete.
 - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
   extent tree modifications (parallel writers to the same hole).
 - The journal handle is per-thread and does not require i_rwsem
   exclusion.
 - i_disksize and orphan list are not involved in non-extending writes.

Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
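
As a rough illustration of the change in the upgrade condition, the two
predicates can be written as plain boolean functions. This is a sketch
under stated assumptions: struct toy_dio_write and both function names are
hypothetical, and the 'overwrite' and 'unwritten' fields stand for what
ext4_overwrite_io() would report for the range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one DIO write; not a kernel structure. */
struct toy_dio_write {
	bool nosec;	/* models IS_NOSEC(inode): no security update needed */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block aligned */
	bool overwrite;	/* overwrite check reported a pure overwrite */
	bool unwritten;	/* overwrite check saw an unwritten extent */
};

/* Upgrade condition before the patch: any non-overwrite goes exclusive. */
bool toy_needs_exclusive_before(const struct toy_dio_write *w)
{
	return !w->nosec || w->extend || !w->overwrite ||
	       (w->unaligned && w->unwritten);
}

/* Upgrade condition after the patch: extent state matters only if unaligned. */
bool toy_needs_exclusive_after(const struct toy_dio_write *w)
{
	return !w->nosec || w->extend ||
	       (w->unaligned && (!w->overwrite || w->unwritten));
}
```

In this model an aligned, non-extending write whose overwrite check fails
needs the exclusive lock before the patch but not after it, while unaligned
writes are treated the same by both predicates.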

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      42,322       43,329      1.02x
     2      68,516       70,677      1.03x
     4      62,489       97,072      1.55x
     8      58,701      110,819      1.89x
    16      58,569      116,392      1.99x
    32      60,860      117,244      1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9f9bc0b13772..886b73247aab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -434,16 +434,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * condition requires an exclusive inode lock. If yes, then we restart the
  * whole operation by releasing the shared lock and acquiring exclusive lock.
  *
- * - For unaligned_io we never take shared lock as it may cause data corruption
- *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
  *
- * - For extending writes case we don't take the shared lock, since it requires
- *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ *    to the exclusive lock -- the security update itself requires it,
+ *    regardless of whether the write extends the file or is aligned.
  *
- * - shared locking will only be true mostly with overwrites, including
- *   initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ *    lock to safely update i_disksize and the orphan list, regardless of
+ *    alignment.
  *
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ *    sufficient regardless of extent state (written, unwritten, or hole).
+ *    truncate/punch_hole cannot run while we hold the shared i_rwsem
+ *    (they need it exclusively); after we release it, inode_dio_begin()
+ *    keeps their inode_dio_wait() blocked until in-flight bios complete.
+ *    i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ *    only safe for pure written-extent overwrites. Unwritten extents or
+ *    holes require exclusive lock because concurrent partial block zeroing
+ *    in the DIO layer could corrupt data.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -454,7 +465,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite, unaligned_io, unwritten;
+	bool overwrite = true, unaligned_io, unwritten = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -466,22 +477,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 
 	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
-	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
 
 	/*
-	 * Determine whether we need to upgrade to an exclusive lock. This is
-	 * required to change security info in file_modified(), for extending
-	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
-	 * extents (as partial block zeroing may be required).
-	 *
-	 * Note that unaligned writes are allowed under shared lock so long as
-	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
-	 * data corruption due to partial block zeroing in the dio layer, and so
-	 * the I/O must occur exclusively.
+	 * For unaligned writes we need to know the extent state to determine
+	 * whether shared lock is safe. For aligned writes we skip this check
+	 * entirely since allocation under shared lock is safe.
 	 */
+	if (unaligned_io)
+		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
-	     (unaligned_io && unwritten)))) {
+	    ((!IS_NOSEC(inode) || *extend ||
+	     (unaligned_io && (!overwrite || unwritten))))) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -496,8 +504,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
 	 * behavior already. The inode lock is already held exclusive if the
-	 * write is non-overwrite or extending, so drain all outstanding dio and
-	 * set the force wait dio flag.
+	 * write is unaligned non-overwrite or extending, so drain all
+	 * outstanding dio and set the force wait dio flag.
 	 */
 	if (!*ilock_shared && (unaligned_io || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.43.7



* [PATCH v2 2/8] ext4: drain in-flight DIO before buffered write fallback
From: Baokun Li @ 2026-06-18 12:57 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

generic/746 started failing intermittently on ext3 (no-extent inodes).
The test triggers 'Page cache invalidation failure on direct I/O'
warnings and subsequent fsync returns -EIO. Adding a 50ms delay
between ext4_buffered_write_iter() and filemap_write_and_wait_range()
in ext4_dio_write_iter() makes the race almost always reproducible.

On no-extent inodes, DIO writes to holes cannot use unwritten extents,
so ext4_iomap_alloc() leaves m_flags=0 and ext4_map_blocks() returns 0.
The iomap layer then returns -ENOTBLK, causing fallback to buffered I/O.

The fallback path in ext4_dio_write_iter() calls
ext4_buffered_write_iter() which dirties pages, then does flush and
invalidate. However, there's an unprotected window between
ext4_buffered_write_iter() returning (with inode lock released) and
the subsequent flush+invalidate.

Concurrent async DIO completions from other threads can run
kiocb_invalidate_post_direct_write() during this window. If pages have
been re-dirtied, post-invalidation finds dirty pages and triggers the
warning, setting -EIO in the error sequence.

Consider a file with two 4k blocks: [hole][written]. Thread A does
DIO to the written block, while thread B does DIO spanning both:

  kworker A (4k DIO, allocated block)    kworker B (8k DIO, fallback)
  -----------------------------------    ----------------------------
  inode_lock_shared()                    inode_lock_shared()
  iomap_dio_rw():                        iomap_dio_rw():
    kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
    submit_bio (async)                     dio->size = 0
  inode_unlock_shared()                  inode_unlock_shared()

  [bio pending in block layer]           /* fallback: lock released */
                                         ext4_buffered_write_iter()
                                           inode_lock(exclusive)
                                           generic_perform_write()
                                             -> dirty pages [0, 8k]
                                           inode_unlock(exclusive)

                                         /* pages dirty, no lock */
  [bio completes]                        filemap_write_and_wait_range()
  iomap_dio_complete()                     -> flush dirty pages
    kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
      invalidate_inode_pages2_range()
      -> finds dirty page!
      -> dio_warn_stale_pagecache()
      -> errseq_set(-EIO)

This issue can be triggered through normal I/O paths, not just
intentionally overlapping DIO writes from userspace. For example,
generic/746 uses a loop device where multiple kworkers issue concurrent
I/O to the backing file. Additionally, when block_size < folio_size,
non-overlapping DIO writes that share a large folio can also trigger
the race.

Add inode_dio_wait() in ext4_buffered_write_iter() before
generic_perform_write() to drain all in-flight DIO. With this, every
DIO write invalidates existing page cache before submitting I/O (via
kiocb_invalidate_pages()), and every buffered write waits for all
in-flight DIO to complete (via inode_dio_wait()) before dirtying
pages, which eliminates the race.
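
The ordering argument can be replayed in a small single-threaded model. It
is only a sketch: struct toy_inode and the toy_* helpers are hypothetical,
the in-flight counter stands in for the inode's DIO accounting, and
"draining" is modeled by running the pending completion early rather than
by actually blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode state relevant to the race. */
struct toy_inode {
	int dio_inflight;	/* models the count of in-flight DIO */
	bool page_dirty;	/* models a dirty page-cache page in the range */
	bool stale_warning;	/* models dio_warn_stale_pagecache() firing */
};

/* DIO submission: the page cache is clean here and a bio goes in flight. */
void toy_dio_submit(struct toy_inode *ino)
{
	ino->dio_inflight++;
}

/* DIO completion: post-write invalidation fails if it finds a dirty page. */
void toy_dio_complete(struct toy_inode *ino)
{
	if (ino->page_dirty)
		ino->stale_warning = true;
	ino->dio_inflight--;
}

/*
 * Buffered fallback dirtying pages. With 'drain' set, all pending DIO is
 * completed first, standing in for blocking in inode_dio_wait().
 */
void toy_buffered_write(struct toy_inode *ino, bool drain)
{
	if (drain)
		while (ino->dio_inflight > 0)
			toy_dio_complete(ino);
	ino->page_dirty = true;
}

/* Flush after the buffered write: pages become clean again. */
void toy_flush(struct toy_inode *ino)
{
	ino->page_dirty = false;
}

/* Replays the worst-case ordering; returns true if the warning fired. */
bool toy_race(bool drain)
{
	struct toy_inode ino = { 0, false, false };

	toy_dio_submit(&ino);			/* thread A: async DIO */
	toy_buffered_write(&ino, drain);	/* thread B: fallback */
	if (ino.dio_inflight > 0)
		toy_dio_complete(&ino);		/* A completes in the window */
	toy_flush(&ino);
	return ino.stale_warning;
}
```

Without the drain step the completion lands between dirtying and flushing
and trips the warning; with it, the completion has already run on clean
pages before the buffered write dirties anything.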

Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..9f9bc0b13772 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -313,6 +313,12 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (ret <= 0)
 		goto out;
 
+	/*
+	 * Prevent concurrent DIO and BIO to the same file range.
+	 * Wait for all in-flight DIO to complete before dirtying pages.
+	 */
+	inode_dio_wait(inode);
+
 	ret = generic_perform_write(iocb, from);
 
 out:
-- 
2.43.7



* Re: [syzbot] [overlayfs?] possible deadlock in ovl_copy_up_start (5)
From: Amir Goldstein @ 2026-06-18 10:47 UTC
  To: syzbot; +Cc: linux-kernel, linux-unionfs, miklos, syzkaller-bugs, Ext4
In-Reply-To: <6a330d0e.6d5abbec.a50f.0020.GAE@google.com>

On Wed, Jun 17, 2026 at 11:09 PM syzbot
<syzbot+d4e3aadc5bd19f7e71ff@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    596d152bc5e3 Merge branch 'for-next/core' into for-kernelci
> git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
> console output: https://syzkaller.appspot.com/x/log.txt?x=10aba8ae580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=eb40ba923a822433
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4e3aadc5bd19f7e71ff
> compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> userspace arch: arm64
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/32d15acc8c01/disk-596d152b.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/83b3d8d84761/vmlinux-596d152b.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/8edfcc3bf911/Image-596d152b.gz.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+d4e3aadc5bd19f7e71ff@syzkaller.appspotmail.com
>
> ======================================================
> WARNING: possible circular locking dependency detected
> syzkaller #0 Tainted: G             L
> ------------------------------------------------------
> syz.3.611/8517 is trying to acquire lock:
> ffff0000eee140f8 (&ovl_i_lock_key[depth]){+.+.}-{4:4}, at: ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
> ffff0000eee140f8 (&ovl_i_lock_key[depth]){+.+.}-{4:4}, at: ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>
> but task is already holding lock:
> ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
> ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: lock_two_nondirectories+0xe8/0x148 fs/inode.c:1256
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}:
>        __lock_release kernel/locking/lockdep.c:5574 [inline]
>        lock_release+0x178/0x3b0 kernel/locking/lockdep.c:5889
>        up_write+0x3c/0x5d8 kernel/locking/rwsem.c:1681
>        inode_unlock include/linux/fs.h:1039 [inline]
>        unlock_two_nondirectories+0x60/0x118 fs/inode.c:1269
>        ext4_move_extents+0x468/0x3580 fs/ext4/move_extent.c:656
>        __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
>        ext4_ioctl+0x2a14/0x4234 fs/ext4/ioctl.c:1922
>        vfs_ioctl fs/ioctl.c:51 [inline]
>        __do_sys_ioctl fs/ioctl.c:597 [inline]
>        __se_sys_ioctl fs/ioctl.c:583 [inline]
>        __arm64_sys_ioctl+0x14c/0x1c4 fs/ioctl.c:583
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> -> #1 (sb_writers#3){.+.+}-{0:0}:
>        percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
>        percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
>        __sb_start_write include/linux/fs/super.h:19 [inline]
>        sb_start_write include/linux/fs/super.h:125 [inline]
>        ovl_start_write+0xf0/0x324 fs/overlayfs/util.c:32
>        ovl_do_copy_up fs/overlayfs/copy_up.c:977 [inline]
>        ovl_copy_up_one fs/overlayfs/copy_up.c:1189 [inline]
>        ovl_copy_up_flags+0x980/0x28a4 fs/overlayfs/copy_up.c:1243
>        ovl_maybe_copy_up+0x108/0x148 fs/overlayfs/copy_up.c:1272
>        ovl_open+0x12c/0x2c0 fs/overlayfs/file.c:211
>        do_dentry_open+0x5c4/0xfc8 fs/open.c:947
>        vfs_open+0x44/0x2d4 fs/open.c:1079
>        do_open fs/namei.c:4699 [inline]
>        path_openat+0x2234/0x2a6c fs/namei.c:4858
>        do_file_open+0x1c4/0x2e4 fs/namei.c:4887
>        do_sys_openat2+0x114/0x1e8 fs/open.c:1364
>        do_sys_open+0xac/0xdc fs/open.c:1370
>        __do_sys_openat fs/open.c:1386 [inline]
>        __se_sys_openat fs/open.c:1381 [inline]
>        __arm64_sys_openat+0x9c/0xb8 fs/open.c:1381
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> -> #0 (&ovl_i_lock_key[depth]){+.+.}-{4:4}:
>        check_prev_add kernel/locking/lockdep.c:3165 [inline]
>        check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>        validate_chain kernel/locking/lockdep.c:3908 [inline]
>        __lock_acquire+0x1780/0x2f44 kernel/locking/lockdep.c:5237
>        lock_acquire+0x140/0x368 kernel/locking/lockdep.c:5868
>        __mutex_lock_common kernel/locking/mutex.c:646 [inline]
>        __mutex_lock+0x160/0xef8 kernel/locking/mutex.c:820
>        mutex_lock_interruptible_nested+0x24/0x30 kernel/locking/mutex.c:898
>        ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
>        ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>        ovl_copy_up_one fs/overlayfs/copy_up.c:1182 [inline]
>        ovl_copy_up_flags+0x768/0x28a4 fs/overlayfs/copy_up.c:1243
>        ovl_copy_up+0x24/0x34 fs/overlayfs/copy_up.c:1282
>        ovl_rename_start fs/overlayfs/dir.c:1176 [inline]
>        ovl_rename+0x2d8/0xfec fs/overlayfs/dir.c:1363
>        vfs_rename+0xa78/0xd48 fs/namei.c:6064
>        filename_renameat2+0x66c/0x730 fs/namei.c:6182
>        __do_sys_renameat2 fs/namei.c:6211 [inline]
>        __se_sys_renameat2 fs/namei.c:6206 [inline]
>        __arm64_sys_renameat2+0xe4/0x114 fs/namei.c:6206
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> other info that might help us debug this:
>
> Chain exists of:
>   &ovl_i_lock_key[depth] --> sb_writers#3 --> &ovl_i_mutex_key[depth]/4
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&ovl_i_mutex_key[depth]/4);
>                                lock(sb_writers#3);
>                                lock(&ovl_i_mutex_key[depth]/4);
>   lock(&ovl_i_lock_key[depth]);
>
>  *** DEADLOCK ***
>
> 5 locks held by syz.3.611/8517:
>  #0: ffff00010bddc410 (sb_writers#17){.+.+}-{0:0}, at: mnt_want_write+0x44/0x9c fs/namespace.c:493
>  #1: ffff00010bddc718 (&type->s_vfs_rename_key#3){+.+.}-{4:4}, at: lock_rename fs/namei.c:3791 [inline]
>  #1: ffff00010bddc718 (&type->s_vfs_rename_key#3){+.+.}-{4:4}, at: __start_renaming+0xec/0x33c fs/namei.c:3880
>  #2: ffff0000eee14300 (&ovl_i_mutex_dir_key[depth]/1){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #2: ffff0000eee14300 (&ovl_i_mutex_dir_key[depth]/1){+.+.}-{4:4}, at: lock_two_directories+0x19c/0x214 fs/namei.c:3767
>  #3: ffff0000eee13810 (&ovl_i_mutex_dir_key[depth]/5){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #3: ffff0000eee13810 (&ovl_i_mutex_dir_key[depth]/5){+.+.}-{4:4}, at: lock_two_directories+0x1c4/0x214 fs/namei.c:3768
>  #4: ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #4: ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: lock_two_nondirectories+0xe8/0x148 fs/inode.c:1256
>
> stack backtrace:
> CPU: 1 UID: 0 PID: 8517 Comm: syz.3.611 Tainted: G             L      syzkaller #0 PREEMPT
> Tainted: [L]=SOFTLOCKUP
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
> Call trace:
>  show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:499 (C)
>  __dump_stack+0x30/0x40 lib/dump_stack.c:94
>  dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
>  dump_stack+0x1c/0x28 lib/dump_stack.c:129
>  print_circular_bug+0x328/0x330 kernel/locking/lockdep.c:2043
>  check_noncircular+0x158/0x174 kernel/locking/lockdep.c:2175
>  check_prev_add kernel/locking/lockdep.c:3165 [inline]
>  check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>  validate_chain kernel/locking/lockdep.c:3908 [inline]
>  __lock_acquire+0x1780/0x2f44 kernel/locking/lockdep.c:5237
>  lock_acquire+0x140/0x368 kernel/locking/lockdep.c:5868
>  __mutex_lock_common kernel/locking/mutex.c:646 [inline]
>  __mutex_lock+0x160/0xef8 kernel/locking/mutex.c:820
>  mutex_lock_interruptible_nested+0x24/0x30 kernel/locking/mutex.c:898
>  ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
>  ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>  ovl_copy_up_one fs/overlayfs/copy_up.c:1182 [inline]
>  ovl_copy_up_flags+0x768/0x28a4 fs/overlayfs/copy_up.c:1243
>  ovl_copy_up+0x24/0x34 fs/overlayfs/copy_up.c:1282
>  ovl_rename_start fs/overlayfs/dir.c:1176 [inline]
>  ovl_rename+0x2d8/0xfec fs/overlayfs/dir.c:1363
>  vfs_rename+0xa78/0xd48 fs/namei.c:6064
>  filename_renameat2+0x66c/0x730 fs/namei.c:6182
>  __do_sys_renameat2 fs/namei.c:6211 [inline]
>  __se_sys_renameat2 fs/namei.c:6206 [inline]
>  __arm64_sys_renameat2+0xe4/0x114 fs/namei.c:6206
>  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>  invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>  el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>  do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>  el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>  el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>  el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
>
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
>
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report

#syz dup: possible deadlock in lock_two_nondirectories (2)

ext4 bug should be fixed in linux-next I think.

Thanks,
Amir.


* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-18  9:49 UTC
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, tytso, adilger.kernel, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <wbgohsiks4355iejpa2xhvjdgyd4hfpg2gjcpgrv2wcbnevwao@mhik7uhkzzhx>

On 2026/6/17 19:08, Jan Kara wrote:
> On Wed 17-06-26 15:52:24, Baokun Li wrote:
>> On 2026/6/17 10:45, Zhang Yi wrote:
>>> On 6/16/2026 9:10 PM, Baokun Li wrote:
>>>> Thank you for your review!
>>>>
>>>> After extensive testing, I found that after merging this patch,
>>>> generic/746 started failing intermittently on ext3 (mkfs.ext4 -O
>>>> ^extents).  The test triggers a "Page cache invalidation failure on
>>>> direct I/O" warning, and subsequent fsync returns -EIO.
>>>>
>>>> The underlying race existed before this patch, but this patch appears to
>>>> have widened the reproduction window considerably, so I thought it worth
>>>> trying to address.  Here is my analysis:
>>>>
>>>> On no-extent inodes, DIO writes that hit holes cannot use unwritten
>>>> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
>>>> returns 0 for a hole, and:
>>>>
>>>>          if (!m_flags && !ret)
>>>>                  ret = -ENOTBLK;
>>>>
>>>> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
>>>> I/O.  The fallback path dirties pages in the page cache, then flushes
>>>> and invalidates them.  However, concurrent async DIO completions to
>>>> other blocks on the same inode can run
>>>> kiocb_invalidate_post_direct_write() without holding the inode lock.
>>>>
>>>> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
>>>> to the written extent, while thread B does DIO spanning both extents:
>>>>
>>>>    kworker A (4k DIO, allocated block)    kworker B (8k DIO, hole->fallback)
>>>>    -----------------------------------    -----------------------------------
>>>>    inode_lock_shared()                    inode_lock_shared()
>>>>    iomap_dio_rw():                        iomap_dio_rw():
>>>>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>>>>      submit_bio (async)                     dio->size = 0
>>>>    inode_unlock_shared()                  inode_unlock_shared()
>>>>
>>>>    [bio pending in block layer]           /* fallback: inode lock released */
>>>>                                           ext4_buffered_write_iter()
>>>>                                             inode_lock(exclusive)
>>>>                                             generic_perform_write()
>>>>                                               -> dirty pages [0, 8k]
>>>>                                             inode_unlock(exclusive)
>>>>
>>>>                                           /* pages still dirty here */
>>>>    [bio completes]                        filemap_write_and_wait_range()
>>>>    iomap_dio_complete()                     -> flush dirty pages
>>>>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>>>>        invalidate_inode_pages2_range()
>>>>        -> finds dirty page!               /* window closed */
>>>>        -> dio_warn_stale_pagecache()
>>>>        -> errseq_set(-EIO)
>>>>
>>> It looks like this issue occurs when invalidate_inode_pages2_range()
>>> checks beyond the DIO write range, which may only happen when folio size
>>> is larger than block size. Is that correct?
>> Thanks for looking at this!
>>
>> Not quite — the scenario involves an 8k file with layout
>>
>>  [hole at 0-4k] [written extent at 4k-8k]
>>
>> and two DIO threads. Thread A does a 4k DIO write at offset 4k; since
>> the target block is a written extent, no fallback occurs. Thread B
>> does an 8k DIO write at offset 0; since blocks 0-4k are a hole on an
>> indirect-block inode and ext3 does not support unwritten extents,
>> iomap returns -ENOTBLK and the entire 8k write falls back to buffered
>> I/O.
> Right, but for this to happen userspace had to submit two overlapping
> direct IO writes. This always had undefined behavior so some inconsistent
> content in the file is more or less acceptable. But as Zhang pointed out,
> the same failure can also appear when block_size < folio_size and there we
> should really strive to provide consistent data.
Agreed, overlapping DIO is UB from userspace's perspective. However,
neither the loop device usage in generic/746 nor the BS < PS scenario
that Yi mentioned involves userspace intentionally submitting
overlapping DIO writes. Both are triggered through normal I/O paths,
so the issue still needs to be addressed.
>
>>>> The critical window is the gap between ext4_buffered_write_iter()
>>>> dirtying pages and filemap_write_and_wait_range() flushing them.  In
>>>> this window the inode lock is not held, so another thread's async DIO
>>>> completion is free to invalidate the still-dirty pages in the page
>>>> cache.
>>>>
>>>> This race has always existed on ext3 because indirect-block inodes lack
>>>> unwritten-extent support.  However, the window was extremely narrow in
>>>> practice, because the old ext4_overwrite_io() checked every block and
>>>> would conservatively take an exclusive lock.  This patch replaced it
>>>> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
>>>> making unaligned DIO more likely to take a shared lock and
>>>> proportionally increasing the chance of hitting the race.
>>>>
>>>> I tried a couple of alternatives before settling on the patch below:
>>>>
>>>> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
>>>>     This closes the window for new DIO submissions, but does not protect
>>>>     against bio completions from previously submitted async DIO, which
>>>>     run independently of the inode lock.
>>>>
>>>> 2. Wrap the fallback dirty+flush+invalidate sequence in
>>>>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
>>>>     do not use this lock, so it would not serialise against DIO
>>>>     completions.
>>>>
>>> Could we add a call to inode_dio_wait() before falling back to buffered
>>> I/O? That is, in thread B, when falling back to buffered I/O, could we
>>> acquire the exclusive inode lock and then call inode_dio_wait() to wait
>>> for in-flight DIO to complete? This should close the race window. Since
>>> scenarios where DIO writes to holes on ext3 are relatively rare, the
>>> performance impact should be minimal (I suppose).
>>>
>> That's a great idea, thank you!
>>
>> I had been trying to fix this on the DIO side and didn't consider
>> waiting from the buffered fallback path.
>>
>> I've tested the approach locally and it closes the race; I'll add a
>> patch using it in the next version.
> Yes, this looks like the best solution so far. The fallback doesn't have to
> be fast. It was always - you are doing something stupid and we try to fixup
> for you - kind of thing and bad performance is acceptable in that case.
Agreed.
>>>> One straightforward approach that seems correct is to skip direct I/O
>>>> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
>>>>   {
>>>>          if (fsverity_active(inode))
>>>>                  return 0;
>>>> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>>> +               return 0;
>>>>          if (ext4_should_journal_data(inode))
>>>>                  return 0;
>>>>          if (ext4_has_inline_data(inode))
>>>>
>>>> With this, ext4_should_use_dio() returns false for no-extent inodes, and
>>>> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
>>>> DIO path entirely.  On ext3, DIO to a hole already falls back to
>>>> buffered I/O, so there is essentially no performance benefit to using
>>>> DIO in the first place.
>>>>
>>>> Note that with this change, the fallback branch in
>>>> ext4_dio_write_iter():
>>>>
>>>>          if (ret >= 0 && iov_iter_count(from)) {
>>>>                  /* buffered fallback */
>>>>          }
>>>>
>>>> would also become dead code for extent-based inodes (since unwritten
>>>> extents guarantee iomap_dio_rw() never returns zero with unconsumed
>>>> data), and could be removed in a follow-up cleanup.
>>>>
>>>> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
>>>> I'm missing?
>>>>
>>> Hmm, this would also cause DIO to fall back to buffered I/O in common
>>> extending write cases, which I think would be unacceptable.
>> Fair point, the regression on extending writes is hard to justify.  That
>> said, until we had a better fix, I'd argue a behavioural change was
>> still preferable to potential data corruption. With the inode_dio_wait()
>> approach above, this trade-off goes away. 
> But heavily regressing performance for overwrites or extending DIO writes
> even on indirect block based files is not really acceptable. There are
> still users who for whatever reasons stay with old filesystems having
> indirect block based files and they'd likely notice the regression.
>
> 								Honza

Agreed. Significantly degrading performance for common workloads
(overwrites, extending writes) to handle a rare race condition is
unacceptable for production systems. The current fix with
inode_dio_wait() only impacts the fallback path (DIO → buffered I/O),
which is already an exceptional and slow path, so the performance
impact there is acceptable.


Thanks,
Baokun



* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Jan Kara @ 2026-06-18  7:21 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: Jan Kara, tytso, adilger.kernel, libaokun, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <ee05e46d-6ec8-4291-a61e-213812934da3@windriver.com>

On Thu 18-06-26 09:02:10, Zhou, Yun wrote:
> 
> 
> On 6/18/26 04:25, Jan Kara wrote:
> > On Mon 15-06-26 13:35:12, Yun Zhou wrote:
> > 
> > --- a/fs/ext4/xattr.c
> > +++ b/fs/ext4/xattr.c
> > @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> >                inode_unlock(inode);
> >        }
> > 
> > +     /*
> > +      * Since this function resolves references from active xattr entries,
> > +      * the EA inode must be in active state (i_nlink=1, ref_count>0).
> > +      * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> > +      * (inconsistent with an active entry) all indicate corruption or
> > +      * a concurrent last-reference drop.
> > +      */
> > +     if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> > +             ext4_error(parent->i_sb,
> > +                        "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> > +                        ea_ino, inode->i_nlink,
> > +                        ext4_xattr_inode_get_ref(inode));
> > Hum, given motivation of this is syzbot corrupted fs image, I'd just put
> > check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
> > done with it. Much simpler and catches at least the obvious cases. The
> > consistency of xattr refcount and i_nlink is otherwise guarded by
> > ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
> > we don't have the full view of the filesystem and so cannot ascertain that
> > xattr ref count matches reality...
> > 
> Thanks very much for your review and suggestions. The issue should be easy
> to fix once "introduce ext4_put_ea_inode() for safe deferred iput" is merged:
> we just need to defer the iput() and no longer mark the inode bad.

OK, just to make it clear, calling make_bad_inode() is never the right thing
to do once the inode has been exposed to other users. It is only usable
within iget functions, when loading the inode into memory and finding out
its content isn't valid. This function is one of the traps we have in VFS...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Zhang Yi @ 2026-06-18  1:43 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai
In-Reply-To: <xt3zkmgktl7wpbwt4de76wh4q576fkmytw6udeojmj4goi6ul6@np4zgi2zinkg>

On 6/18/2026 4:29 AM, Jan Kara wrote:
> On Tue 16-06-26 20:37:00, Zhang Yi wrote:
>> On 6/16/2026 6:04 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:26, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> The iomap buffered write path does not hold any locks between querying
>>>> inode extent mapping information and performing buffered writes. It
>>>> relies on the sequence counter saved in the inode to detect stale
>>>> mappings.
>>>
>>> Now that I'm looking at it again, I've got a bit confused here. Buffered
>>> write path is holding i_rwsem between mapping blocks and using them so
>>> there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
>>> then ext4_da_map_blocks() should not ever get called in the writeback path
>>> because it is allocating delayed blocks... So this change looks unnecessary
>>> to me now. Am I missing something?
>>>
>>> 								Honza
>>>
>>
>> Hi Jan,
>>
>> Thanks for coming back to this series. Sorry for the inaccurate
>> description in the commit message. However, this change is still needed.
>>
>> As mentioned in the comment before the ->iomap_valid() callback in
>> iomap_write_begin(), consider the following scenario — a buffered write
>> to a dirty unwritten extent, with this concurrent race:
>>
>>   Buffer write (holds i_rwsem)    Writeback (no i_rwsem)
>>   ext4_da_map_blocks()
>>     // ext4_es_lookup_extent()
>>     // finds UNWRITTEN extent
>>   ext4_set_iomap()
>>     // type = IOMAP_UNWRITTEN
>>     // validity_cookie = m_seq
>>                                   ext4_iomap_writepages()
>>                                     // writeback for unwritten extent
>>                                     // ext4_convert_unwritten_extents()
>>                                     // extent tree: UNWRITTEN → WRITTEN
>>                                     // advancing i_es_seq
>>   __iomap_write_begin()
>>     // ext4_iomap_valid(): cookie != i_es_seq
>>     // iomap invalidated, re-maps
>>     // gets type = IOMAP_MAPPED (WRITTEN)
>>     // iomap_block_needs_zeroing(): FALSE
>>
>> Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
>> would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
>> out blocks that have already been written, corrupting data on partial
>> writes.
> 
> Ah, right, thanks for the explanation. So please update the description to
> explicitly mention that iomap doesn't hold the folio lock between mapping the
> extent and copying data and thus can race with writeback modifying the
> extent type. Thanks!
> 
> 								Honza

Sure, thanks for pointing this out.

Cheers,
Yi.

> 
>>
>> Thanks,
>> Yi
>>
>>>>
>>>> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
>>>> blocks") added the m_seq field to ext4_map_blocks to pass out extent
>>>> sequence numbers, but it missed two callsites within
>>>> ext4_da_map_blocks(). These callsites are on the delayed allocation
>>>> path, which is also used in the iomap buffered write path. Pass out the
>>>> sequence counter to ensure stale mappings can be detected.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>> ---
>>>>   fs/ext4/inode.c | 4 ++--
>>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 6c4d9137b279..39577a6b65b9 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>>   	ext4_check_map_extents_env(inode);
>>>>   	/* Lookup extent status tree firstly */
>>>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>>   		map->m_len = min_t(unsigned int, map->m_len,
>>>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>>> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>>   	 * is held in write mode, before inserting a new da entry in
>>>>   	 * the extent status tree.
>>>>   	 */
>>>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>>   		map->m_len = min_t(unsigned int, map->m_len,
>>>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>>> -- 
>>>> 2.52.0
>>>>
>>



* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Zhou, Yun @ 2026-06-18  1:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	ebiggers, linux-ext4, linux-kernel
In-Reply-To: <xzd6pbxhxxwf3ngj5v5pz2ilhykh7vsegohe6iwfkfytfuuv4o@5obqeoc5wocr>



On 6/18/26 04:25, Jan Kara wrote:
> On Mon 15-06-26 13:35:12, Yun Zhou wrote:
>
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>                inode_unlock(inode);
>        }
>
> +     /*
> +      * Since this function resolves references from active xattr entries,
> +      * the EA inode must be in active state (i_nlink=1, ref_count>0).
> +      * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> +      * (inconsistent with an active entry) all indicate corruption or
> +      * a concurrent last-reference drop.
> +      */
> +     if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> +             ext4_error(parent->i_sb,
> +                        "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> +                        ea_ino, inode->i_nlink,
> +                        ext4_xattr_inode_get_ref(inode));
> Hum, given motivation of this is syzbot corrupted fs image, I'd just put
> check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
> done with it. Much simpler and catches at least the obvious cases. The
> consistency of xattr refcount and i_nlink is otherwise guarded by
> ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
> we don't have the full view of the filesystem and so cannot ascertain that
> xattr ref count matches reality...
>
Thanks very much for your review and suggestions. The issue should be easy
to fix once "introduce ext4_put_ea_inode() for safe deferred iput" is merged:
we just need to defer the iput() and no longer mark the inode bad.


* Re: [PATCH v7 0/4] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Zhou, Yun @ 2026-06-18  0:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel
In-Reply-To: <o63nwdmeoqscllaitti32enjhet4fcvstv5eh4wooviwxmsosl@mooyvfa5dfqm>



On 6/18/26 02:13, Jan Kara wrote:
> On Tue 16-06-26 23:15:54, Yun Zhou wrote:
>> This series fixes a circular lock dependency reported by syzbot:
>>
>>    s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
>>
>> The deadlock occurs when iput() on an EA inode triggers write_inode_now()
>> while xattr_sem and a jbd2 handle are held.  The triggering path is
>> during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
>> write_inode_now() synchronously.
>>
>> Patch 1 blocks the deadlock by skipping extra isize expansion when
>> !SB_ACTIVE -- this prevents the xattr manipulation path from being
>> entered during mount.
>>
>> Patch 2 is a belt-and-suspenders semantic improvement: an inode under
>> eviction never needs extra isize expansion.
>>
>> Patches 3-4 are a structural improvement using a per-sb workqueue:
>>
>>    Patch 3 introduces ext4_put_ea_inode(), which does direct iput() when
>>    SB_ACTIVE (zero overhead) and defers to a workqueue when !SB_ACTIVE.
>>    It also converts the first call site (ext4_xattr_block_set release
>>    path) which previously called iput under xattr_sem + jbd2 handle.
>>
>>    Patch 4 converts the remaining EA inode iput() calls that execute
>>    under locks.  Sites where direct iput() is provably safe (i_nlink=0
>>    after dec_ref, or lookup-only paths) are left unchanged with comments.
>>
>> Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Please don't send the series so quickly. I'd say twice per week is about
> maximum sensible cadence. It takes time (easily several days) for people to
> get to look at your patches and sending your patches sometimes even several
> times per day just creates a mess in the mailbox.

I admit I've been pushing patches too frequently, but I'll definitely tone
it down going forward. The reason is that I get feedback from the 'AI Review'
almost immediately after submitting. It flags a lot of edge cases: some I
introduced, others that were pre-existing, or even deadlock paths I missed
despite claiming they were fixed. Most of these are genuine issues. Even if
some are extremely strict, I just want to resolve them as quickly as possible.

I mistakenly thought that once the AI Reviewer raised concerns, human reviewers
wouldn't check it again. So I kept pushing new patches to fix those potential
issues, only for the bot to immediately flag new ones... It would be great if
there were a way to send patches directly to the AI bot for a separate review.
> Also in some previous version, I gave my Reviewed-by tag for patch 1 and
> some comment and Reviewed-by tag for patch 2. So please reflect that in the
> next posting.
Got it, I will add it in the next version.

Thanks,
Yun



* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Jan Kara @ 2026-06-17 20:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, djwong,
	hch, yi.zhang, yangerkun, yukuai
In-Reply-To: <58ee3009-8a0e-4657-a473-475ea998316e@gmail.com>

On Tue 16-06-26 20:37:00, Zhang Yi wrote:
> On 6/16/2026 6:04 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:26, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > The iomap buffered write path does not hold any locks between querying
> > > inode extent mapping information and performing buffered writes. It
> > > relies on the sequence counter saved in the inode to detect stale
> > > mappings.
> > 
> > Now that I'm looking at it again, I've got a bit confused here. Buffered
> > write path is holding i_rwsem between mapping blocks and using them so
> > there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
> > then ext4_da_map_blocks() should not ever get called in the writeback path
> > because it is allocating delayed blocks... So this change looks unnecessary
> > to me now. Am I missing something?
> > 
> > 								Honza
> > 
> 
> Hi Jan,
> 
> Thanks for coming back to this series. Sorry for the inaccurate
> description in the commit message. However, this change is still needed.
> 
> As mentioned in the comment before the ->iomap_valid() callback in
> iomap_write_begin(), consider the following scenario — a buffered write
> to a dirty unwritten extent, with this concurrent race:
> 
>   Buffer write (holds i_rwsem)    Writeback (no i_rwsem)
>   ext4_da_map_blocks()
>     // ext4_es_lookup_extent()
>     // finds UNWRITTEN extent
>   ext4_set_iomap()
>     // type = IOMAP_UNWRITTEN
>     // validity_cookie = m_seq
>                                   ext4_iomap_writepages()
>                                     // writeback for unwritten extent
>                                     // ext4_convert_unwritten_extents()
>                                     // extent tree: UNWRITTEN → WRITTEN
>                                     // advancing i_es_seq
>   __iomap_write_begin()
>     // ext4_iomap_valid(): cookie != i_es_seq
>     // iomap invalidated, re-maps
>     // gets type = IOMAP_MAPPED (WRITTEN)
>     // iomap_block_needs_zeroing(): FALSE
> 
> Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
> would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
> out blocks that have already been written, corrupting data on partial
> writes.

Ah, right, thanks for the explanation. So please update the description to
explicitly mention that iomap doesn't hold the folio lock between mapping the
extent and copying data and thus can race with writeback modifying the
extent type. Thanks!
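
The race in the diagram above can be modelled in a few lines of userspace C.
This is only a sketch under simplifying assumptions: a single extent, a plain
unsigned counter standing in for i_es_seq, and hypothetical model_* names
invented for illustration. It is not the ext4 or iomap implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum model_type { MODEL_UNWRITTEN, MODEL_MAPPED };

/* Inode side: one extent plus a sequence counter (stand-in for i_es_seq). */
struct model_extent {
	enum model_type type;
	unsigned int seq;
};

/* Writer side: a cached mapping plus the cookie sampled when it was made. */
struct model_iomap {
	enum model_type type;
	unsigned int validity_cookie;
};

/* Map the extent and pass the sequence number out with the mapping. */
static void model_map(const struct model_extent *es, struct model_iomap *iomap)
{
	iomap->type = es->type;
	iomap->validity_cookie = es->seq;
}

/* Writeback converts UNWRITTEN to WRITTEN and advances the counter. */
static void model_writeback_convert(struct model_extent *es)
{
	assert(es->type == MODEL_UNWRITTEN);
	es->type = MODEL_MAPPED;
	es->seq++;
}

/* The mapping is still valid only if the counter has not moved. */
static bool model_iomap_valid(const struct model_extent *es,
			      const struct model_iomap *iomap)
{
	return iomap->validity_cookie == es->seq;
}

/* Write path: revalidate, re-map if stale, and return the type acted on. */
static enum model_type model_write_begin(const struct model_extent *es,
					 struct model_iomap *iomap)
{
	if (!model_iomap_valid(es, iomap))
		model_map(es, iomap);
	return iomap->type;
}
```

Calling model_map(), then model_writeback_convert(), then
model_write_begin() makes the cookie comparison fail, so the writer re-maps
and acts on MODEL_MAPPED instead of the stale MODEL_UNWRITTEN. If the cookie
were never filled in at mapping time, the comparison would have nothing
meaningful to detect the change with.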

								Honza

> 
> Thanks,
> Yi
> 
> > > 
> > > Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> > > blocks") added the m_seq field to ext4_map_blocks to pass out extent
> > > sequence numbers, but it missed two callsites within
> > > ext4_da_map_blocks(). These callsites are on the delayed allocation
> > > path, which is also used in the iomap buffered write path. Pass out the
> > > sequence counter to ensure stale mappings can be detected.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> > >   fs/ext4/inode.c | 4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6c4d9137b279..39577a6b65b9 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> > >   	ext4_check_map_extents_env(inode);
> > >   	/* Lookup extent status tree firstly */
> > > -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > >   		map->m_len = min_t(unsigned int, map->m_len,
> > >   				   es.es_len - (map->m_lblk - es.es_lblk));
> > > @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> > >   	 * is held in write mode, before inserting a new da entry in
> > >   	 * the extent status tree.
> > >   	 */
> > > -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > >   		map->m_len = min_t(unsigned int, map->m_len,
> > >   				   es.es_len - (map->m_lblk - es.es_lblk));
> > > -- 
> > > 2.52.0
> > > 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Jan Kara @ 2026-06-17 20:25 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615053512.1315992-1-yun.zhou@windriver.com>

On Mon 15-06-26 13:35:12, Yun Zhou wrote:
> Validate EA inode state in ext4_xattr_inode_iget() to prevent
> WARN_ONCE triggers in ext4_xattr_inode_update_ref() and reject
> corrupted EA inodes before they can cause further damage.
> 
> When a corrupted ext4 image has an EA inode with inconsistent i_nlink
> and ref_count values, the code currently allows it through and later
> hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
> While this is not a security or stability issue -- it only fires on
> crafted filesystem images and merely prints a call trace -- it is
> better handled as an early sanity check that returns -EFSCORRUPTED,
> consistent with how ext4 treats other on-disk corruption.
> 
> Since ext4_xattr_inode_iget() resolves references from active xattr
> entries, the target EA inode must be in active state (i_nlink=1,
> ref_count>0).  Reject any inode that does not satisfy this.
> 
> Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
> Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> ---
> v4:
>  - Take I_MUTEX_XATTR before checking orphan state to safely decide
>    whether to call make_bad_inode(), avoiding orphan list corruption
>    if another thread is concurrently freeing the EA inode
> 
> v3:
>  - Move check after Lustre branch to avoid false positives on Lustre EA inodes
>  - Merge into single condition: i_nlink != 1 || !ref_count
>  - Add make_bad_inode() before iput() to avoid truncation in active txn
> 
> v2:
>  - Add ref_count validation to also catch i_nlink=1, ref_count=0 case
> 
>  fs/ext4/xattr.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..8efd6368f956 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>  		inode_unlock(inode);
>  	}
>  
> +	/*
> +	 * Since this function resolves references from active xattr entries,
> +	 * the EA inode must be in active state (i_nlink=1, ref_count>0).
> +	 * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> +	 * (inconsistent with an active entry) all indicate corruption or
> +	 * a concurrent last-reference drop.
> +	 */
> +	if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> +		ext4_error(parent->i_sb,
> +			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> +			   ea_ino, inode->i_nlink,
> +			   ext4_xattr_inode_get_ref(inode));

Hum, given motivation of this is syzbot corrupted fs image, I'd just put
check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
done with it. Much simpler and catches at least the obvious cases. The
consistency of xattr refcount and i_nlink is otherwise guarded by
ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
we don't have the full view of the filesystem and so cannot ascertain that
xattr ref count matches reality...

								Honza

> +		/*
> +		 * Mark rejected inode to prevent ext4_evict_inode() from
> +		 * attempting truncation on a corrupted inode within an active
> +		 * transaction, which could exhaust journal credits. The lock
> +		 * serializes against ext4_xattr_inode_update_ref() which
> +		 * does clear_nlink() + ext4_orphan_add() under the same lock.
> +		 */
> +		inode_lock_nested(inode, I_MUTEX_XATTR);
> +		if (!ext4_inode_orphan_tracked(inode))
> +			make_bad_inode(inode);
> +		inode_unlock(inode);
> +		iput(inode);
> +		return -EFSCORRUPTED;
> +	}
> +
>  	*ea_inode = inode;
>  	return 0;
>  }
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v3] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Zorro Lang @ 2026-06-17 19:33 UTC (permalink / raw)
  To: Disha Goel
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin, djwong
In-Reply-To: <20260608102328.40916-1-disgoel@linux.ibm.com>

On Mon, Jun 08, 2026 at 03:53:28PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
> on DAX files.
> 
> Add an ext4-specific check in _require_defrag() to skip tests when DAX
> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
> generic/018.
> 
> XFS defrag works with DAX, so this check is ext4-specific.
> 
> Suggested-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> Changes in v3:
> - Move the DAX check inside the ext4 case statement as
>   suggested by Darrick

Makes sense to me,

Reviewed-by: Zorro Lang <zlang@kernel.org>

> 
>  common/defrag | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..baf05d94 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -13,6 +13,8 @@ _require_defrag()
>          DEFRAG_PROG="$XFS_FSR_PROG"
>  	;;
>      ext4)
> +        __scratch_uses_fsdax && _notrun "ext4 online defrag not supported with DAX"
> +
>  	testfile="$TEST_DIR/$$-test.defrag"
>  	donorfile="$TEST_DIR/$$-donor.defrag"
>  	bsize=`_get_block_size $TEST_DIR`
> -- 
> 2.45.1
> 
> 


* Re: [PATCH v7 3/4] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Jan Kara @ 2026-06-17 18:42 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, linux-ext4, linux-kernel
In-Reply-To: <20260616151558.1728881-4-yun.zhou@windriver.com>

On Tue 16-06-26 23:15:57, Yun Zhou wrote:
> Calling iput() on EA inodes while holding xattr_sem or a jbd2 handle
> can trigger write_inode_now() -> ext4_writepages() -> s_writepages_rwsem,
> creating a lock ordering issue during mount (!SB_ACTIVE).
> 
> Add ext4_put_ea_inode() which safely releases EA inode references:
> when SB_ACTIVE, it calls iput() directly (write_inode_now cannot be
> triggered); during mount (!SB_ACTIVE), it queues the inode on a per-sb
> lock-free llist and schedules a worker to call iput() in a clean
> context without holding any ext4 locks.
> 
> Convert the iput in ext4_xattr_block_set()'s "Drop the previous xattr
> block" path to use ext4_xattr_inode_array_free_deferred(), which
> releases EA inodes via ext4_put_ea_inode().  This path previously called
> ext4_xattr_inode_array_free() (synchronous iput) while holding xattr_sem
> and a jbd2 handle.
> 
> The worker is flushed in ext4_put_super() before journal destruction to
> ensure all pending EA inode cleanup completes while the journal is still
> available.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Yes, this goes in the direction I intended. But I have a couple of
suggestions:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..690202303269 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1706,6 +1706,11 @@ struct ext4_sb_info {
>  	struct ext4_es_stats s_es_stats;
>  	struct mb_cache *s_ea_block_cache;
>  	struct mb_cache *s_ea_inode_cache;
> +
> +	/* Deferred iput for EA inodes to avoid lock ordering issues */
> +	struct llist_head s_ea_inode_to_free;
> +	struct work_struct s_ea_inode_work;
> +

I'd probably use delayed work and schedule it with a delay of one jiffy so
that some inodes can accumulate before we process them, which should reduce
the amount of task switching to workqueues.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..b777bb0a81ea 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1308,6 +1308,9 @@ static void ext4_put_super(struct super_block *sb)
>  	destroy_workqueue(sbi->rsv_conversion_wq);
>  	ext4_release_orphan_info(sb);
>  
> +	/* Flush deferred EA inode iputs before destroying journal */
> +	flush_work(&sbi->s_ea_inode_work);
> +

This should happen earlier in ext4_put_super(). At this place quotas were
already turned off and so quota accounting would go wrong.

> @@ -3025,6 +3028,74 @@ void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
>  	kfree(ea_inode_array);
>  }
>  
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> +				struct ext4_xattr_inode_array *array)
> +{
> +	int idx;
> +
> +	if (array == NULL)
> +		return;
> +
> +	for (idx = 0; idx < array->count; ++idx)
> +		ext4_put_ea_inode(sb, array->inodes[idx]);
> +	kfree(array);
> +}

The array of EA inodes used in xattr handling is just another mechanism
used for delaying iput() of EA inodes. It doesn't make sense to stack these
one on top of another. Just completely replace the array mechanism with
always deferring iput of EA inode into the workqueue.

> +struct ext4_ea_iput_entry {
> +	struct llist_node node;
> +	struct inode *inode;
> +};
> +
> +/*
> + * Worker function for deferred EA inode iput.  Processes all inodes queued
> + * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
> + */
> +void ext4_ea_inode_work(struct work_struct *work)
> +{
> +	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
> +						s_ea_inode_work);
> +	struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
> +	struct llist_node *next;
> +
> +	while (node) {
> +		struct ext4_ea_iput_entry *entry = container_of(node,
> +				struct ext4_ea_iput_entry, node);
> +		next = node->next;
> +		iput(entry->inode);
> +		kfree(entry);
> +		node = next;
> +	}
> +}

Allocating ext4_ea_iput_entry for dropping each inode is somewhat wasteful.
I want to suggest another scheme (somewhat more involved but more
efficient):

1) Create a VFS helper bool iput_if_not_last(struct inode *inode) which
drops inode reference if it is not the last one (and returns true in that
case). Basically:

bool iput_if_not_last(struct inode *inode)
{
	return atomic_add_unless(&inode->i_count, -1, 1);
}

This needs to be a separate patch as it should get vetting from VFS
maintainers.

2) Use iput_if_not_last() in ext4_put_ea_inode(). If it returns true, we
are done. Otherwise we know we were at least for a moment holders of the
last inode reference, so we link the inode to the list of inodes to drop
through llist_node embedded in ext4_inode_info. We cannot race with anybody
else trying to link the same inode into the list because we hold one inode
ref and so nobody else can hit this "I was holding the last ref" path.
I'd union this llist_node say with xattr_sem which is unused for EA inodes
to avoid growing ext4_inode_info.

This way we avoid offloading unless really necessary and we don't have to
do allocations just to drop EA inode ref.
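
To make the two steps above concrete, here is a minimal userspace sketch, not
kernel code. It models i_count with a C11 atomic_int and emulates
atomic_add_unless() with a compare-exchange loop. struct model_inode,
model_put_ea_inode(), model_drain() and the to_free list head are
hypothetical names invented for illustration, standing in for struct inode,
ext4_put_ea_inode(), the worker and s_ea_inode_to_free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode: a refcount plus an embedded link. */
struct model_inode {
	atomic_int i_count;
	struct model_inode *next_to_free;	/* models the embedded llist_node */
};

/* Hypothetical lock-free list of inodes whose last reference is deferred. */
static _Atomic(struct model_inode *) to_free;

/* Userspace model of atomic_add_unless(v, a, u): add a unless *v == u. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Drop a reference unless it is the last one; true means it was dropped. */
static bool model_iput_if_not_last(struct model_inode *inode)
{
	return model_add_unless(&inode->i_count, -1, 1);
}

/* Step 2: drop directly when possible, otherwise queue for the worker. */
static void model_put_ea_inode(struct model_inode *inode)
{
	struct model_inode *head;

	assert(atomic_load(&inode->i_count) >= 1);	/* caller holds a ref */
	if (model_iput_if_not_last(inode))
		return;

	/* We hold the last reference, so nobody else can queue this inode. */
	head = atomic_load(&to_free);
	do {
		inode->next_to_free = head;
	} while (!atomic_compare_exchange_weak(&to_free, &head, inode));
}

/* Worker side: detach the whole list at once and drop the final refs. */
static int model_drain(void)
{
	struct model_inode *node =
		atomic_exchange(&to_free, (struct model_inode *)NULL);
	int dropped = 0;

	while (node) {
		struct model_inode *next = node->next_to_free;

		atomic_fetch_sub(&node->i_count, 1);	/* stands in for iput() */
		dropped++;
		node = next;
	}
	return dropped;
}
```

With a count of 2, the first model_put_ea_inode() call simply decrements and
nothing is queued. The second call sees a count of 1, leaves it untouched and
links the inode onto the list, so only the final reference ever reaches the
deferred path and no allocation is needed to get it there.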

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v7 0/4] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Jan Kara @ 2026-06-17 18:13 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, linux-ext4, linux-kernel
In-Reply-To: <20260616151558.1728881-1-yun.zhou@windriver.com>

On Tue 16-06-26 23:15:54, Yun Zhou wrote:
> This series fixes a circular lock dependency reported by syzbot:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> The deadlock occurs when iput() on an EA inode triggers write_inode_now()
> while xattr_sem and a jbd2 handle are held.  The triggering path is
> during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
> write_inode_now() synchronously.
> 
> Patch 1 blocks the deadlock by skipping extra isize expansion when
> !SB_ACTIVE -- this prevents the xattr manipulation path from being
> entered during mount.
> 
> Patch 2 is a belt-and-suspenders semantic improvement: an inode under
> eviction never needs extra isize expansion.
> 
> Patches 3-4 are a structural improvement using a per-sb workqueue:
> 
>   Patch 3 introduces ext4_put_ea_inode(), which does direct iput() when
>   SB_ACTIVE (zero overhead) and defers to a workqueue when !SB_ACTIVE.
>   It also converts the first call site (ext4_xattr_block_set release
>   path) which previously called iput under xattr_sem + jbd2 handle.
> 
>   Patch 4 converts the remaining EA inode iput() calls that execute
>   under locks.  Sites where direct iput() is provably safe (i_nlink=0
>   after dec_ref, or lookup-only paths) are left unchanged with comments.
> 
> Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5

Please don't send the series so quickly. I'd say twice per week is about
maximum sensible cadence. It takes time (easily several days) for people to
get to look at your patches and sending your patches sometimes even several
times per day just creates a mess in the mailbox.

Also in some previous version, I gave my Reviewed-by tag for patch 1 and
some comment and Reviewed-by tag for patch 2. So please reflect that in the
next posting.

								Honza

> v7:
>  - Replaced the deferred-iput array threading approach (v4-v6) with a
>    simpler per-sb workqueue + lock-free llist design.  No function
>    signature changes needed.  ext4_put_ea_inode() does direct iput when
>    SB_ACTIVE (zero overhead in normal operation) and defers to the
>    workqueue only during mount (!SB_ACTIVE).
>  - Converted the iput in ext4_xattr_delete_inode()'s quota accounting
>    loop to ext4_put_ea_inode() to eliminate a lockdep-reportable lock
>    ordering violation (jbd2_handle -> iput -> s_writepages_rwsem).
>  - Moved flush_work() before the if (sbi->s_journal) check in
>    ext4_put_super() to cover nojournal mode.
> 
> v6:
>  - ext4_inline_data_truncate(): use local ea_inode_array instead of
>    passing NULL, freed after ext4_journal_stop().  Fixes a deadlock
>    reachable via crafted filesystem where inline data xattr entry has
>    e_value_inum set: orphan cleanup -> ext4_truncate ->
>    ext4_inline_data_truncate -> iput under !SB_ACTIVE.
> 
> v5:
>  - Split into 3 patches for easier review.
>  - Add explicit !SB_ACTIVE early-return in ext4_try_to_expand_extra_isize()
>    to block ALL mount-time paths (ext4_process_orphan -> ext4_truncate ->
>    ext4_mark_inode_dirty), not just the eviction path. v4 only relied on
>    EXT4_STATE_NO_EXPAND which doesn't cover orphan truncation.
> 
> v4:
>  - Comprehensive rewrite of the deferred iput mechanism.
>  - Thread ea_inode_array through ext4_expand_extra_isize_ea() and
>    ext4_xattr_move_to_block() so ALL ea_inode iputs in the expand
>    path are deferred, not just those in ext4_xattr_block_set().
>  - Add NULL safety to ext4_expand_inode_array(): when ea_inode_array
>    pointer is NULL, fall back to synchronous iput (for callers like
>    ext4_initxattrs that only run with SB_ACTIVE).
>  - Use __GFP_NOFAIL to guarantee deferred array growth, eliminating
>    fallback to synchronous iput under locks.
>  - Update ext4_xattr_ibody_set() and ext4_xattr_set_entry() signatures
>    to accept ea_inode_array, converting ALL iput(ea_inode) calls.
>  - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() before
>    ext4_mark_inode_dirty().
> 
> v3:
>  - Check ext4_expand_inode_array() return value; fallback to
>    direct iput() on ENOMEM to prevent inode leak.
>  - Make ext4_xattr_set_handle() take an optional ea_inode_array
>    output parameter so callers can free after ext4_journal_stop(),
>    avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
>  - Pass ea_inode_array directly to ext4_xattr_release_block()
>    instead of using a local array freed under xattr_sem.
>  - Move ext4_xattr_inode_array_free() after ext4_journal_stop()
> 
> v2:
>  - Defer iput() in ext4_xattr_block_set() via ea_inode_array,
>    freed after xattr_sem is released. Fixes the root cause.
> 
> v1:
>  - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
>    on inodes being deleted. Only fixes the syzbot reproducer, not
>    the underlying lock ordering violation.
> 
> Yun Zhou (4):
>   ext4: skip extra isize expansion during mount to prevent deadlock
>   ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
>   ext4: introduce ext4_put_ea_inode() for safe deferred iput
>   ext4: convert remaining EA inode iput() calls to ext4_put_ea_inode()
> 
>  fs/ext4/ext4.h  |   5 +++
>  fs/ext4/inode.c |  11 +++++
>  fs/ext4/super.c |   6 +++
>  fs/ext4/xattr.c | 105 +++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/xattr.h |   2 +
>  5 files changed, 120 insertions(+), 9 deletions(-)
> 
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
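
A rough userspace sketch of the "direct put when active, defer otherwise"
idea from the quoted cover letter follows. It uses C11 atomics in place of
the kernel's llist and a plain drain function in place of the workqueue.
demo_node, demo_list, demo_put() and demo_drain() are hypothetical names
and do not correspond to the functions in the series; this is an
illustration under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object whose release must not run under certain locks. */
struct demo_node {
	struct demo_node *next;
	int released;
};

/* Hypothetical lock-free stack of objects waiting for a deferred release. */
struct demo_list {
	_Atomic(struct demo_node *) first;
};

/* Stand-in for the expensive release (iput() in the discussion above). */
static void demo_release(struct demo_node *n)
{
	n->released = 1;
}

/* Push without taking any lock: retry until the head swap succeeds. */
static void demo_list_add(struct demo_list *l, struct demo_node *n)
{
	struct demo_node *old = atomic_load(&l->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, n));
}

/*
 * Release immediately when 'active' is true, otherwise queue the node.
 * Returns true if the release was deferred.
 */
static bool demo_put(struct demo_list *l, struct demo_node *n, bool active)
{
	if (active) {
		demo_release(n);
		return false;
	}
	demo_list_add(l, n);
	return true;
}

/* Detach the whole list at once and release every node; returns the count. */
static int demo_drain(struct demo_list *l)
{
	struct demo_node *n = atomic_exchange(&l->first,
					      (struct demo_node *)NULL);
	int count = 0;

	while (n) {
		struct demo_node *next = n->next;

		demo_release(n);
		n = next;
		count++;
	}
	return count;
}
```

The point of the split is that demo_put() never blocks and never runs the
release in the deferred case, so it can be called with locks held, while
demo_drain() is expected to run later from a context that holds none of
them.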

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Zhang Yi @ 2026-06-17 13:22 UTC (permalink / raw)
  To: Brian Foster, Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, ojaswin, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai
In-Reply-To: <ajJ9b91fTbM6qApg@bfoster>

On 6/17/2026 6:56 PM, Brian Foster wrote:
> On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
>> On 6/16/2026 8:28 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:34, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
>>>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
>>>> infrastructure for ext4.
>>>>
>>>> ext4_iomap_block_zero_range() calls iomap_zero_range() with
>>>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
>>>> out either a mapped partial block or a dirty, unwritten partial block.
>>>>
>>>> Important constraints:
>>>>
>>>> Zeroing out under an active journal handle can cause deadlock, because
>>>> the order of acquiring the folio lock and starting a handle is
>>>> inconsistent with the iomap writeback path.
>>>>
>>>> Therefore, ext4_iomap_block_zero_range():
>>>> - Must NOT be called under an active handle.
>>>> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>>>>    before updating i_disksize (for the cases of post-EOF append write,
>>>>    post-EOF fallocate, and truncate up). In subsequent patches, we will
>>>>    address this by synchronizing commit I/O but not waiting for
>>>>    completion, and updating i_disksize to i_size only after the zeroed
>>>>    data has been written back.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 92 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index c6fe42d012fc..e0dae2501292 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static int ext4_iomap_zero_begin(struct inode *inode,
>>>> +		loff_t offset, loff_t length, unsigned int flags,
>>>> +		struct iomap *iomap, struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>
>>> This looks like a layering violation to me. I don't think you can safely
>>> assume the iomap you're passed is a part of iomap_iter...
>>>
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	unsigned int iomap_flags = 0;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
>>>> +		return -EINVAL;
>>>> +
>>>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>>>> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
>>>> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
>>>> +
>>>> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
>>>> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (start >> blkbits) - map.m_lblk;
>>>> +	}
>>>
>>> ... and you need access to iter only for this which seems to be really a
>>> hack that's trying to outsmart the iomap code. I have to admit I don't
>>> fully understand what you are trying to achieve here. Are you trying to
>>> avoid flushing of the range that will be zeroed out?
>>
>> This logic is copied from the XFS and iomap infrastructure. Its primary
>> purpose is to optimize the zeroing operations on dirty unwritten extents.
>> It was introduced by Brian in [1].
>>
>> The history as I understand it: originally, the iomap infrastructure
>> could not zero dirty unwritten extents during zero range processing,
>> which led to stale data exposure. XFS had to flush dirty ranges itself
>> before zeroing — a workaround that was not generic.
>>
>> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
>> extents"), Brian added an unconditional flush in the iomap
>> infrastructure, ensuring that by the time zeroing runs the extent has
>> already been converted to written so the zero can proceed correctly.
>> However, this flush was too heavy and introduced noticeable performance
>> overhead.
>>
>> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
>> conditional on unwritten mappings"), which restricts flushing to only
>> dirty pagecache over unwritten or hole mappings.
>>
>> Brian later proposed a different approach: rather than relying on flush
>> to convert the extent type, find dirty folios ahead of the zero range
>> and zero the dirty unwritten extents directly. In [1] he added this
>> lookup logic. The filesystem now supplies a folio batch (a collection of
>> dirty folios) via the iomap begin callback, and zero range iterates over
>> these dirty folios to perform zeroing. Clean regions not covered by the
>> batch are simply skipped. This entirely eliminates the need to flush.
>>
>> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
>>
>> If I understand correctly, the current approach is a compromise, and
>> Brian is still working on this. Perhaps ext4 and XFS could work together
>> on improvements in the future?
>>
> 
> I think that about covers it!
> 
> I do agree wrt the iomap_iter thing in that it doesn't seem like the
> most elegant thing. I considered that a bit of a roadblock when first
> hacking on the batch stuff, but IIRC somebody pointed out that there was
> precedent already so I didn't think too hard about it after that. Indeed
> if you poke around, other filesystems use a similar pattern to access
> iter->private for whatever private context is carried around.
> 
> FWIW, one of the longer term thoughts for the dirty folio stuff was to
> eventually lift it out of the callback and just have iomap do it for the
> fs. That would eliminate this particular pattern and probably clean
> things up a bit, but there were also some other caveats with that that
> aren't top of mind atm (IIRC, things like dealing with map trimming,
> etc., but I haven't had a chance to think about it in a while).

Thanks for the additional information. This looks like a promising
direction. The usage on the ext4 side is relatively simple at the
moment, so the main concern is whether this would cause any issues for
XFS and other filesystems. This can be investigated later when time
permits — it's not urgent at this point.
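
For readers unfamiliar with the pattern under discussion, here is a
minimal userspace sketch of recovering an enclosing structure from a
pointer to an embedded member. The names demo_map, demo_iter and
demo_iter_from_map() are hypothetical stand-ins, not the real struct
iomap, struct iomap_iter or any kernel function, and the container_of()
below is a simplified version without the kernel's type checking. It only
aims to show that the trick is correct when the member really is embedded
and undefined otherwise, which is the layering concern raised in the
review.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical mapping description handed to a callback. */
struct demo_map {
	long long offset;
	long long length;
};

/* Hypothetical iterator that embeds the mapping it passes down. */
struct demo_iter {
	int flags;
	struct demo_map map;	/* embedded member, not a pointer */
	void *private;
};

/*
 * Recover the iterator from the embedded mapping. This is valid only if
 * 'map' really is the 'map' member of a struct demo_iter; the compiler
 * cannot check that.
 */
static struct demo_iter *demo_iter_from_map(struct demo_map *map)
{
	return container_of(map, struct demo_iter, map);
}
```

Calling demo_iter_from_map(&it.map) on a struct demo_iter named it
returns &it. Passing a standalone struct demo_map instead would compute a
pointer into unrelated memory, with no error at compile time.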

> 
> Also note that this isn't necessarily a hard requirement. It's an
> optional optimization. iomap will flush and retry in the dirty
> pagecache+unwritten extent case if the fs hasn't otherwise provided
> folios to make sure it zeroes properly, it's just that performance of
> that may or may not be acceptable for your use case.
> 
> Brian

Yes, we will try to adopt the folio batch approach as much as possible,
hoping to reduce unnecessary performance overhead through it.

Thanks,
Yi.

> 
>>>
>>>> +	ret = iomap_zero_range(inode, from, length, did_zero,
>>>> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
>>>> +			       NULL);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * TODO: The iomap does not distinguish between different types of
>>>> +	 * zeroing and always sets zero_written if a zeroing operation is
>>>> +	 * performed, which may result in unnecessary order operations.
>>>> +	 */
>>>
>>> Is this still true after your fix to did_zero handling?
>>
>> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
>> operation has occurred through did_zero parameter, but it cannot
>> distinguish whether the zeroed range is a written extent that already
>> exists on disk. That is, even if the zeroing is performed on a delalloc
>> extent, did_zero will still return true.
>>
>> Thanks,
>> Yi.
>>
>>>
>>>> +	if (did_zero && zero_written)
>>>> +		*zero_written = *did_zero;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * Zeros out a mapping of length 'length' starting from file offset
>>>>    * 'from'.  The range to be zero'd must be contained with in one block.
>>>
>>> 								Honza
>>
> 


* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Zhang Yi @ 2026-06-17 13:00 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai, Brian Foster
In-Reply-To: <n3bxam2j3iwdo3u6od5esn4pjqrtgxbdzdyemzpufpor4u2lbh@itrun3kmz52e>

On 6/17/2026 6:50 PM, Jan Kara wrote:
> On Wed 17-06-26 16:14:40, Zhang Yi wrote:
>> On 6/16/2026 8:28 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:34, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
>>>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
>>>> infrastructure for ext4.
>>>>
>>>> ext4_iomap_block_zero_range() calls iomap_zero_range() with
>>>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
>>>> out either a mapped partial block or a dirty, unwritten partial block.
>>>>
>>>> Important constraints:
>>>>
>>>> Zeroing out under an active journal handle can cause deadlock, because
>>>> the order of acquiring the folio lock and starting a handle is
>>>> inconsistent with the iomap writeback path.
>>>>
>>>> Therefore, ext4_iomap_block_zero_range():
>>>> - Must NOT be called under an active handle.
>>>> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>>>>    before updating i_disksize (for the cases of post-EOF append write,
>>>>    post-EOF fallocate, and truncate up). In subsequent patches, we will
>>>>    address this by synchronizing commit I/O but not waiting for
>>>>    completion, and updating i_disksize to i_size only after the zeroed
>>>>    data has been written back.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 92 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index c6fe42d012fc..e0dae2501292 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static int ext4_iomap_zero_begin(struct inode *inode,
>>>> +		loff_t offset, loff_t length, unsigned int flags,
>>>> +		struct iomap *iomap, struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>
>>> This looks like a layering violation to me. I don't think you can safely
>>> assume the iomap you're passed is a part of iomap_iter...
>>>
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	unsigned int iomap_flags = 0;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
>>>> +		return -EINVAL;
>>>> +
>>>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>>>> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
>>>> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
>>>> +
>>>> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
>>>> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (start >> blkbits) - map.m_lblk;
>>>> +	}
>>>
>>> ... and you need access to iter only for this which seems to be really a
>>> hack that's trying to outsmart the iomap code. I have to admit I don't
>>> fully understand what you are trying to achieve here. Are you trying to
>>> avoid flushing of the range that will be zeroed out?
>>
>> This logic is copied from the XFS and iomap infrastructure. Its primary
>> purpose is to optimize the zeroing operations on dirty unwritten extents.
>> It was introduced by Brian in [1].
> 
> Ah, I see. I still find it hacky but apparently it is an established hack
> in iomap :). Fair.
> 
>> The history as I understand it: originally, the iomap infrastructure
>> could not zero dirty unwritten extents during zero range processing,
>> which led to stale data exposure. XFS had to flush dirty ranges itself
>> before zeroing — a workaround that was not generic.
>>
>> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
>> extents"), Brian added an unconditional flush in the iomap
>> infrastructure, ensuring that by the time zeroing runs the extent has
>> already been converted to written so the zero can proceed correctly.
>> However, this flush was too heavy and introduced noticeable performance
>> overhead.
>>
>> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
>> conditional on unwritten mappings"), which restricts flushing to only
>> dirty pagecache over unwritten or hole mappings.
>>
>> Brian later proposed a different approach: rather than relying on flush
>> to convert the extent type, find dirty folios ahead of the zero range
>> and zero the dirty unwritten extents directly. In [1] he added this
>> lookup logic. The filesystem now supplies a folio batch (a collection of
>> dirty folios) via the iomap begin callback, and zero range iterates over
>> these dirty folios to perform zeroing. Clean regions not covered by the
>> batch are simply skipped. This entirely eliminates the need to flush.
>>
>> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
> 
> Thanks for the summary! So I was confused because somehow I thought this is
> about fallocate(FALLOC_FL_ZERO_RANGE) and so I was wondering why we just
> cannot evict the page cache and be done with that. Only after reading
> everything again I've realized this is about zeroing partial blocks on hole
> punch etc. And we may need to really handle multiple folios because XFS
> also uses this mechanism to implement FALLOC_FL_ZERO_RANGE for zoned
> storage. Ugh. OK, anyway for now this looks like your patch is following
> how things are expected to be done so feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
>>>> +	/*
>>>> +	 * TODO: The iomap does not distinguish between different types of
>>>> +	 * zeroing and always sets zero_written if a zeroing operation is
>>>> +	 * performed, which may result in unnecessary order operations.
>>>> +	 */
>>>
>>> Is this still true after your fix to did_zero handling?
>>
>> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
>> operation has occurred through did_zero parameter, but it cannot
>> distinguish whether the zeroed range is a written extent that already
>> exists on disk. That is, even if the zeroing is performed on a delalloc
>> extent, did_zero will still return true.
> 
> So maybe write in the comment explicitly that this may result in
> unnecessary flushing of folios if zeroing happened in
> delayed-not-yet-allocated blocks?
> 
> 								Honza

Sure, I'll include it in the next iteration. :)

Thanks,
Yi.
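
The trimming arithmetic in the quoted ext4_iomap_zero_begin() hunk can be
worked through in isolation. The sketch below is userspace C with a
hypothetical name (demo_trim_len() is not a kernel function), and it
assumes, as the hunk appears to, that the dirty folio lookup leaves
'start' at the byte offset where its coverage ended and never moves it
below the start of the mapping.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Trim a mapping of 'len' blocks starting at logical block 'lblk' so that
 * it ends where the folio lookup stopped. 'covered_end' is a byte offset,
 * 'blkbits' is log2 of the block size. The shift rounds down, so a lookup
 * that stopped in the middle of a block does not include that block.
 */
static uint32_t demo_trim_len(uint32_t lblk, uint32_t len,
			      int64_t covered_end, unsigned int blkbits)
{
	assert(covered_end >= 0);

	uint64_t map_end_blk = (uint64_t)lblk + len;
	uint64_t covered_blk = (uint64_t)covered_end >> blkbits;

	/* Assumption: the lookup never reports less than the mapping start. */
	assert(covered_blk >= lblk);

	if (covered_blk < map_end_blk)
		return (uint32_t)(covered_blk - lblk);
	return len;
}
```

With 4 KiB blocks (blkbits = 12), a mapping of 8 blocks starting at block
10 whose coverage ends at byte 14 << 12 is trimmed to 4 blocks, while
coverage reaching block 18 leaves the length at 8.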




* [PATCH v7 11/11] fstests: test UUID consistency for clones with metadata_uuid
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Btrfs and XFS use the metadata_uuid superblock feature to change the
on-disk UUID without rewriting every block header. This patch adds a
sanity check to ensure UUID consistency when a filesystem with
metadata_uuid enabled is cloned.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/806     | 74 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/806.out |  6 ++++
 2 files changed, 80 insertions(+)
 create mode 100644 tests/generic/806
 create mode 100644 tests/generic/806.out

diff --git a/tests/generic/806 b/tests/generic/806
new file mode 100644
index 000000000000..6d3166491006
--- /dev/null
+++ b/tests/generic/806
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 806
+#
+# Verify that the cloned filesystem UUID remains consistent, even when the
+# `metadata_uuid` feature is enabled.
+#
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Create base loop device and its clone, applying the metadata_uuid tuning
+# callback to the base filesystem before the copy occurs.
+devs=()
+_loop_image_create_clone devs _change_metadata_uuid
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Get the uuid from the source device
+fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+# Mount both clone and baseline
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+# Cycle mounts and reverse the initialization order to ensure UUID tracking
+# doesn't mismatch or flip when metadata_uuid optimization is active.
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+status=0
+exit
diff --git a/tests/generic/806.out b/tests/generic/806.out
new file mode 100644
index 000000000000..918f422ecddf
--- /dev/null
+++ b/tests/generic/806.out
@@ -0,0 +1,6 @@
+QA output created by 806
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
+**** mount cycle ****
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
-- 
2.43.0


* [PATCH v7 10/11] fstests: add _change_metadata_uuid helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

_change_metadata_uuid changes the UUID of the golden filesystem before it
is cloned.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 common/rc | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/common/rc b/common/rc
index d95eec94f7b7..5cd4e025293b 100644
--- a/common/rc
+++ b/common/rc
@@ -1533,6 +1533,29 @@ _scratch_resvblks()
 	esac
 }
 
+# Change the metadata UUID of the given device to a newly generated one.
+# Args:
+#   $1: Block device path to modify.
+_change_metadata_uuid()
+{
+	local temp_mnt=$TEST_DIR/${seq}_mnt
+	local dev=$1
+
+	case $FSTYP in
+	xfs)
+		_require_command "$XFS_ADMIN_PROG" "xfs_admin"
+		$XFS_ADMIN_PROG -U generate $dev >> $seqres.full
+		;;
+	btrfs)
+		_require_command "$BTRFS_TUNE_PROG" "btrfstune"
+		$BTRFS_TUNE_PROG -m $dev
+		;;
+	*)
+		_notrun "Require filesystem with metadata_uuid feature"
+		;;
+	esac
+}
+
 # Create a small loop image, run an optional tuning function ($2) on it,
 # clone it, and attach both to loop devices, returned in ($1).
 # Args:
-- 
2.43.0


* [PATCH v7 09/11] fstests: verify exportfs file handles on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Ensure that exportfs can correctly decode file handles on a cloned
filesystem across a mount cycle, i.e. that file handles generated on a
cloned device remain valid after the mount cycle.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/805     | 80 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/805.out |  2 ++
 2 files changed, 82 insertions(+)
 create mode 100644 tests/generic/805
 create mode 100644 tests/generic/805.out

diff --git a/tests/generic/805 b/tests/generic/805
new file mode 100644
index 000000000000..5827eee039df
--- /dev/null
+++ b/tests/generic/805
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test No. 805
+# Verify that file handles encoded on a cloned filesystem remain valid and
+# resolvable via open_by_handle across a mount cycle and mount order swap.
+
+. ./common/preamble
+
+_begin_fstest auto quick exportfs clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_exportfs
+_require_loop
+_require_test_program "open_by_handle"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	_unmount $mnt1 2>/dev/null
+	_unmount $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Create test dir and test files, encode file handles and store to tmp file
+create_test_files()
+{
+	rm -rf $testdir
+	mkdir -p $testdir
+	$here/src/open_by_handle -cwp -o $tmp.handles_file $testdir $NUMFILES
+}
+
+# Attempt to read and decode the saved file handles on the targeted mount point.
+test_file_handles()
+{
+	local opt=$1
+	local when=$2
+
+	echo test_file_handles after $when
+	$here/src/open_by_handle $opt -i $tmp.handles_file $mnt2 $NUMFILES
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+NUMFILES=1
+testdir=$mnt2/testdir
+
+# Create test files and encode their file handles before the mount cycle
+create_test_files
+
+# Cycle mounts and reverse initialization sequence to check if
+# file handle lookups are okay
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+# Verify file handles can still be resolved post-mount-cycle
+test_file_handles -rp "cycle mount"
+
+status=0
+exit
diff --git a/tests/generic/805.out b/tests/generic/805.out
new file mode 100644
index 000000000000..29b11ec77ffb
--- /dev/null
+++ b/tests/generic/805.out
@@ -0,0 +1,2 @@
+QA output created by 805
+test_file_handles after cycle mount
-- 
2.43.0


* [PATCH v7 08/11] fstests: verify IMA isolation on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Add a test case to verify IMA measurement isolation when multiple devices
share the same FSUUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/804     | 108 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/804.out |  10 ++++
 2 files changed, 118 insertions(+)
 create mode 100644 tests/generic/804
 create mode 100644 tests/generic/804.out

diff --git a/tests/generic/804 b/tests/generic/804
new file mode 100644
index 000000000000..ced32e6d79dd
--- /dev/null
+++ b/tests/generic/804
@@ -0,0 +1,108 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 804
+# Verify IMA isolation on cloned filesystems:
+# . Mount two devices sharing the same FSUUID (cloned).
+# . Apply an IMA policy to measure files based on that FSUUID.
+# . Create unique files on each mount point to trigger measurements.
+# . Confirm the IMA log correctly attributes events to the respective mounts.
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	_unmount $mnt1 2>/dev/null
+	_unmount $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize device names and mount points
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|$mnt1|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|$mnt2|MNT2|g" | _filter_spaces
+}
+
+# Core helper to set IMA policy and check measurement logs
+do_ima()
+{
+	local ima_policy="/sys/kernel/security/ima/policy"
+	local ima_log="/sys/kernel/security/ima/ascii_runtime_measurements"
+	local fsuuid
+	local mnt=$1
+	local enable=$2
+
+	# Since the in-memory IMA audit log is only cleared upon reboot,
+	# use unique random filenames to avoid log collisions.
+	local foofile=$(mktemp --dry-run foobar_XXXXX)
+
+	echo $mnt $enable | filter_pool
+
+	[ -w "$ima_policy" ] || _notrun "IMA policy not writable"
+
+	fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+	# Load IMA policy to measure file access specifically for this
+	# filesystem UUID.
+	if [[ $enable -eq 1 ]]; then
+		echo "measure func=FILE_CHECK fsuuid=$fsuuid" > "$ima_policy" || \
+			_notrun "Policy rejected"
+	fi
+
+	# Create a file to trigger measurement and verify its entry in
+	# the IMA log.
+	echo "test_data" > $mnt/$foofile
+
+	# IMA log extract
+	grep $foofile "$ima_log" | awk '{ print $5 }' | filter_pool | \
+						sed "s/$foofile/FOOBAR_FILE/"
+
+	echo "dbg: $mnt $fsuuid $foofile" >> $seqres.full
+	cat $ima_log | tail -1 >> $seqres.full
+	echo >> $seqres.full
+}
+
+# Initialize loop base and cloned instances
+devs=()
+_loop_image_create_clone devs
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Concurrently mount both clones
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+# Check the IMA response on the baseline and on the clone
+do_ima $mnt1 1
+do_ima $mnt2 0
+
+# Cycle mount on the second device.
+echo mount cycle
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || _fail "Failed to mount dev2"
+
+do_ima $mnt1 0
+do_ima $mnt2 0
+
+status=0
+exit
diff --git a/tests/generic/804.out b/tests/generic/804.out
new file mode 100644
index 000000000000..9804181d6c17
--- /dev/null
+++ b/tests/generic/804.out
@@ -0,0 +1,10 @@
+QA output created by 804
+MNT1 1
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
+mount cycle
+MNT1 0
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
-- 
2.43.0


* [PATCH v7 07/11] fstests: verify libblkid resolution of duplicate UUIDs
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify how libblkid-based tools (findmnt, df) resolve device paths when
multiple block devices share the same FSUUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/803     | 72 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/803.out |  6 ++++
 2 files changed, 78 insertions(+)
 create mode 100644 tests/generic/803
 create mode 100644 tests/generic/803.out
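
For reviewers, a minimal sketch of how the filter_pool() normalization in
this test turns a findmnt-style line into the golden output. The device
paths, mount points, UUID and sample line below are made up for
illustration, and _filter_spaces is replaced by a local stand-in that is
assumed to squeeze runs of spaces the way the fstests helper does.

```shell
#!/bin/bash
# Hypothetical device and mount point names, for illustration only.
devs=(/dev/loop0 /dev/loop1)
mnt1=/mnt/test/803/mnt1
mnt2=/mnt/test/803/mnt2
fsuuid=12345678-abcd-4ef0-9876-0123456789ab

# Stand-in for the fstests _filter_spaces helper (assumption: it squeezes
# runs of spaces into a single space).
_filter_spaces()
{
	sed -e 's/  */ /g'
}

# Same substitutions as filter_pool() in the test.
filter_pool()
{
	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
}

# Normalize a made-up findmnt-style line (not captured from a real run).
normalize_sample()
{
	echo "/dev/loop1   /mnt/test/803/mnt2   $fsuuid" | \
		sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
}

normalize_sample	# prints: DEV2 MNT2 FSUUID
```

If findmnt resolved the clone to the wrong source device, the line would
normalize to DEV1 instead of DEV2 and the golden output would not match.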

diff --git a/tests/generic/803 b/tests/generic/803
new file mode 100644
index 000000000000..77901592366c
--- /dev/null
+++ b/tests/generic/803
@@ -0,0 +1,72 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 803
+# Check that mountinfo-based findmnt resolves both devices to the common
+# UUID reported by blkid (libblkid).
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize pool device and mount point names
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Get the uuid from the source device
+fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+# Btrfs assigned a random uuid for the clone fs before the fix.
+# Cycle mounts and reverse the initialization (source and clone fs) order.
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+status=0
+exit
diff --git a/tests/generic/803.out b/tests/generic/803.out
new file mode 100644
index 000000000000..3a130c662430
--- /dev/null
+++ b/tests/generic/803.out
@@ -0,0 +1,6 @@
+QA output created by 803
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
+**** mount cycle ****
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
-- 
2.43.0



* [PATCH v7 06/11] fstests: verify f_fsid for cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify that the cloned filesystem provides an f_fsid that is persistent
across mount cycles, yet unique from the original filesystem's f_fsid.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/802     | 64 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/802.out |  4 +++
 2 files changed, 68 insertions(+)
 create mode 100644 tests/generic/802
 create mode 100644 tests/generic/802.out
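
As a rough illustration of the comparison idiom used at the end of this
test: the post-cycle value is piped through sed so that it collapses to a
fixed token only when it equals the baseline. check_stable is a
hypothetical helper name used only for this sketch, and the f_fsid strings
are made up.

```shell
#!/bin/bash
# Replace the current value with a fixed token only when it equals the
# baseline. check_stable is a hypothetical name, not an fstests helper.
check_stable()
{
	local baseline=$1 current=$2 token=$3

	echo "$current" | sed -e "s/$baseline/$token/g"
}

# Made-up f_fsid strings, not taken from a real filesystem.
check_stable 53a1f00d0000fe01 53a1f00d0000fe01 FSID_SCRATCH	# prints FSID_SCRATCH
check_stable 53a1f00d0000fe02 7b22c4e90000fe02 FSID_CLONE	# prints 7b22c4e90000fe02
```

An unchanged f_fsid prints the token that the golden output expects, while
a changed one leaks the raw value and makes the test fail.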

diff --git a/tests/generic/802 b/tests/generic/802
new file mode 100644
index 000000000000..910807c11584
--- /dev/null
+++ b/tests/generic/802
@@ -0,0 +1,64 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 802
+# Check that the cloned filesystem provides an f_fsid that is persistent
+# across mount cycles if the block device maj:min remains unchanged.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both filesystems simultaneously using mandatory clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+# Capture baseline filesystem IDs for comparison
+fsid_scratch=$(stat -f -c "%i" $mnt1)
+fsid_clone=$(stat -f -c "%i" $mnt2)
+
+# Verify that the fsids remain stable after a mount cycle, even when the
+# mount order is reversed.
+echo "**** fsid after mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+# Compare post mount-cycle values against the baseline
+stat -f -c "%i" $mnt1 | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
+stat -f -c "%i" $mnt2 | sed -e "s/$fsid_clone/FSID_CLONE/g"
+
+status=0
+exit
diff --git a/tests/generic/802.out b/tests/generic/802.out
new file mode 100644
index 000000000000..0202a9a2c108
--- /dev/null
+++ b/tests/generic/802.out
@@ -0,0 +1,4 @@
+QA output created by 802
+**** fsid after mount cycle ****
+FSID_SCRATCH
+FSID_CLONE
-- 
2.43.0



* [PATCH v7 05/11] fstests: verify fanotify isolation on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify that fanotify events are correctly routed to the appropriate
watcher when cloned filesystems are mounted.
This helps verify that the kernel's event notification distinguishes
between devices sharing the same FSID/UUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/801     | 135 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/801.out |   7 +++
 2 files changed, 142 insertions(+)
 create mode 100644 tests/generic/801
 create mode 100644 tests/generic/801.out
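
A worked example of fsid_to_fid_parts() from this test, run standalone. The
function body is copied from the patch; the two input values are made up
and only show how the padding and the split into two 32-bit halves behave.

```shell
#!/bin/bash
# Copied from the test: convert an f_fsid hex string into hi.lo form.
fsid_to_fid_parts()
{
	local fsid=$1
	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
	local padded=$(printf '%016x' "0x${fsid}")
	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
	echo "${hi}.${lo}"
}

# Made-up inputs: a 12-digit value that needs padding, and a full
# 16-digit value.
fsid_to_fid_parts fe0100000803		# prints fe01.803
fsid_to_fid_parts 1234abcd0000fe01	# prints 1234abcd.fe01
```

The first input is padded to 0000fe0100000803 before being split, which is
why leading zeros are dropped independently from each half.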

diff --git a/tests/generic/801 b/tests/generic/801
new file mode 100644
index 000000000000..3bfb87d41922
--- /dev/null
+++ b/tests/generic/801
@@ -0,0 +1,135 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 801
+# Verify fanotify FID functionality on cloned filesystems by setting up
+# watchers and making sure notifications land in the correct log files.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+_require_command "$FSNOTIFYWAIT_PROG" fsnotifywait
+_require_unique_f_fsid
+
+_cleanup()
+{
+	cd /
+	[[ -n $pid1 ]] && { kill -TERM "$pid1" 2> /dev/null; wait $pid1; }
+	[[ -n $pid2 ]] && { kill -TERM "$pid2" 2> /dev/null; wait $pid2; }
+
+	if [ "$semanage_added" = "yes" ]; then
+		semanage permissive -d unconfined_t >/dev/null 2>&1 || true
+	fi
+
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+	rm -r -f $tmp.*
+}
+
+# Run fsnotifywait in unbuffered mode to watch filesystem-wide create events
+monitor_fanotify()
+{
+	local mmnt=$1
+	exec stdbuf -oL $FSNOTIFYWAIT_PROG -m -F -S -e create "$mmnt" 2>&1
+}
+
+# Transform f_fsid into the hi.lo format used in fanotify FID logs
+fsid_to_fid_parts()
+{
+	local fsid=$1
+	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
+	local padded=$(printf '%016x' "0x${fsid}")
+	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
+	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
+	echo "${hi}.${lo}"
+}
+
+# Create base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both base and clone filesystems using required clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+# Fetch filesystem IDs to verify the kernel can differentiate between them
+fsid1=$(stat -f -c "%i" $mnt1)
+fsid2=$(stat -f -c "%i" $mnt2)
+
+log1=$tmp.fanotify1
+log2=$tmp.fanotify2
+
+pid1=""
+pid2=""
+echo "Setup FID fanotify watchers on both mnt1 and mnt2"
+
+# Make the unconfined_t domain permissive when SELinux is enforcing so that
+# fanotify is not blocked
+semanage_added="no"
+if [ "$(getenforce 2>/dev/null)" = "Enforcing" ]; then
+	if ! semanage permissive -l | grep -q "unconfined_t"; then
+		semanage permissive -a unconfined_t >/dev/null 2>&1 && semanage_added="yes"
+	fi
+fi
+
+# Start asynchronous fanotify monitors
+( monitor_fanotify "$mnt1" > "$log1" ) &
+pid1=$!
+( monitor_fanotify "$mnt2" > "$log2" ) &
+pid2=$!
+sleep 2
+
+echo "Trigger file creation on mnt1"
+touch $mnt1/file_on_mnt1
+sync
+sleep 1
+
+echo "Trigger file creation on mnt2"
+touch $mnt2/file_on_mnt2
+sync
+sleep 1
+
+echo "Verify fsid in the fanotify"
+kill $pid1 $pid2
+wait $pid1 $pid2 2>/dev/null
+pid1=""
+pid2=""
+
+e_fsid1=$(fsid_to_fid_parts "$fsid1")
+e_fsid2=$(fsid_to_fid_parts "$fsid2")
+
+# Dump debug details to the full log
+echo $fsid1 $e_fsid1 $fsid2 $e_fsid2 >> $seqres.full
+cat $log1 >> $seqres.full
+cat $log2 >> $seqres.full
+
+# Ensure monitor 1 only captured events belonging to mnt 1 and fsid 1
+if grep -qF "$e_fsid1" "$log1" && ! grep -qF "$e_fsid2" "$log1"; then
+	echo "SUCCESS: mnt1 events found"
+else
+	[ ! -s "$log1" ] && echo "  - mnt1 received no events."
+	grep -qF "$e_fsid2" "$log1" && echo "  - mnt1 received event from mnt2."
+fi
+
+# Ensure monitor 2 only captured events belonging to mnt 2 and fsid 2
+if grep -qF "$e_fsid2" "$log2" && ! grep -qF "$e_fsid1" "$log2"; then
+	echo "SUCCESS: mnt2 events found"
+else
+	[ ! -s "$log2" ] && echo "  - mnt2 received no events."
+	grep -qF "$e_fsid1" "$log2" && echo "  - mnt2 received event from mnt1."
+fi
+
+status=0
+exit
diff --git a/tests/generic/801.out b/tests/generic/801.out
new file mode 100644
index 000000000000..d7b318d9f27c
--- /dev/null
+++ b/tests/generic/801.out
@@ -0,0 +1,7 @@
+QA output created by 801
+Setup FID fanotify watchers on both mnt1 and mnt2
+Trigger file creation on mnt1
+Trigger file creation on mnt2
+Verify fsid in the fanotify
+SUCCESS: mnt1 events found
+SUCCESS: mnt2 events found
-- 
2.43.0



* [PATCH v7 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Add a helper to check if the target filesystem supports unique f_fsid
tracking across cloned or snapshot instances.

Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
identifiers per filesystem instance. However, Ext4 derives its f_fsid
directly from its superblock UUID, which leads to identical f_fsid
values on cloned images until the UUID is manually modified by userspace.

Introduce _require_unique_f_fsid() to allow test cases requiring strict
f_fsid uniqueness to skip gracefully on unsupported filesystems.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 common/rc | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/common/rc b/common/rc
index 968ba33686f3..d95eec94f7b7 100644
--- a/common/rc
+++ b/common/rc
@@ -6310,6 +6310,27 @@ _require_fanotify_ioerrors()
 	_notrun "$FSTYP does not support fanotify ioerrors"
 }
 
+# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
+# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
+# and F2FS ensure f_fsid remains unique per filesystem instance (often by
+# deriving it from the UUID and the underlying block device).
+#
+# Across all filesystems, a UUID collision causes libblkid tools to return
+# non-deterministic device mappings. It is ultimately the responsibility
+# of the userspace utility or use-case to enforce uniqueness when a clone
+# diverges. For details, see the mailing list discussion:
+#   Link: https://lore.kernel.org/linux-ext4/20260409131238.GC18443@macsyma-wired.lan/
+_require_unique_f_fsid()
+{
+	# Skip the test if the filesystem does not enforce unique f_fsids
+	# natively. Checking this dynamically requires recreating a clone
+	# layout, so we use a static lookup based on FSTYP.
+	if [ "$FSTYP" == "ext4" ]; then
+		_notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
+	fi
+}
+
+
 # Computes a percentage of the available space in a filesystem and
 # returns that quantity in MB. The percentage must not contain a percent
 # sign ("%").
-- 
2.43.0



* [PATCH v7 03/11] fstests: add FSNOTIFYWAIT_PROG
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Define `FSNOTIFYWAIT_PROG` for an upcoming test case that uses `fsnotifywait`.

Signed-off-by: Anand Jain <asj@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/common/config b/common/config
index d5299d5b926f..5661fa0ec310 100644
--- a/common/config
+++ b/common/config
@@ -242,6 +242,7 @@ export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
 export PARTED_PROG="$(type -P parted)"
 export XFS_PROPERTY_PROG="$(type -P xfs_property)"
 export FSCRYPTCTL_PROG="$(type -P fscryptctl)"
+export FSNOTIFYWAIT_PROG="$(type -P fsnotifywait)"
 
 # udev wait functions.
 #
-- 
2.43.0



* [PATCH v7 02/11] fstests: add _clone_mount_option() helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Add a _clone_mount_option() helper to handle filesystem-specific
requirements for mounting cloned devices. It abstracts the need for
-o nouuid on XFS.

Signed-off-by: Anand Jain <asj@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc | 13 +++++++++++++
 1 file changed, 13 insertions(+)
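
A small standalone sketch of what the helper is expected to emit per
filesystem type. The function mirrors the one added below; the loop and the
DEV/MNT placeholders are illustrative only and are not part of the patch.

```shell
#!/bin/bash
# Mirrors the helper added by this patch: emit the extra mount options
# needed to mount a clone, or nothing when none are required.
_clone_mount_option()
{
	case "$FSTYP" in
	xfs)
		# Allow mounting a duplicate filesystem on the same host
		echo "-o nouuid"
		;;
	*)
		;;
	esac
}

# Illustration only: show the option string for a few FSTYP values.
for FSTYP in xfs ext4 btrfs; do
	echo "$FSTYP: [$(_clone_mount_option)]"
done
```

Only the xfs line shows a non-empty option string, so callers can always
splice $(_clone_mount_option) into the _mount command line unquoted.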

diff --git a/common/rc b/common/rc
index d7e3e0bdfb1e..968ba33686f3 100644
--- a/common/rc
+++ b/common/rc
@@ -414,6 +414,19 @@ _scratch_mount_options()
 					$SCRATCH_DEV $SCRATCH_MNT
 }
 
+# Return filesystem-specific mount options required for mounting clone/snapshot
+# devices.
+_clone_mount_option()
+{
+	case "$FSTYP" in
+	xfs)
+		# Allow mounting a duplicate filesystem on the same host
+		echo "-o nouuid"
+		;;
+	*)
+	esac
+}
+
 _supports_filetype()
 {
 	local dir=$1
-- 
2.43.0



