Linux EXT4 FS development

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Baokun Li @ 2026-06-09  9:43 UTC
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <06543725-5f38-4cac-9fdd-8a72cfb28f84@huaweicloud.com>

On 2026/6/9 10:48, Zhang Yi wrote:
> Hi, Baokun!

Hi Yi,

Thank you for your review!

> On 6/8/2026 7:11 PM, Baokun Li wrote:
>> The mke2fs man page documents:
>>
>>   Valid cluster-size values are from 2048 to 256M bytes per cluster.
> Hmm, I checked the mke2fs(8)[1] and didn't find this sentence, instead,
> it said:
>
>   Valid cluster-size values range from 2 to 32768 times the filesystem
>   block size and must be a power of 2.

Oh, I was indeed looking at a slightly older man page. It was changed to
the current description in commit 87cbb381f2e2 ("e2fsprogs:
misc/mke2fs.8.in: Correct valid cluster-size values").

That commit assumes the number of blocks in a cluster cannot exceed the
maximum number of blocks that a single extent can hold
(EXT_INIT_MAX_LEN = 32768). I believe this is wrong — a single cluster
can be covered by multiple extents. So we should fix the bug in
ext_falloc_helper() rather than working around it by adding a
restriction to the documentation.

The root cause is in ext_falloc_helper():

    max_uninit_len = EXT_UNINIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);
    max_init_len = EXT_INIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);

When cluster_ratio >= EXT_INIT_MAX_LEN (32768), the `& ~CLUSTER_MASK`
operation zeroes out both values, causing ext2fs_new_range() to receive
len=0 and return EXT2_ET_INVALID_ARGUMENT.
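
The masking step above can be sketched in isolation. This is only an
illustration, not the e2fsprogs code: INIT_MAX_LEN and UNINIT_MAX_LEN are
assumed to mirror EXT_INIT_MAX_LEN and EXT_UNINIT_MAX_LEN, and
cluster_aligned_max_len() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the e2fsprogs definitions; check the real headers. */
#define INIT_MAX_LEN   (1U << 15)          /* 32768 */
#define UNINIT_MAX_LEN (INIT_MAX_LEN - 1)  /* 32767 */

/*
 * Hypothetical helper: round max_len down to a multiple of cluster_ratio,
 * which must be a power of two. This is the "& ~CLUSTER_MASK" step.
 */
static uint32_t cluster_aligned_max_len(uint32_t max_len,
					uint32_t cluster_ratio)
{
	uint32_t cluster_mask = cluster_ratio - 1;

	assert(cluster_ratio != 0 && (cluster_ratio & cluster_mask) == 0);
	return max_len & ~cluster_mask;
}
```

With 4K blocks, a 256M cluster corresponds to a ratio of 65536, for which
both limits round down to 0. Under the definitions assumed here, a ratio of
exactly 32768 already zeroes the uninit limit.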

The kernel handles this correctly — ext4_ext_map_blocks() simply
truncates m_len to EXT_INIT_MAX_LEN and lets the caller loop.  On
subsequent iterations, get_implied_cluster_alloc() detects that the
requested block falls within an already-allocated cluster by examining
adjacent extents, and returns the physical address without allocating a
new cluster.

I'll send a patch to fix ext_falloc_helper() so that when
cluster_ratio >= EXT_INIT_MAX_LEN, the allocation loop can create
multiple extents within the same cluster — skipping ext2fs_new_range()
and claim_range() for mid-cluster iterations where the physical blocks
are already claimed.
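
A rough model of that loop, to show the intended shape rather than the
actual patch: struct alloc_plan and plan_range() are hypothetical names,
and the model only covers the case where the cluster ratio is at least the
maximum extent length. It counts how many extents are created and how often
a fresh cluster would have to be allocated.

```c
#include <assert.h>
#include <stdint.h>

struct alloc_plan {
	uint32_t extents;        /* extents created */
	uint32_t cluster_allocs; /* times a fresh cluster had to be allocated */
};

/* Hypothetical model of splitting nblocks into extents of <= max_len. */
static struct alloc_plan plan_range(uint64_t nblocks, uint32_t cluster_ratio,
				    uint32_t max_len)
{
	struct alloc_plan plan = { 0, 0 };
	uint64_t blk = 0;

	/* Model only the large-cluster case: powers of two, ratio >= max_len. */
	assert(max_len != 0 && (max_len & (max_len - 1)) == 0);
	assert((cluster_ratio & (cluster_ratio - 1)) == 0);
	assert(cluster_ratio >= max_len);

	while (blk < nblocks) {
		uint64_t chunk = nblocks - blk;

		if (chunk > max_len)
			chunk = max_len;
		/* Only an extent that starts a cluster needs a new allocation. */
		if (blk % cluster_ratio == 0)
			plan.cluster_allocs++;
		plan.extents++;
		blk += chunk;
	}
	return plan;
}
```

For one 65536-block cluster and a 32768-block extent limit, the model
yields two extents but a single cluster allocation; the second extent is
the mid-cluster iteration that would skip the allocator.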

>
> However, the implementation of mkfs does not seem to respect this
> constraint and instead directly uses static macros to limit the
> user-specified cluster size.
>
>   #define EXT2_MIN_BLOCK_LOG_SIZE         10      /* 1024 */
>   #define EXT2_MIN_CLUSTER_LOG_SIZE       EXT2_MIN_BLOCK_LOG_SIZE
>   #define EXT2_MAX_CLUSTER_LOG_SIZE       29      /* 512MB  */

I was going to set EXT2_MAX_CLUSTER_LOG_SIZE to 28 to align with the
kernel's EXT4_MAX_CLUSTER_LOG_SIZE.

>
> This is confusing, or perhaps I missed something. If I understand
> correctly, users can now format an image with a maximum cluster size of
> 512 MB (I tried it, and it worked).

Yes, a 512M cluster size also works correctly on the current kernel.
The upper limit on cluster size here is to avoid the 32-bit overflow issue
that Sashiko mentioned — theoretically the maximum could be 512M.

It's just that the old mke2fs documentation originally specified 256M as
the intended maximum, so both the kernel and e2fsprogs target 256M as the
upper limit. If anyone prefers 512M, I'm happy to change it.

>  If the kernel's
> EXT4_MAX_CLUSTER_LOG_SIZE is limited to 28, this would cause such
> existing images to become unmountable, even though I don't think images
> with such a large cluster size actually exist in practice. Therefore,
> I'm not sure this is safe.
>
> [1] https://man7.org/linux/man-pages/man8/mke2fs.8.html
>

I personally don't think such images are likely to exist in practice,
since formatting with the default 4K block size would already fail.

Thanks,
Baokun

* [PATCH] ext4: fix circular lock dependency in ext4_ext_migrate
From: Yun Zhou @ 2026-06-09  8:40 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
circular lock dependency between s_writepages_rwsem and sb_internal
(freeze protection).

The deadlock scenario:

  CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
  ----                           ----
  ext4_ext_migrate()
    ext4_writepages_down_write()
      s_writepages_rwsem (write)
                                 ext4_evict_inode()
                                   sb_start_intwrite()   [sb_internal]
                                   ...
                                     ext4_writepages()
                                       s_writepages_rwsem (read) [BLOCKED]
    iput(tmp_inode)
      ext4_evict_inode()
        sb_start_intwrite()         [BLOCKED]

The tmp_inode is a temporary inode with nlink=0 created solely for
building the extent tree.  Its eviction does not require
s_writepages_rwsem protection, so deferring iput() until after
releasing the rwsem is safe.
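
The ordering change can be modelled in userspace. This is a sketch with
hypothetical stand-ins (struct fake_sb, struct fake_inode, fake_iput(),
migrate_cleanup()), not kernel code: it only records whether the final
reference drop, and therefore eviction, happens while the rwsem is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the superblock rwsem and the temporary inode. */
struct fake_sb { bool writepages_held; };
struct fake_inode { int refs; bool evicted_under_lock; };

static void fake_iput(struct fake_sb *sb, struct fake_inode *inode)
{
	assert(inode->refs > 0);
	if (--inode->refs == 0) {
		/* Eviction would take freeze protection; record the lock state. */
		inode->evicted_under_lock = sb->writepages_held;
	}
}

/* Patched ordering: release the lock first, then drop the reference. */
static void migrate_cleanup(struct fake_sb *sb, struct fake_inode *tmp)
{
	sb->writepages_held = false;	/* models ext4_writepages_up_write() */
	if (tmp)
		fake_iput(sb, tmp);	/* eviction now runs without the rwsem */
}
```

The NULL check mirrors the error path, where no temporary inode exists and
only the unlock has to happen.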

Reported-by: syzbot+f0b58a1f5075a90dd9a5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f0b58a1f5075a90dd9a5
Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/migrate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 477d43d7e294..25368eb44e85 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -464,6 +464,7 @@ int ext4_ext_migrate(struct inode *inode)
 	if (IS_ERR(tmp_inode)) {
 		retval = PTR_ERR(tmp_inode);
 		ext4_journal_stop(handle);
+		tmp_inode = NULL;
 		goto out_unlock;
 	}
 	/*
@@ -591,9 +592,10 @@ int ext4_ext_migrate(struct inode *inode)
 	ext4_journal_stop(handle);
 out_tmp_inode:
 	unlock_new_inode(tmp_inode);
-	iput(tmp_inode);
 out_unlock:
 	ext4_writepages_up_write(inode->i_sb, alloc_ctx);
+	if (tmp_inode)
+		iput(tmp_inode);
 	return retval;
 }
 
-- 
2.43.0


* [syzbot] [ext4?] kernel BUG in ext4_write_inline_data (5)
From: syzbot @ 2026-06-09  8:13 UTC
  To: adilger.kernel, jack, libaokun, linux-ext4, linux-kernel, ojaswin,
	ritesh.list, syzkaller-bugs, tytso, yi.zhang

Hello,

syzbot found the following issue on:

HEAD commit:    9154c4af7829 Merge tag 'mmc-v7.1-rc3' of git://git.kernel...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=161d8f2e580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=bd38685893011045
dashboard link: https://syzkaller.appspot.com/bug?extid=53c8bd6136cfbe85197e
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d4d939a8f4d3/disk-9154c4af.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/6fd4009d5e57/vmlinux-9154c4af.xz
kernel image: https://storage.googleapis.com/syzbot-assets/d924d24cd0a4/bzImage-9154c4af.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+53c8bd6136cfbe85197e@syzkaller.appspotmail.com

------------[ cut here ]------------
kernel BUG at fs/ext4/inline.c:240!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6427 Comm: syz.2.110 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:ext4_write_inline_data+0x43c/0x440 fs/ext4/inline.c:240
Code: c1 38 c1 0f 8c 19 ff ff ff 48 89 df 49 89 d7 e8 7a 1f b0 ff 4c 89 fa e9 06 ff ff ff e8 3d f2 48 ff 90 0f 0b e8 35 f2 48 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
RSP: 0000:ffffc9000d34f808 EFLAGS: 00010293
RAX: ffffffff827b79eb RBX: 0000000000002cf3 RCX: ffff88803ec71ec0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88804391b8fa R08: 0000000000000000 R09: 0000000000000000
R10: dffffc0000000000 R11: ffffed100bc12881 R12: 000000000000003c
R13: ffffc9000d34f8e0 R14: 0000000000002cf2 R15: ffff88804391b1e8
FS:  00007f7639c7d6c0(0000) GS:ffff888126283000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000813 CR3: 000000003cc4a000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 ext4_write_inline_data_end+0x332/0xaf0 fs/ext4/inline.c:825
 generic_perform_write+0x5f7/0x8b0 mm/filemap.c:4346
 ext4_buffered_write_iter+0xd0/0x3a0 fs/ext4/file.c:316
 ext4_file_write_iter+0x299/0x1c10 fs/ext4/file.c:-1
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x629/0xba0 fs/read_write.c:688
 ksys_pwrite64 fs/read_write.c:795 [inline]
 __do_sys_pwrite64 fs/read_write.c:803 [inline]
 __se_sys_pwrite64 fs/read_write.c:800 [inline]
 __x64_sys_pwrite64+0x19c/0x230 fs/read_write.c:800
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f763ba4ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7639c7d028 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 00007f763bcc6090 RCX: 00007f763ba4ce59
RDX: 0000000000000001 RSI: 0000200000000140 RDI: 0000000000000004
RBP: 00007f763bae2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000002cf2 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f763bcc6128 R14: 00007f763bcc6090 R15: 00007ffe7dfe22b8
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:ext4_write_inline_data+0x43c/0x440 fs/ext4/inline.c:240
Code: c1 38 c1 0f 8c 19 ff ff ff 48 89 df 49 89 d7 e8 7a 1f b0 ff 4c 89 fa e9 06 ff ff ff e8 3d f2 48 ff 90 0f 0b e8 35 f2 48 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
RSP: 0000:ffffc9000d34f808 EFLAGS: 00010293
RAX: ffffffff827b79eb RBX: 0000000000002cf3 RCX: ffff88803ec71ec0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88804391b8fa R08: 0000000000000000 R09: 0000000000000000
R10: dffffc0000000000 R11: ffffed100bc12881 R12: 000000000000003c
R13: ffffc9000d34f8e0 R14: 0000000000002cf2 R15: ffff88804391b1e8
FS:  00007f7639c7d6c0(0000) GS:ffff888126283000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000813 CR3: 000000003cc4a000 CR4: 00000000003526f0


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

* [PATCH v4] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-09  6:20 UTC
  To: Theodore Ts'o, Andreas Dilger
  Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	sashiko-reviews, linux-ext4, linux-kernel,
	Aditya Prakash Srivastava, syzbot+0c89d865531d053abb2d

When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.

If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.

Additionally, two separate TOCTOU race conditions exist due to concurrent
ext4_page_mkwrite() execution:
1) A concurrent ext4_page_mkwrite() can execute ext4_convert_inline_data()
between write_begin and write_end, clearing the inline flags. Since block
buffers were not allocated in write_begin, this results in a NULL pointer
dereference in the write_end fallback paths because folio_buffers(folio) is
NULL.
2) If ext4_convert_inline_data() clears the flags exactly after the inline
flags checks pass in write_end, but before ext4_write_inline_data_end()
acquires the xattr semaphore, the subsequent check will hit a panic via
BUG_ON(!ext4_has_inline_data(inode)).

Fix these issues completely by:
1) Having write_end functions (ext4_write_end(),
ext4_journalled_write_end(), and ext4_da_do_write_end()) return 0
(VFS retry) if they fall through to the block fallback path and detect
that folio_buffers(folio) is NULL, after safely stopping any active
journal handle (protecting against a NULL handle panic in
ext4_put_nojournal()).
2) Replacing BUG_ON(!ext4_has_inline_data(inode)) inside
ext4_write_inline_data_end() with a graceful error path. If the inline flag
is cleared after locking the xattr, we unlock the xattr, release the iloc,
unlock/put the folio, stop the journal, and return 0 to trigger a retry.
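
The "return 0 to retry" contract can be sketched as a small userspace
model. All names here (write_end_fn, struct retry_stats, perform_write(),
struct inline_state, demo_write_end()) are hypothetical, and the model
folds the next write_begin into the hook for brevity; it is not the VFS
implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical write_end hook: returns bytes accepted, 0 means "retry". */
typedef size_t (*write_end_fn)(void *state, size_t copied);

struct retry_stats { unsigned attempts; size_t written; };

/* Model of the caller's loop: a 0 return commits nothing and retries. */
static struct retry_stats perform_write(write_end_fn write_end, void *state,
					size_t len, unsigned max_attempts)
{
	struct retry_stats st = { 0, 0 };

	assert(write_end != NULL);
	while (st.written < len && st.attempts < max_attempts) {
		size_t accepted;

		st.attempts++;
		accepted = write_end(state, len - st.written);
		/* accepted == 0: the same range is attempted again. */
		st.written += accepted;
	}
	return st;
}

struct inline_state { int buffers_attached; };

/* Example hook: first call finds no buffers and asks for a retry. */
static size_t demo_write_end(void *state, size_t copied)
{
	struct inline_state *s = state;

	if (!s->buffers_attached) {
		/* Stands in for the next write_begin attaching buffers. */
		s->buffers_attached = 1;
		return 0;
	}
	return copied;
}
```

In this model the first attempt is rejected and the second succeeds, which
is the behaviour the fix relies on once the inline flags are seen as
cleared.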

Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
v4:
  - Address critical TOCTOU race condition (reported by Sashiko AI review):
    * Scenario: A buffered write holds the folio lock and evaluates the inline
      flags checks in write_end to true. Before it enters or locks the xattr_sem
      in ext4_write_inline_data_end(), a concurrent memory-mapped page fault
      (ext4_page_mkwrite()) converts the inline data to an extent. This page fault
      bypasses the folio lock (since ext4_convert_inline_data() runs lockless),
      acquires the xattr_sem, and clears the inline flags. When the buffered write
      resumes and enters ext4_write_inline_data_end(), it acquires the xattr_sem
      and immediately triggers BUG_ON(!ext4_has_inline_data(inode)) causing a
      kernel panic.
    * Fix: Replace the BUG_ON() with a graceful error-handling retry path that
      releases all resources (locks/buffers/folios/journals) and returns 0.
v3:
  - Fix journal handle leak and NULL handle crash (reported by Sashiko AI review):
    * Scenario 1 (leak): During a delayed allocation write (ext4_da_write_begin),
      inline data was prepared and a transaction handle started. If a concurrent
      page fault converts the inline data before write_end, ext4_da_write_end()
      falls through to ext4_da_do_write_end(). If the fallback check for
      !folio_buffers(folio) returns 0 to retry without calling ext4_journal_stop(),
      the transaction handle is leaked open-ended, eventually hanging the filesystem.
    * Scenario 2 (crash): If we blindly call ext4_journal_stop() on a NULL handle
      (e.g., when no transaction was started because we never took the inline path),
      __ext4_journal_stop() delegates to ext4_put_nojournal(NULL) which triggers
      BUG_ON(ref_cnt == 0), panicking the kernel.
    * Fix: Retrieve the active handle in ext4_da_do_write_end() and stop it
      if non-NULL. Also explicitly check "if (handle)" before calling
      ext4_journal_stop() in ext4_write_end() and ext4_journalled_write_end().
v2:
  - Address TOCTOU race condition (reported by Sashiko AI review):
    * Scenario: A concurrent ext4_page_mkwrite() converts inline data to extents
      and clears the flags between ext4_write_begin() and write_end(). The
      write_end function falls through to the block fallback path. Since block
      buffers were not allocated in write_begin (because it took the inline path),
      folio_buffers(folio) is NULL, causing a NULL pointer dereference in
      ext4_journalled_zero_new_buffers() or ext4_walk_page_buffers(), or silent
      data loss in the standard write path.
    * Fix: Have the write_end functions return 0 if folio_buffers(folio) is NULL,
      triggering a safe VFS-level retry. On the next write attempt, the inline
      flags will be detected as cleared, and blocks/buffers will be properly allocated.
 fs/ext4/inline.c |  9 ++++++++-
 fs/ext4/inode.c  | 24 ++++++++++++++++++++++--
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..161136e84661 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -812,7 +812,14 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
 			goto out;
 		}
 		ext4_write_lock_xattr(inode, &no_expand);
-		BUG_ON(!ext4_has_inline_data(inode));
+		if (unlikely(!ext4_has_inline_data(inode))) {
+			ext4_write_unlock_xattr(inode, &no_expand);
+			brelse(iloc.bh);
+			folio_unlock(folio);
+			folio_put(folio);
+			ext4_journal_stop(handle);
+			return 0;
+		}
 
 		/*
 		 * ei->i_inline_off may have changed since
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..bc2688e03c19 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1455,6 +1455,14 @@ static int ext4_write_end(const struct kiocb *iocb,
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);
 
+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
+	}
+
 	copied = block_write_end(pos, len, copied, folio);
 	/*
 	 * it's important to update i_size while still holding folio lock:
@@ -1560,10 +1568,19 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
 
 	BUG_ON(!ext4_handle_valid(handle));
 
-	if (ext4_has_inline_data(inode))
+	if (ext4_has_inline_data(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);
 
+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
+	}
+
 	if (unlikely(copied < len) && !folio_test_uptodate(folio)) {
 		copied = 0;
 		ext4_journalled_zero_new_buffers(handle, inode, folio,
@@ -3231,7 +3248,10 @@ static int ext4_da_do_write_end(struct address_space *mapping,
 	if (unlikely(!folio_buffers(folio))) {
 		folio_unlock(folio);
 		folio_put(folio);
-		return -EIO;
+		handle = ext4_journal_current_handle();
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
 	}
 	/*
 	 * block_write_end() will mark the inode as dirty with I_DIRTY_PAGES
-- 
2.47.3


* [PATCH v3] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-09  4:59 UTC
  To: Theodore Ts'o, Andreas Dilger
  Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	sashiko-reviews, linux-ext4, linux-kernel,
	Aditya Prakash Srivastava, syzbot+0c89d865531d053abb2d

When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.

If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.

Additionally, a concurrent ext4_page_mkwrite() can execute
ext4_convert_inline_data() between ext4_write_begin() and ext4_write_end()
since it only takes xattr_sem. This concurrent path converts the inline
data to an extent and clears both the EXT4_INODE_INLINE_DATA and
EXT4_STATE_MAY_INLINE_DATA flags. Since the block buffers were not
allocated in ext4_write_begin(), falling through to the block fallback path
causes a NULL pointer dereference because folio_buffers(folio) is NULL.

Fix this by ensuring that the write_end functions (including
ext4_write_end() and ext4_da_do_write_end()) safely return 0 to trigger
a VFS retry if they detect folio_buffers(folio) is NULL after falling
through from the inline data check. Returning 0 signals
generic_perform_write() to revert the write iterator and retry the
entire write operation, which will then properly allocate block buffers.

To prevent transaction handle leaks and subsequent filesystem hangs during
retries, we stop any active journal handle via ext4_journal_stop() before
returning 0. We also explicitly check that the retrieved handle is non-NULL
prior to stopping it, preventing a potential NULL pointer crash inside
ext4_put_nojournal().

Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
v3:
  - Fix journal handle leak in ext4_da_do_write_end() by stopping the active
    handle before returning 0 on the fallback path.
  - Safely check if the retrieved handle is non-NULL before calling
    ext4_journal_stop() in ext4_write_end() and ext4_journalled_write_end()
    to prevent a kernel crash inside ext4_put_nojournal().
v2:
  - Address TOCTOU race condition (reported by Sashiko AI review):
    A concurrent ext4_page_mkwrite() can execute ext4_convert_inline_data()
    which takes xattr_sem and converts the inline data to extents, clearing
    both EXT4_INODE_INLINE_DATA and EXT4_STATE_MAY_INLINE_DATA flags between
    ext4_write_begin() and write_end().
  - Safely return 0 from write_end functions if folio_buffers(folio) is NULL
    to trigger a VFS retry. This prevents a NULL pointer dereference in
    ext4_journalled_zero_new_buffers() and ext4_walk_page_buffers().
  - Add check for ext4_write_end() and ext4_da_do_write_end().
 fs/ext4/inode.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..bc2688e03c19 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1455,6 +1455,14 @@ static int ext4_write_end(const struct kiocb *iocb,
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);

+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
+	}
+
 	copied = block_write_end(pos, len, copied, folio);
 	/*
 	 * it's important to update i_size while still holding folio lock:
@@ -1560,10 +1568,19 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,

 	BUG_ON(!ext4_handle_valid(handle));

-	if (ext4_has_inline_data(inode))
+	if (ext4_has_inline_data(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);

+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
+	}
+
 	if (unlikely(copied < len) && !folio_test_uptodate(folio)) {
 		copied = 0;
 		ext4_journalled_zero_new_buffers(handle, inode, folio,
@@ -3231,7 +3248,10 @@ static int ext4_da_do_write_end(struct address_space *mapping,
 	if (unlikely(!folio_buffers(folio))) {
 		folio_unlock(folio);
 		folio_put(folio);
-		return -EIO;
+		handle = ext4_journal_current_handle();
+		if (handle)
+			ext4_journal_stop(handle);
+		return 0;
 	}
 	/*
 	 * block_write_end() will mark the inode as dirty with I_DIRTY_PAGES
-- 
2.47.3

* [PATCH v2] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-09  4:17 UTC
  To: Theodore Ts'o, Andreas Dilger
  Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	sashiko-reviews, linux-ext4, linux-kernel,
	Aditya Prakash Srivastava, syzbot+0c89d865531d053abb2d

When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.

If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.

Additionally, a concurrent ext4_page_mkwrite() can execute
ext4_convert_inline_data() between ext4_write_begin() and ext4_write_end()
since it only takes xattr_sem. This concurrent path converts the inline
data to an extent and clears both the EXT4_INODE_INLINE_DATA and
EXT4_STATE_MAY_INLINE_DATA flags. Since the block buffers were not
allocated in ext4_write_begin(), falling through to the block fallback path
causes a NULL pointer dereference because folio_buffers(folio) is NULL.

Fix this by ensuring that the write_end functions (including
ext4_write_end() and ext4_da_do_write_end()) safely return 0 to trigger
a VFS retry if they detect folio_buffers(folio) is NULL after falling
through from the inline data check. Returning 0 signals
generic_perform_write() to revert the write iterator and retry the
entire write operation, which will then properly allocate block buffers.

Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
v2:
  - Address TOCTOU race condition (reported by Sashiko AI review):
    A concurrent ext4_page_mkwrite() can execute ext4_convert_inline_data()
    which takes xattr_sem and converts the inline data to extents, clearing
    both EXT4_INODE_INLINE_DATA and EXT4_STATE_MAY_INLINE_DATA flags between
    ext4_write_begin() and write_end().
  - Safely return 0 from write_end functions if folio_buffers(folio) is NULL
    to trigger a VFS retry. This prevents a NULL pointer dereference in
    ext4_journalled_zero_new_buffers() and ext4_walk_page_buffers().
  - Add check for ext4_write_end() and ext4_da_do_write_end().
 fs/ext4/inode.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..d6c934b2ae31 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1455,6 +1455,13 @@ static int ext4_write_end(const struct kiocb *iocb,
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);

+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ext4_journal_stop(handle);
+		return 0;
+	}
+
 	copied = block_write_end(pos, len, copied, folio);
 	/*
 	 * it's important to update i_size while still holding folio lock:
@@ -1560,10 +1567,18 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,

 	BUG_ON(!ext4_handle_valid(handle));

-	if (ext4_has_inline_data(inode))
+	if (ext4_has_inline_data(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);

+	if (unlikely(!folio_buffers(folio))) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ext4_journal_stop(handle);
+		return 0;
+	}
+
 	if (unlikely(copied < len) && !folio_test_uptodate(folio)) {
 		copied = 0;
 		ext4_journalled_zero_new_buffers(handle, inode, folio,
@@ -3231,7 +3246,7 @@ static int ext4_da_do_write_end(struct address_space *mapping,
 	if (unlikely(!folio_buffers(folio))) {
 		folio_unlock(folio);
 		folio_put(folio);
-		return -EIO;
+		return 0;
 	}
 	/*
 	 * block_write_end() will mark the inode as dirty with I_DIRTY_PAGES
-- 
2.47.3

* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:54 UTC
  To: Jan Kara
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <slupx7cfhq4uuk5bd3xnzu7ybnnqghzrseg3xjao6x7tbj75v3@jxv32j4em2cu>

On Mon, Jun 08, 2026 at 04:29:59PM +0200, Jan Kara wrote:
> You simplify this condition to:
> 
>   block_start < to || (!folio_uptodate && bh != head)
> 
> With this updated feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks Jan.  I've applied this simplification in v3 and added your
Reviewed-by tag to both patches.

* [PATCH v3 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:52 UTC
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu, stable
In-Reply-To: <20260609035202.90669-1-zhujia.zj@bytedance.com>

Ext4 buffered writes into large folios also pay a full buffer_head
walk in ext4_block_write_begin().  For a small overwrite of an existing
cached folio, the folio is already uptodate and the write only needs to
prepare the buffers through the written range.  Walking the suffix still
makes the write_begin cost proportional to the folio size.

Before ext4 enabled large folios for regular files, the same loop was
bounded by a single page of buffers.  Enabling large folios made the
existing full-folio walk visible as a regression for cached small
overwrites.

The suffix walk is needed for non-uptodate folios, where ext4 may have
to submit reads for partial blocks, preserve new-buffer cleanup, and run
error zeroing.  Keep those folios on the old full walk.

For already-uptodate folios, keep the walk starting at the first buffer
rather than seeking directly to from.  This preserves the existing prefix
buffer state handling.  Stop once block_start reaches the end of the
write range, because the skipped suffix would only repeat the
outside-range uptodate handling for buffers beyond @to.
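
The effect of the new loop condition can be checked with a userspace model
of the walk. This is a sketch, not the kernel loop: count_visited() is a
hypothetical helper, index 0 plays the role of head, and
(idx + 1) % nbufs plays the role of bh->b_this_page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count how many buffers the walk visits. "patched" selects the new
 * condition; otherwise the old "bh != head || !block_start" is used.
 */
static size_t count_visited(size_t nbufs, size_t blocksize, size_t to,
			    bool folio_uptodate, bool patched)
{
	size_t idx = 0, block_start = 0, visited = 0;

	assert(nbufs > 0 && to > 0 && to <= nbufs * blocksize);

	for (;;) {
		bool at_head = (idx == 0);
		bool keep_going = patched
			? (block_start < to || (!folio_uptodate && !at_head))
			: (!at_head || block_start == 0);

		if (!keep_going)
			break;
		visited++;
		block_start += blocksize;
		idx = (idx + 1) % nbufs;
	}
	return visited;
}
```

For a 2M folio with 4K buffers (512 of them) and a 1K overwrite at offset
0, the old condition visits all 512 buffers while the new one visits one
when the folio is uptodate, and still visits all 512 when it is not.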

On current master, the libMicro ext4 large-folio overwrite test shows
the following full-series result.  Results are median usecs/call over 10
runs, lower is better:

case        nofix     this series   improvement
write_u1k   1.418     0.3405        76.0%
write_u10k  1.887     0.4175        77.9%
pwrite_u1k  1.6775    0.3390        79.8%
pwrite_u10k 1.9035    0.4130        78.3%

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Cc: stable@vger.kernel.org # v6.16+
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/ext4/inode.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..0fccb8f6a2116 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1182,6 +1182,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	int nr_wait = 0;
 	int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
+	bool folio_uptodate = folio_test_uptodate(folio);

 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(to > folio_size(folio));
@@ -1193,13 +1194,13 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 		head = create_empty_buffers(folio, blocksize, 0);
 	block = EXT4_PG_TO_LBLK(inode, folio->index);

-	for (bh = head, block_start = 0; bh != head || !block_start;
+	for (bh = head, block_start = 0;
+	     block_start < to || (!folio_uptodate && bh != head);
 	    block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
 		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
+			if (folio_uptodate)
 				set_buffer_uptodate(bh);
-			}
 			continue;
 		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
@@ -1220,7 +1221,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				if (should_journal_data)
 					do_journal_get_write_access(handle,
 								    inode, bh);
-				if (folio_test_uptodate(folio)) {
+				if (folio_uptodate) {
 					/*
 					 * Unlike __block_write_begin() we leave
 					 * dirtying of new uptodate buffers to
@@ -1237,7 +1238,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				continue;
 			}
 		}
-		if (folio_test_uptodate(folio)) {
+		if (folio_uptodate) {
 			set_buffer_uptodate(bh);
 			continue;
 		}
-- 
2.20.1

^ permalink raw reply related

* [PATCH v3 1/2] fs/buffer: avoid tail commit walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu, stable
In-Reply-To: <20260609035202.90669-1-zhujia.zj@bytedance.com>

block_commit_write() always walks every buffer_head attached to the
folio.  That was cheap for order-0 folios, but large folios can contain
hundreds of buffer_heads.  For a small buffered overwrite of an
already-uptodate large folio, the commit work is therefore proportional
to the folio size rather than the copied range.

This became visible with ext4 regular-file large folios, where cached
small overwrites reach block_commit_write() through block_write_end().
Before ext4 enabled large folios for regular files, this path was only
hit with order-0 folios for normal ext4 buffered writes, so the full walk
was bounded.  The ext4 large-folio commit is therefore the regression
point for this generic helper cost.

The full walk is still needed when the folio is not uptodate, because
block_commit_write() uses per-buffer uptodate state to decide whether
the whole folio can be marked uptodate.  Keep those folios on the old
full-buffer path.

For a folio that was already uptodate on entry, the commit no longer
needs tail buffers for folio-uptodate discovery.  The copied range has
already been processed once block_start reaches @to, so stop there and
avoid the suffix walk.
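
The reasoning above can be sketched in userspace as follows.  This is a
simplified model, not the kernel function: struct commit_result,
commit_walk() and the buf_uptodate[] array are hypothetical stand-ins for
the buffer_head list and its per-buffer uptodate bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct commit_result {
	size_t visited;	/* buffers examined by the walk */
	bool partial;	/* saw a non-uptodate buffer outside [from, to) */
};

/*
 * Hypothetical model of the commit walk.  was_uptodate is the folio
 * state sampled on entry, as in the patch.
 */
static struct commit_result commit_walk(bool *buf_uptodate, size_t nr_bufs,
					size_t blocksize, size_t from,
					size_t to, bool was_uptodate)
{
	struct commit_result res = { 0, false };
	size_t block_start = 0;
	size_t i = 0;

	assert(nr_bufs > 0 && blocksize > 0);
	assert(from < to && to <= nr_bufs * blocksize);

	do {
		size_t block_end = block_start + blocksize;

		res.visited++;
		if (block_end <= from || block_start >= to) {
			if (!buf_uptodate[i])
				res.partial = true;
		} else {
			buf_uptodate[i] = true;	/* copied range is valid */
		}

		block_start = block_end;
		/* Tail break: only taken if the folio was uptodate on entry. */
		if (was_uptodate && block_start >= to)
			break;
		i++;
	} while (i < nr_bufs);

	return res;
}
```

For a folio that was not uptodate on entry, the model has to reach the
tail buffers to learn whether any of them is stale, which is why the
break is conditional on the entry state.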

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org # v6.16+
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..c8c41c799030d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
 	bool partial = false;
+	bool uptodate = folio_test_uptodate(folio);
 	unsigned blocksize;
 	struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			clear_buffer_new(bh);

 		block_start = block_end;
+		if (uptodate && block_start >= to)
+			break;
 		bh = bh->b_this_page;
 	} while (bh != head);

-- 
2.20.1

^ permalink raw reply related

* [PATCH v3 0/2] ext4: avoid tail walks for cached large-folio writes
From: Jia Zhu @ 2026-06-09  3:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Hi,

This series addresses a buffered-write regression we found during our
v6.12 -> v6.18 LTS upgrade testing on ext4.

The regression is in the remaining buffer_head path.  A small overwrite
of an already cached, uptodate large folio still walks every buffer_head
attached to the folio in both write_begin and write_end.  With order-0
folios this was bounded by the page size.  After ext4 enabled large
folios for regular files, the same loops became proportional to the
folio size.

I agree that converting ext4 buffered I/O to iomap is the right long-term
direction, and that would avoid this problem.  This series is meant as a
small fix for current and LTS kernels that still use the buffer_head path.

Patch 1 follows Willy's suggestion for block_commit_write(): if the folio
was already uptodate on entry, stop the commit walk once the copied range
has been processed.

Patch 2 applies the same conservative shape to ext4_block_write_begin().
It keeps walking from the first buffer, so prefix buffer state handling is
unchanged, and only skips the suffix for folios that were already
uptodate on entry.

The workload is from libMicro, which we use in kernel release testing:

  https://github.com/rzezeski/libMicro

The table below includes the v6.12 baseline from the same release
benchmark.  The v6.12 and v6.18 columns were run with THP=always.  The
last column is v6.18 with this series applied.  Results are usecs/call,
lower is better, and the improvement is relative to unpatched v6.18.

case           v6.12    v6.18   v6.18 + series   improvement
write_u1k      0.609    4.659       0.528           88.7%
write_u10k     1.408    4.869       0.809           83.4%
pwrite_u1k     0.609    4.659       0.538           88.5%
pwrite_u10k    1.399    4.889       0.819           83.2%
writev_u1k     2.238    5.277       1.179           77.7%
writev_u10k   11.057    8.029       4.219           47.5%

For the cases that regressed from v6.12 to v6.18 in this test, this
series brings the v6.18 numbers back below the v6.12 cost.
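
For reference, the improvement column can be reproduced from the other
columns with the sketch below.  improvement_pct() is a hypothetical
helper name; the formula is simply the relative reduction against
unpatched v6.18, as stated above.

```c
#include <assert.h>

/* Relative reduction in usecs/call, in percent, against the baseline. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For example, write_u1k gives (4.659 - 0.528) / 4.659, which rounds to the
88.7% shown in the table.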

Previous versions:
v2:
  https://lore.kernel.org/r/20260608120131.45146-1-zhujia.zj@bytedance.com
v1:
  https://lore.kernel.org/r/20260603134800.25155-1-zhujia.zj@bytedance.com

Changes since v2:
- simplify the ext4 loop condition as suggested by Jan;
- add Reviewed-by tags from Jan;
- add stable Cc tags.

Changes since v1:
- replace the ext4 seek-to-@from optimization with a conservative tail
  break that preserves prefix buffer handling;
- add the block_commit_write() tail break suggested by Willy;
- add v6.12 and v6.18 benchmark results for the full series.

Jia Zhu (2):
  fs/buffer: avoid tail commit walk for uptodate folios
  ext4: avoid tail write_begin walk for uptodate folios

 fs/buffer.c     |  3 +++
 fs/ext4/inode.c | 11 ++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

-- 
2.20.1

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Zhang Yi @ 2026-06-09  3:33 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-4-libaokun@linux.alibaba.com>

On 6/8/2026 7:11 PM, Baokun Li wrote:
> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
> division that computes s_itb_per_group truncates, reserving fewer blocks
> for the inode table than needed.
> 
> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
> compute a block offset beyond the inode table, reading unrelated data as
> an inode structure.
> 
> Add the missing divisibility check alongside the existing validation in
> ext4_block_group_meta_init().
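
A small userspace sketch of the truncation described above, using
hypothetical helper names and illustrative numbers (16 inodes per block,
24 inodes per group), not values taken from a real image:

```c
#include <assert.h>

/* Blocks reserved for the inode table: truncating division. */
static unsigned long itb_per_group(unsigned long inodes_per_group,
				   unsigned long inodes_per_block)
{
	assert(inodes_per_block != 0);
	return inodes_per_group / inodes_per_block;
}

/* Inode-table block, relative to the table start, holding an inode. */
static unsigned long itable_block_of(unsigned long index_in_group,
				     unsigned long inodes_per_block)
{
	assert(inodes_per_block != 0);
	return index_in_group / inodes_per_block;
}

/* The added divisibility condition, in isolation. */
static int inodes_per_group_ok(unsigned long inodes_per_group,
			       unsigned long inodes_per_block)
{
	assert(inodes_per_block != 0);
	return inodes_per_group % inodes_per_block == 0;
}
```

With 24 inodes per group and 16 per block, one table block is reserved,
yet the inode at index 23 maps to table block 1, which is past the end of
the table.  The value 24 passes the existing "& 7" check, so only the new
modulo check rejects it.
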
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good to me.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ddcb4a8d4db..5ec9e1ef00c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>  	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> -	    sbi->s_inodes_per_group & 7) {
> +	    sbi->s_inodes_per_group & 7 ||
> +	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;


^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Zhou, Yun @ 2026-06-09  3:12 UTC (permalink / raw)
  To: Hillf Danton, Andreas Dilger
  Cc: tytso, jack, Amir Goldstein, linux-ext4, linux-kernel
In-Reply-To: <20260609011522.1708-1-hdanton@sina.com>



On 6/9/26 09:15, Hillf Danton wrote:
> On Mon, 8 Jun 2026 12:20:07 -0600 Andreas Dilger wrote:
>> On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
>>> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
>>> belong to the same superblock as the original file.  Currently, this
>>> validation is performed inside ext4_move_extents() by
>>> mext_check_validity(), but only after lock_two_nondirectories() has
>>> already acquired the inode locks.  When the donor fd refers to a file
>>> on a different filesystem (e.g., overlayfs), this late validation
>>> creates a circular lock dependency:
>>>
>>>   CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
>>>   ----                              ----
>>>   inode_lock(ovl_inode)
>>>                                     mnt_want_write_file(filp)
>>>                                       sb_start_write(ext4_sb)   [sb_writers]
>>>     backing_file_write_iter()
>>>       vfs_iter_write(real_file)
>>>         file_start_write(real_file)
>>>           sb_start_write(ext4_sb)   [blocked by freeze]
>>>                                     lock_two_nondirectories()
>>>                                       inode_lock(ovl_inode)     [blocked]
> Why does it exist if the locking order on CPU0 is incorrect?
The locking order on CPU0 (overlayfs write path: inode_lock → sb_writers)
is correct and inherent to stacked filesystems. The actual bug is that
CPU1's execution path should never exist: the ext4 ioctl should not be
locking an overlayfs inode in the first place. This happens because the
donor fd passed from userspace is not validated before being used in
lock_two_nondirectories().
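
A minimal userspace sketch of the "validate before locking" ordering,
assuming hypothetical stand-in types (struct sb_stub, struct inode_stub)
and a hypothetical helper name.  The -EINVAL return value is also an
assumption made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects involved. */
struct sb_stub {
	int id;
};

struct inode_stub {
	const struct sb_stub *sb;
	int locked;
};

/*
 * Refuse a donor from another superblock before any inode lock is
 * taken, so a foreign (e.g. overlayfs) inode is never locked here.
 */
static int move_ext_lock_pair(struct inode_stub *orig,
			      struct inode_stub *donor)
{
	assert(orig != NULL && donor != NULL);

	if (orig->sb != donor->sb)
		return -EINVAL;	/* assumed errno; no lock was touched */

	orig->locked = 1;
	donor->locked = 1;
	return 0;
}
```

The point of the ordering is that the failure path returns with neither
inode locked, so the cross-filesystem lock dependency cannot form.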

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Zhang Yi @ 2026-06-09  2:48 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-3-libaokun@linux.alibaba.com>

Hi, Baokun!

On 6/8/2026 7:11 PM, Baokun Li wrote:
> The mke2fs man page documents:
> 
>   Valid cluster-size values are from 2048 to 256M bytes per cluster.

Hmm, I checked mke2fs(8)[1] and didn't find this sentence. Instead, it
says:

  Valid cluster-size values range from 2 to 32768 times the filesystem
  block size and must be a power of 2.

However, the implementation of mkfs does not seem to respect this
constraint and instead directly uses static macros to limit the
user-specified cluster size.

  #define EXT2_MIN_BLOCK_LOG_SIZE         10      /* 1024 */
  #define EXT2_MIN_CLUSTER_LOG_SIZE       EXT2_MIN_BLOCK_LOG_SIZE
  #define EXT2_MAX_CLUSTER_LOG_SIZE       29      /* 512MB  */

This is confusing, or perhaps I missed something. If I understand
correctly, users can now format an image with a maximum cluster size of
512 MB (I tried it, and it worked). If the kernel's
EXT4_MAX_CLUSTER_LOG_SIZE is limited to 28, this would cause such
existing images to become unmountable, even though I don't think images
with such a large cluster size actually exist in practice. Therefore,
I'm not sure this is safe.

[1] https://man7.org/linux/man-pages/man8/mke2fs.8.html

> 
> but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
> filesystem images to specify cluster sizes up to 1GB.
> 
> On 32-bit systems with bigalloc enabled, the consistency check in
> ext4_handle_clustersize():
> 
>   s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)
> 
> can overflow when the cluster ratio is large enough. Since
> s_blocks_per_group is not range-checked in the bigalloc path, the
> wrapped product can pass the consistency check, leading to inconsistent
> group geometry and potential out-of-bounds block allocation.
> 
> Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
> limit. With this cap, the maximum product is:
> 
>   (blocksize * 8) * (256M / blocksize) = 2^31
> 
> which fits safely in a 32-bit unsigned long for all block sizes.
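
The arithmetic above can be checked with a small sketch that does the
multiplication in 32 bits.  max_blocks_per_group_32() is a hypothetical
helper, and it assumes s_clusters_per_group is capped at blocksize * 8
(one bitmap block), as the quoted product does:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest s_clusters_per_group * (clustersize / blocksize) for the given
 * log2 sizes, computed modulo 2^32 like a 32-bit unsigned long.
 */
static uint32_t max_blocks_per_group_32(unsigned int block_log,
					unsigned int cluster_log)
{
	uint32_t clusters_per_group, cluster_ratio;

	assert(block_log >= 10 && block_log <= 16);
	assert(cluster_log >= block_log && cluster_log <= 30);

	clusters_per_group = (uint32_t)8 << block_log;
	cluster_ratio = (uint32_t)1 << (cluster_log - block_log);

	return clusters_per_group * cluster_ratio;	/* may wrap */
}
```

In this model a cluster log size of 28 yields 2^31 for every block size,
while 29 and 30 both wrap to 0, so a 512MB cluster already overflows the
32-bit product.
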
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
> ---
>  fs/ext4/ext4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..11e41a864db8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -334,7 +334,7 @@ struct ext4_io_submit {
>  #define	EXT4_MAX_BLOCK_SIZE		65536
>  #define EXT4_MIN_BLOCK_LOG_SIZE		10
>  #define EXT4_MAX_BLOCK_LOG_SIZE		16
> -#define EXT4_MAX_CLUSTER_LOG_SIZE	30
> +#define EXT4_MAX_CLUSTER_LOG_SIZE	28
>  #ifdef __KERNEL__
>  # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
>  #else


^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Baokun Li @ 2026-06-09  1:46 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <097CFB85-3C4A-4BDF-829C-E6B2322386EC@dilger.ca>

On 2026/6/9 02:27, Andreas Dilger wrote:
> On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
>> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
>> division that computes s_itb_per_group truncates, reserving fewer blocks
>> for the inode table than needed.
>>
>> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
>> compute a block offset beyond the inode table, reading unrelated data as
>> an inode structure.
>>
>> Add the missing divisibility check alongside the existing validation in
>> ext4_block_group_meta_init().
>>
>> Reported-by: Sashiko <sashiko-bot@kernel.org>
>> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
>> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Hi Andreas,

Thank you for your review!

> Looks good.  Is this also fixed/checked in e2fsprogs?  
Yes, a similar patchset adding analogous checks to e2fsprogs is also in
the works.
> There is really no reason *not* to use all the space in the last itable block for inodes.
>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Indeed. Images created by mke2fs do not exhibit this issue; it only
occurs with certain crafted, malformed images.

Thanks,
Baokun


^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Hillf Danton @ 2026-06-09  1:15 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: tytso, jack, Yun Zhou, Amir Goldstein, linux-ext4, linux-kernel
In-Reply-To: <5D214F11-047A-4414-B9BC-8ED669817145@dilger.ca>

On Mon, 8 Jun 2026 12:20:07 -0600 Andreas Dilger wrote:
> On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
> > 
> > Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> > belong to the same superblock as the original file.  Currently, this
> > validation is performed inside ext4_move_extents() by
> > mext_check_validity(), but only after lock_two_nondirectories() has
> > already acquired the inode locks.  When the donor fd refers to a file
> > on a different filesystem (e.g., overlayfs), this late validation
> > creates a circular lock dependency:
> > 
> >  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
> >  ----                              ----
> >  inode_lock(ovl_inode)
> >                                    mnt_want_write_file(filp)
> >                                      sb_start_write(ext4_sb)   [sb_writers]
> >    backing_file_write_iter()
> >      vfs_iter_write(real_file)
> >        file_start_write(real_file)
> >          sb_start_write(ext4_sb)   [blocked by freeze]
> >                                    lock_two_nondirectories()
> >                                      inode_lock(ovl_inode)     [blocked]
>
Why does it exist if the locking order on CPU0 is incorrect?

^ permalink raw reply

* [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-09  1:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
inline buffer for a crafted or corrupted inline directory, triggering a
slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
  Read of size 8 at addr ffff88800fd3da3c by task exploit/146
   ...
   kasan_report (mm/kasan/report.c:595)
   filldir64 (fs/readdir.c:371)
   iterate_dir (fs/readdir.c:110)
   ...

The payload is copied into a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in a logical position space extra_offset bytes larger
than the buffer (extra_size = extra_offset + inline_size), so the synthetic
"." and ".." entries land at the offsets they would have in a block-based
directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
the loop bounds and the ext4_check_dir_entry() length argument are all
expressed in the larger extra_size. Two reachable sites dereference a
dirent before confirming its physical offset is inside the allocation:

In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
running up to extra_offset bytes past the allocation before its length
check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
below extra_offset makes "ctx->pos - extra_offset" negative, so a check
that only bounds the top of the buffer is bypassed by underflow and de is
formed before dir_buf.

In the cookie-rescan loop, entered when i_version changed since the last
readdir(2), the walk restarts from the beginning with i bounded by
extra_size, so as i approaches extra_size the unconditional read of
de->rec_len runs past the allocation before any validation.

Both are the same defect, logical extra_size space versus the physical
inline_size buffer. In each loop, reject a dirent whose header would not
fit within inline_size before forming de, and in the main loop also reject a
position that underflows below extra_offset. Validate the main-loop entry
against inline_size rather than extra_size. Entries that legitimately fill
the inline data still pass.
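
The two-sided check can be sketched in userspace as below.  The helper
name and the min_hdr parameter are hypothetical; min_hdr stands for the
result of ext4_dir_rec_len(1, NULL), and the sketch only covers real
entries, not the synthetic "." and ".." positions handled earlier in the
loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * pos is the logical, lseek()-controlled signed position; the dirent
 * would be formed at dir_buf + pos - extra_offset inside an allocation
 * of inline_size bytes.
 */
static bool dirent_header_in_bounds(int64_t pos, int64_t extra_offset,
				    int64_t inline_size, int64_t min_hdr)
{
	assert(extra_offset >= 0 && inline_size >= 0 && min_hdr > 0);

	if (pos < extra_offset)		/* would point before dir_buf */
		return false;
	if (inline_size < min_hdr)	/* no room for any header */
		return false;
	/* Rearranged so that no intermediate value can overflow. */
	return pos - extra_offset <= inline_size - min_hdr;
}
```

With illustrative values extra_offset = 20, inline_size = 60 and
min_hdr = 12, positions 20 through 68 are accepted, while 8 (below
extra_offset) and 69 (header would cross the end of the buffer) are both
rejected.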

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
v2: check both bounds

 fs/ext4/inline.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..bbe93f1b56b2 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}

+		/*
+		 * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
+		 * header must fit within inline_size.  ctx->pos is a signed,
+		 * lseek()-controlled loff_t: check the lower bound first, or
+		 * ctx->pos < extra_offset underflows and points de before dir_buf.
+		 */
+		if (ctx->pos < extra_offset ||
+		    ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH ext4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-08 21:09 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <20260608210713.1940288-1-xmei5@asu.edu>

On Mon, Jun 08, 2026 at 02:07:13PM -0700, Xiang Mei wrote:
> ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> inline buffer for a crafted or corrupted inline directory, triggering a
> slab-out-of-bounds read during getdents64():
> 
>   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
>   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
>    ...
>    kasan_report (mm/kasan/report.c:595)
>    filldir64 (fs/readdir.c:371)
>    iterate_dir (fs/readdir.c:110)
>    ...
> 
> The payload is copied into a buffer of exactly inline_size bytes:
> 
> 	dir_buf = kmalloc(inline_size, GFP_NOFS);
> 
> but iteration runs in an inflated logical position space, extra_offset
> bytes larger than the buffer (extra_size = extra_offset + inline_size),
> so the synthetic "." and ".." entries land at the offsets they would have
> in a block-based directory.  A real dirent is therefore formed at the
> physical address "dir_buf + pos - extra_offset", yet the loop bounds and
> the ext4_check_dir_entry() length argument are all expressed in the larger
> extra_size.  This mismatch lets two reachable sites dereference a dirent
> before confirming its physical offset is inside the allocation.
> 
> In the main loop, ctx->pos is attacker-controlled via lseek().  The entry
> is validated with extra_size instead of inline_size, so
> ext4_check_dir_entry() accepts rec_len/name_len running up to extra_offset
> bytes past the allocation, and dereferences de before its (too-large)
> length check fires.
> 
> In the cookie-rescan loop, entered when i_version changed since the last
> readdir(2) (or after an lseek resets the cookie), the walk restarts from
> the beginning with i bounded by extra_size.  As i approaches extra_size,
> "i - extra_offset" approaches inline_size, so the unconditional read of
> de->rec_len runs past the allocation before any validation.
> 
> Both are the same defect, logical extra_size space versus the physical
> inline_size buffer, so fix them together.  In each loop, reject a dirent
> whose minimum-size header would not fit within inline_size before forming
> and dereferencing de, and validate the main-loop entry against inline_size
> rather than extra_size.  extra_offset only inflates the logical position of
> "." and ".."; the dirent itself always lives in the kmalloc(inline_size)
> buffer.  Entries that legitimately fill the inline data still pass; only
> accesses that would fall outside the allocation are rejected.
> 
> Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> ---
>  fs/ext4/inline.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..0ec85cfcc859 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
>  			/* for other entry, the real offset in
>  			 * the buf has to be tuned accordingly.
>  			 */
> +			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
> +			    inline_size)
> +				break;
>  			de = (struct ext4_dir_entry_2 *)
>  				(dir_buf + i - extra_offset);
>  			/* It's too expensive to do a full
> @@ -1488,10 +1491,21 @@ int ext4_read_inline_dir(struct file *file,
>  			continue;
>  		}
>  
> +		/*
> +		 * ctx->pos can be set to an arbitrary value via lseek(), and
> +		 * the rescan above may also advance it.  Make sure the dirent
> +		 * header lies within the inline_size payload before
> +		 * dereferencing it: extra_offset only inflates the logical
> +		 * position of "." and "..", the dirent itself always lives in
> +		 * the kmalloc(inline_size) buffer.
> +		 */
> +		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> +		    inline_size)
> +			goto out;
>  		de = (struct ext4_dir_entry_2 *)
>  			(dir_buf + ctx->pos - extra_offset);
>  		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> -					 extra_size, ctx->pos))
> +					 inline_size, ctx->pos))
>  			goto out;
>  		if (le32_to_cpu(de->inode)) {
>  			if (!dir_emit(ctx, de->name, de->name_len,
> -- 
> 2.43.0
>

Thanks for your attention to this bug. We have a PoC that can trigger 
the KASAN crash. Please DM if you need it to reproduce the issue.

Xiang

^ permalink raw reply

* [PATCH ext4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-08 21:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
inline buffer for a crafted or corrupted inline directory, triggering a
slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
  Read of size 8 at addr ffff88800fd3da3c by task exploit/146
   ...
   kasan_report (mm/kasan/report.c:595)
   filldir64 (fs/readdir.c:371)
   iterate_dir (fs/readdir.c:110)
   ...

The payload is copied into a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in an inflated logical position space, extra_offset
bytes larger than the buffer (extra_size = extra_offset + inline_size),
so the synthetic "." and ".." entries land at the offsets they would have
in a block-based directory.  A real dirent is therefore formed at the
physical address "dir_buf + pos - extra_offset", yet the loop bounds and
the ext4_check_dir_entry() length argument are all expressed in the larger
extra_size.  This mismatch lets two reachable sites dereference a dirent
before confirming its physical offset is inside the allocation.

In the main loop, ctx->pos is attacker-controlled via lseek().  The entry
is validated with extra_size instead of inline_size, so
ext4_check_dir_entry() accepts rec_len/name_len running up to extra_offset
bytes past the allocation, and dereferences de before its (too-large)
length check fires.

In the cookie-rescan loop, entered when i_version changed since the last
readdir(2) (or after an lseek resets the cookie), the walk restarts from
the beginning with i bounded by extra_size.  As i approaches extra_size,
"i - extra_offset" approaches inline_size, so the unconditional read of
de->rec_len runs past the allocation before any validation.

Both are the same defect, logical extra_size space versus the physical
inline_size buffer, so fix them together.  In each loop, reject a dirent
whose minimum-size header would not fit within inline_size before forming
and dereferencing de, and validate the main-loop entry against inline_size
rather than extra_size.  extra_offset only inflates the logical position of
"." and ".."; the dirent itself always lives in the kmalloc(inline_size)
buffer.  Entries that legitimately fill the inline data still pass; only
accesses that would fall outside the allocation are rejected.

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 fs/ext4/inline.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0ec85cfcc859 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,21 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}

+		/*
+		 * ctx->pos can be set to an arbitrary value via lseek(), and
+		 * the rescan above may also advance it.  Make sure the dirent
+		 * header lies within the inline_size payload before
+		 * dereferencing it: extra_offset only inflates the logical
+		 * position of "." and "..", the dirent itself always lives in
+		 * the kmalloc(inline_size) buffer.
+		 */
+		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Andreas Dilger @ 2026-06-08 18:27 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-4-libaokun@linux.alibaba.com>

On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
> 
> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
> division that computes s_itb_per_group truncates, reserving fewer blocks
> for the inode table than needed.
> 
> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
> compute a block offset beyond the inode table, reading unrelated data as
> an inode structure.
> 
> Add the missing divisibility check alongside the existing validation in
> ext4_block_group_meta_init().
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good.  Is this also fixed/checked in e2fsprogs?  There is really no reason *not* to use all the space in the last itable block for inodes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/ext4/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ddcb4a8d4db..5ec9e1ef00c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb,
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>  	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> -	    sbi->s_inodes_per_group & 7) {
> +	    sbi->s_inodes_per_group & 7 ||
> +	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
> -- 
> 2.43.7
> 


Cheers, Andreas

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Andreas Dilger @ 2026-06-08 18:24 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-3-libaokun@linux.alibaba.com>

On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
> 
> The mke2fs man page documents:
> 
>  Valid cluster-size values are from 2048 to 256M bytes per cluster.
> 
> but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
> filesystem images to specify cluster sizes up to 1GB.
> 
> On 32-bit systems with bigalloc enabled, the consistency check in
> ext4_handle_clustersize():
> 
>  s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)
> 
> can overflow when the cluster ratio is large enough. Since
> s_blocks_per_group is not range-checked in the bigalloc path, the
> wrapped product can pass the consistency check, leading to inconsistent
> group geometry and potential out-of-bounds block allocation.
> 
> Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
> limit. With this cap, the maximum product is:
> 
>  (blocksize * 8) * (256M / blocksize) = 2^31
> 
> which fits safely in a 32-bit unsigned long for all block sizes.
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

It should be pretty safe.  I doubt anyone could be using > 1GiB clusters for real-world
usage at this point, given the nature of the bug and the amount of space used for metadata
(e.g. 1GiB per directory, etc).  At that granularity they can use LVM for space management.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
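
The arithmetic in the commit message can be checked with a short shell
sketch. It assumes the shell's arithmetic is at least 64 bits wide (as in
bash on Linux) and emulates a 32-bit unsigned long by masking; max_product
and wrap32 are hypothetical helper names, not kernel functions.

```shell
# Illustration only: largest possible value of
#   s_clusters_per_group * (clustersize / blocksize)
# for a given block size ($1) and cluster log size ($2).
max_product()
{
	local blocksize=$1 cluster_log=$2
	local clusters_per_group=$(( blocksize * 8 ))
	local ratio=$(( (1 << cluster_log) / blocksize ))
	echo $(( clusters_per_group * ratio ))
}

# Emulate storing a value in a 32-bit unsigned long.
wrap32()
{
	echo $(( $1 & 0xFFFFFFFF ))
}

max_product 4096 28			# prints 2147483648 (2^31)
wrap32 "$(max_product 4096 28)"		# prints 2147483648, no wrap
max_product 4096 30			# prints 8589934592 (2^33)
wrap32 "$(max_product 4096 30)"		# prints 0, wrapped
```

The product is always 8 * clustersize regardless of the block size, so a
256MB cap keeps it at 2^31 while a 1GB cluster pushes it to 2^33.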

> ---
> fs/ext4/ext4.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..11e41a864db8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -334,7 +334,7 @@ struct ext4_io_submit {
>  #define EXT4_MAX_BLOCK_SIZE 65536
>  #define EXT4_MIN_BLOCK_LOG_SIZE 10
>  #define EXT4_MAX_BLOCK_LOG_SIZE 16
> -#define EXT4_MAX_CLUSTER_LOG_SIZE 30
> +#define EXT4_MAX_CLUSTER_LOG_SIZE 28
>  #ifdef __KERNEL__
>  # define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
>  #else
> -- 
> 2.43.7
> 


Cheers, Andreas

^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Andreas Dilger @ 2026-06-08 18:20 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, libaokun, jack, ojaswin, ritesh.list, yi.zhang, dmonakhov,
	linux-ext4, linux-kernel
In-Reply-To: <20260608152521.1292656-1-yun.zhou@windriver.com>

On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
> 
> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> belong to the same superblock as the original file.  Currently, this
> validation is performed inside ext4_move_extents() by
> mext_check_validity(), but only after lock_two_nondirectories() has
> already acquired the inode locks.  When the donor fd refers to a file
> on a different filesystem (e.g., overlayfs), this late validation
> creates a circular lock dependency:
> 
>  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
>  ----                              ----
>  inode_lock(ovl_inode)
>                                    mnt_want_write_file(filp)
>                                      sb_start_write(ext4_sb)   [sb_writers]
>    backing_file_write_iter()
>      vfs_iter_write(real_file)
>        file_start_write(real_file)
>          sb_start_write(ext4_sb)   [blocked by freeze]
>                                    lock_two_nondirectories()
>                                      inode_lock(ovl_inode)     [blocked]
> 
> With a concurrent freeze operation holding sb_writers write side, this
> forms a deadlock cycle: CPU0 waits for freeze to complete, freeze waits
> for CPU1's sb_writers reader to exit, CPU1 waits for CPU0's inode lock.
> 
> Since EXT4_IOC_MOVE_EXT exchanges physical extents between two files,
> it fundamentally requires both files to reside on the same ext4
> filesystem.  Moving the superblock check before any lock acquisition
> is both semantically correct and eliminates the circular dependency
> by ensuring that cross-filesystem donor fds are rejected before
> sb_writers or inode locks are taken.
> 
> Fixes: fcf6b1b729bc ("ext4: refactor ext4_move_extents code base")
> Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Thanks for the patch.  Looks good to me.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
> fs/ext4/ioctl.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 1d0c3d4bdf47..f7cd419a3218 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -1650,6 +1650,9 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd,
>  	if (!(fd_file(donor)->f_mode & FMODE_WRITE))
>  		return -EBADF;
> 
> +	if (file_inode(filp)->i_sb != file_inode(fd_file(donor))->i_sb)
> +		return -EXDEV;
> +
>  	err = mnt_want_write_file(filp);
>  	if (err)
>  		return err;
> -- 
> 2.43.0
> 


Cheers, Andreas

^ permalink raw reply

* [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Yun Zhou @ 2026-06-08 15:25 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, dmonakhov
  Cc: linux-ext4, linux-kernel, yun.zhou

Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
belong to the same superblock as the original file.  Currently, this
validation is performed inside ext4_move_extents() by
mext_check_validity(), but only after lock_two_nondirectories() has
already acquired the inode locks.  When the donor fd refers to a file
on a different filesystem (e.g., overlayfs), this late validation
creates a circular lock dependency:

  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
  ----                              ----
  inode_lock(ovl_inode)
                                    mnt_want_write_file(filp)
                                      sb_start_write(ext4_sb)   [sb_writers]
    backing_file_write_iter()
      vfs_iter_write(real_file)
        file_start_write(real_file)
          sb_start_write(ext4_sb)   [blocked by freeze]
                                    lock_two_nondirectories()
                                      inode_lock(ovl_inode)     [blocked]

With a concurrent freeze operation holding sb_writers write side, this
forms a deadlock cycle: CPU0 waits for freeze to complete, freeze waits
for CPU1's sb_writers reader to exit, CPU1 waits for CPU0's inode lock.

Since EXT4_IOC_MOVE_EXT exchanges physical extents between two files,
it fundamentally requires both files to reside on the same ext4
filesystem.  Moving the superblock check before any lock acquisition
is both semantically correct and eliminates the circular dependency
by ensuring that cross-filesystem donor fds are rejected before
sb_writers or inode locks are taken.

Fixes: fcf6b1b729bc ("ext4: refactor ext4_move_extents code base")
Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1d0c3d4bdf47..f7cd419a3218 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1650,6 +1650,9 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (!(fd_file(donor)->f_mode & FMODE_WRITE))
 			return -EBADF;

+		if (file_inode(filp)->i_sb != file_inode(fd_file(donor))->i_sb)
+			return -EXDEV;
+
 		err = mnt_want_write_file(filp);
 		if (err)
 			return err;
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH v6 06/11] fstests: verify f_fsid for cloned filesystems
From: Anand Suveer Jain @ 2026-06-08 14:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043955.GG6070@frogsfrogsfrogs>

On 29/5/26 12:39, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:37PM +0800, Anand Jain wrote:
>> Verify that the cloned filesystem provides an f_fsid that is persistent
>> across mount cycles, yet unique from the original filesystem's f_fsid.
> 
> Might want to add that last part to the test description itself, because
> otherwise I don't know what 'verify' means.
> 


Looks like there's going to be a v7, so I have updated the test description:

-------
+# Check that the cloned filesystem provides an f_fsid that is persistent
+# across mount cycles if the block device maj:min remains unchanged.
-------

>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  tests/generic/802     | 67 +++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/802.out |  7 +++++
>>  2 files changed, 74 insertions(+)
>>  create mode 100644 tests/generic/802
>>  create mode 100644 tests/generic/802.out
>>
>> diff --git a/tests/generic/802 b/tests/generic/802
>> new file mode 100644
>> index 000000000000..653e74e11b53
>> --- /dev/null
>> +++ b/tests/generic/802
>> @@ -0,0 +1,67 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
>> +#
>> +# FS QA Test 802
>> +# Verify f_fsid and s_uuid of cloned filesystems across mount cycle.
>> +
>> +. ./common/preamble
>> +
>> +_begin_fstest auto quick mount clone
>> +
>> +_require_test
>> +_require_block_device $TEST_DEV
>> +_require_loop
>> +
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> +	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> +	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
> 
> _fixed_by_fs_commit?
> 

Oh right! I completely forgot about the new helper _fixed_by_fs_commit.
Now fixed. Thanks.

-[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
         "btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
-[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
         "btrfs: derive f_fsid from on-disk fsuuid and dev_t"

>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	rm -r -f $tmp.*
>> +	umount $mnt1 $mnt2 2>/dev/null
>> +	_loop_image_destroy "${devs[@]}" 2> /dev/null
>> +}
>> +
>> +# Setup base loop device and its clone
>> +devs=()
>> +_loop_image_create_clone devs
>> +mkdir -p $TEST_DIR/$seq
>> +mnt1=$TEST_DIR/$seq/mnt1
>> +mnt2=$TEST_DIR/$seq/mnt2
>> +mkdir -p $mnt1
>> +mkdir -p $mnt2
>> +
>> +# Mount both filesystems simultaneously using mandatory clone mount options
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +
>> +# Capture baseline filesystem IDs for comparison
>> +fsid_scratch=$(stat -f -c "%i" $mnt1)
>> +fsid_clone=$(stat -f -c "%i" $mnt2)
>> +
>> +echo "**** fsid initially ****"
>> +echo $fsid_scratch | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
>> +echo $fsid_clone | sed -e "s/$fsid_clone/FSID_CLONE/g"
> 
> Why echo only to sed?
> 

I tried to keep both the .out log and the script readable.
Since a bare echo FSID_SCRATCH and echo FSID_CLONE would have no
clear link to $fsid_scratch and $fsid_clone, I added a little
extra code around them.

However, I'm completely fine with removing both, viz. the
expected output and its script (for the first mount part).
As I'm pretty sure there will be a v7 now ;-),
I can make those changes.

Thanks, Anand

>> +
>> +# Verify that the fsids remain stable after a mount cycle, even when the
>> +# mount order is reversed.
>> +echo "**** fsid after mount cycle ****"
>> +_unmount $mnt1
>> +_unmount $mnt2
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +
>> +# Compare post mount-cycle values against the baseline
>> +stat -f -c "%i" $mnt1 | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
>> +stat -f -c "%i" $mnt2 | sed -e "s/$fsid_clone/FSID_CLONE/g"
>> +
>> +status=0
>> +exit
>> diff --git a/tests/generic/802.out b/tests/generic/802.out
>> new file mode 100644
>> index 000000000000..d1e008f122bb
>> --- /dev/null
>> +++ b/tests/generic/802.out
>> @@ -0,0 +1,7 @@
>> +QA output created by 802
>> +**** fsid initially ****
>> +FSID_SCRATCH
>> +FSID_CLONE
>> +**** fsid after mount cycle ****
>> +FSID_SCRATCH
>> +FSID_CLONE
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply

* Re: [PATCH v6 05/11] fstests: verify fanotify isolation on cloned filesystems
From: Anand Suveer Jain @ 2026-06-08 14:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043647.GF6070@frogsfrogsfrogs>

On 29/5/26 12:36, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:36PM +0800, Anand Jain wrote:
>> Verify that fanotify events are correctly routed to the appropriate
>> watcher when cloned filesystems are mounted.
>> Helps verify kernel's event notification distinguishes between devices
>> sharing the same FSID/UUID.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  tests/generic/801     | 135 ++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/801.out |   7 +++
>>  2 files changed, 142 insertions(+)
>>  create mode 100644 tests/generic/801
>>  create mode 100644 tests/generic/801.out
>>
>> diff --git a/tests/generic/801 b/tests/generic/801
>> new file mode 100644
>> index 000000000000..3bfb87d41922
>> --- /dev/null
>> +++ b/tests/generic/801
>> @@ -0,0 +1,135 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
>> +#
>> +# FS QA Test 801
>> +# Verify fanotify FID functionality on cloned filesystems by setting up
>> +# watchers and making sure notifications are in the correct logs files.
>> +
>> +. ./common/preamble
>> +
>> +_begin_fstest auto quick mount clone
>> +
>> +_require_test
>> +_require_block_device $TEST_DEV
>> +_require_loop
>> +_require_command "$FSNOTIFYWAIT_PROG" fsnotifywait
>> +_require_unique_f_fsid
>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	[[ -n $pid1 ]] && { kill -TERM "$pid1" 2> /dev/null; wait $pid1; }
>> +	[[ -n $pid2 ]] && { kill -TERM "$pid2" 2> /dev/null; wait $pid2; }
>> +
>> +	if [ "$semanage_added" = "yes" ]; then
>> +		semanage permissive -d unconfined_t >/dev/null 2>&1 || true
>> +	fi
>> +
>> +	umount $mnt1 $mnt2 2>/dev/null
>> +	_loop_image_destroy "${devs[@]}" 2> /dev/null
>> +	rm -r -f $tmp.*
>> +}
>> +



>> +# Run fsnotifywait in unbuffered mode to watch filesystem-wide create events
>> +monitor_fanotify()
>> +{
>> +	local mmnt=$1
>> +	exec stdbuf -oL $FSNOTIFYWAIT_PROG -m -F -S -e create "$mmnt" 2>&1
> 
> I guess you need stdbuf to force fsnotifywait to run in linebuffered
> mode even if you pipe/redirect it somewhere?
> 

Yeah, stdbuf makes the output show up line by line, as the events are generated.

>> +}
>> +
>> +# Transform f_fsid into the hi.lo format used in fanotify FID logs
>> +fsid_to_fid_parts()
>> +{
>> +	local fsid=$1
>> +	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
>> +	local padded=$(printf '%016x' "0x${fsid}")
>> +	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
>> +	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
>> +	echo "${hi}.${lo}"
>> +}
>> +
>> +# Create base loop device and its clone
>> +devs=()
>> +_loop_image_create_clone devs
>> +mkdir -p $TEST_DIR/$seq
>> +mnt1=$TEST_DIR/$seq/mnt1
>> +mnt2=$TEST_DIR/$seq/mnt2
>> +mkdir -p $mnt1
>> +mkdir -p $mnt2
>> +
>> +# Mount both base and clone filesystems using required clone mount options
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +
>> +# Fetch filesystem IDs to verify the kernel can differentiate between them
>> +fsid1=$(stat -f -c "%i" $mnt1)
>> +fsid2=$(stat -f -c "%i" $mnt2)
>> +
>> +log1=$tmp.fanotify1
>> +log2=$tmp.fanotify2
>> +
>> +pid1=""
>> +pid2=""
>> +echo "Setup FID fanotify watchers on both mnt1 and mnt2"
>> +
>> +# Permit unconfined_t domains when SELinux is enforcing to prevent fanotify
>> +# blockages
>> +semanage_added="no"
>> +if [ "$(getenforce 2>/dev/null)" = "Enforcing" ]; then
>> +    if ! semanage permissive -l | grep -q "unconfined_t"; then
>> +        semanage permissive -a unconfined_t >/dev/null 2>&1 && semanage_added="yes"
>> +    fi
>> +fi
> 
> Is there a cleaner way to manage setting up and automatically undoing
> this step?
> 
> There might not be, since iirc the suggestion to register cleanup
> functions in a cleanups=() array and call them all in reverse order
> didn't go anywhere.
> 

If there are multiple use cases, we could wrap it up in a helper,
similar to _scratch_dev_pool_{get|put}, if it helps.

Thanks, Anand
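
For reference, the cleanups=() idea mentioned above could look roughly like
the sketch below. This is only an illustration of the pattern:
register_cleanup and run_cleanups are hypothetical names and do not exist in
fstests today, and using eval on the stored strings is just one possible
design.

```shell
# Hypothetical sketch of a cleanup stack; not an existing fstests helper.
cleanups=()

# Remember a command line to run at cleanup time.
register_cleanup()
{
	cleanups+=("$*")
}

# Run the registered commands in reverse order of registration, ignoring
# individual failures so that later cleanups still run.
run_cleanups()
{
	local i
	for (( i = ${#cleanups[@]} - 1; i >= 0; i-- )); do
		eval "${cleanups[i]}" || true
	done
	cleanups=()
}
```

A test would then call register_cleanup right after each setup step
succeeds (for example, after the semanage permissive -a call) and
_cleanup() would only need to call run_cleanups.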


>> +
>> +# Start asynchronous fanotify monitors
>> +( monitor_fanotify "$mnt1" > "$log1" ) &
>> +pid1=$!
>> +( monitor_fanotify "$mnt2" > "$log2" ) &
>> +pid2=$!
>> +sleep 2
>> +
>> +echo "Trigger file creation on mnt1"
>> +touch $mnt1/file_on_mnt1
>> +sync
>> +sleep 1
>> +
>> +echo "Trigger file creation on mnt2"
>> +touch $mnt2/file_on_mnt2
>> +sync
>> +sleep 1
>> +
>> +echo "Verify fsid in the fanotify"
>> +kill $pid1 $pid2
>> +wait $pid1 $pid2 2>/dev/null
>> +pid1=""
>> +pid2=""
>> +
>> +e_fsid1=$(fsid_to_fid_parts "$fsid1")
>> +e_fsid2=$(fsid_to_fid_parts "$fsid2")
>> +
>> +# Dump debug details to the full log
>> +echo $fsid1 $e_fsid1 $fsid2 $e_fsid2 >> $seqres.full
>> +cat $log1 >> $seqres.full
>> +cat $log2 >> $seqres.full
>> +
>> +# Ensure monitor 1 only captured events belonging to mnt 1 and fsid 1
>> +if grep -qF "$e_fsid1" "$log1" && ! grep -qF "$e_fsid2" "$log1"; then
>> +	echo "SUCCESS: mnt1 events found"
>> +else
>> +	[ ! -s "$log1" ] && echo "  - mnt1 received no events."
>> +	grep -qF "$e_fsid2" "$log1" && echo "  - mnt1 received event from mnt2."
>> +fi
>> +
>> +# Ensure monitor 2 only captured events belonging to mnt 2 and fsid 2
>> +if grep -qF "$e_fsid2" "$log2" && ! grep -qF "$e_fsid1" "$log2"; then
>> +	echo "SUCCESS: mnt2 events found"
>> +else
>> +	[ ! -s "$log2" ] && echo "  - mnt2 received no events."
>> +	grep -qF "$e_fsid1" "$log2" && echo "  - mnt2 received event from mnt1."
>> +fi
>> +
>> +status=0
>> +exit
>> diff --git a/tests/generic/801.out b/tests/generic/801.out
>> new file mode 100644
>> index 000000000000..d7b318d9f27c
>> --- /dev/null
>> +++ b/tests/generic/801.out
>> @@ -0,0 +1,7 @@
>> +QA output created by 801
>> +Setup FID fanotify watchers on both mnt1 and mnt2
>> +Trigger file creation on mnt1
>> +Trigger file creation on mnt2
>> +Verify fsid in the fanotify
>> +SUCCESS: mnt1 events found
>> +SUCCESS: mnt2 events found
>> -- 
>> 2.43.0
>>
>>
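
As a worked example of the fsid_to_fid_parts() helper quoted above, the copy
below can be run standalone in bash. The two f_fsid values are made up; they
only show how the 64-bit hex string is split and how leading zeros are
dropped from each half.

```shell
# Copy of the helper from the quoted patch, for standalone experimentation.
fsid_to_fid_parts()
{
	local fsid=$1
	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
	local padded=$(printf '%016x' "0x${fsid}")
	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
	echo "${hi}.${lo}"
}

# Sample inputs (hypothetical f_fsid values as printed by stat -f -c "%i")
fsid_to_fid_parts 1234abcd		# prints 0.1234abcd
fsid_to_fid_parts 1f0e1d2c00a1b2c3	# prints 1f0e1d2c.a1b2c3
```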


^ permalink raw reply

* Re: [PATCH v6 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Suveer Jain @ 2026-06-08 14:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043056.GE6070@frogsfrogsfrogs>

On 29/5/26 12:30, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:35PM +0800, Anand Jain wrote:
>> Add a helper to check if the target filesystem supports unique f_fsid
>> tracking across cloned or snapshot instances.
>>
>> Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
>> identifiers per filesystem instance. However, Ext4 derives its f_fsid
>> directly from its superblock UUID, which leads to identical f_fsid
>> values on cloned images until the UUID is manually modified by userspace.
>>
>> Introduce _require_unique_f_fsid() to allow test cases requiring strict
>> f_fsid uniqueness to skip gracefully on unsupported filesystems.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index 937f478963b4..5446552aed92 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -6314,6 +6314,27 @@ _require_fanotify_ioerrors()
>>  	_notrun "$FSTYP does not support fanotify ioerrors"
>>  }
>>  
>> +# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
>> +# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
>> +# and F2FS ensure f_fsid remains unique per filesystem instance (often by
>> +# deriving it from the UUID and underlying block device.)
>> +#
>> +# Across all filesystems, a UUID collision causes libblkid tools to return
>> +# non-deterministic device mappings. It is ultimately the responsibility
> 
> "device mappings", as in /dev/disk/by-id/$UUID ?
> 

Correct. I'll make it specific.

>> +# of the userspace utility or use-case to enforce uniqueness when a clone
>> +# diverges. For details, see mailing list thread discussions titled:
>> +#      "ext4: derive f_fsid from block device to avoid collisions".
> 
> How about providing a direct lore link?
> 

Sure, I'll use

Link:
https://lore.kernel.org/linux-ext4/20260409131238.GC18443@macsyma-wired.lan/

instead of the title.

Thanks, Anand


> --D
> 




>> +_require_unique_f_fsid()
>> +{
>> +	# Skip the test if the filesystem does not enforce unique f_fsids
>> +	# natively. Checking this dynamically requires recreating a clone
>> +	# layout, so we use a static lookup based on FSTYP.
>> +	if [ "$FSTYP" == "ext4" ]; then
>> +		_notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
>> +	fi
>> +}
>> +
>> +
>>  # Computes a percentage of the available space in a filesystem and
>>  # returns that quantity in MB. The percentage must not contain a percent
>>  # sign ("%").
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply
