Linux EXT4 FS development


* Re: [PATCH] ext4: fix circular lock dependency in ext4_ext_migrate
From: Jan Kara @ 2026-06-10 10:21 UTC
  To: Zhou, Yun
  Cc: Jan Kara, tytso, adilger.kernel, libaokun, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <7fe6eec7-acd1-4511-beb7-bac9bbdb9cb2@windriver.com>

On Wed 10-06-26 15:04:33, Zhou, Yun wrote:
> 
> 
> On 6/9/26 20:05, Jan Kara wrote:
> > Looks good. Feel free to add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > Just one nit below:
> > 
> > > @@ -591,9 +592,10 @@ int ext4_ext_migrate(struct inode *inode)
> > >        ext4_journal_stop(handle);
> > >   out_tmp_inode:
> > >        unlock_new_inode(tmp_inode);
> > > -     iput(tmp_inode);
> > >   out_unlock:
> > >        ext4_writepages_up_write(inode->i_sb, alloc_ctx);
> > > +     if (tmp_inode)
> > > +             iput(tmp_inode);
> > iput(NULL) is properly handled so you don't need the if (tmp_inode) check
> > here.
> Hi Jan,
> 
> Thank you for your careful review. Should I remove this redundant check in
> v2?

Yes, please. Thank you!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
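
For readers following the nit above: the point is that a release helper which tolerates NULL lets a shared exit label call it unconditionally. A minimal user-space sketch of that pattern follows. struct obj, obj_get_new(), obj_put() and freed_count are hypothetical names invented for this illustration; they are not kernel APIs and do not reproduce iput() itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object; a stand-in for illustration only. */
struct obj {
	int refcount;
};

/* Number of objects freed by obj_put(); exists only to observe behavior. */
int freed_count;

/* Allocate an object holding one reference; returns NULL on failure. */
struct obj *obj_get_new(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o)
		o->refcount = 1;
	return o;
}

/*
 * Drop one reference. A NULL argument is a no-op (as with free(NULL)),
 * so an error path can call this without its own NULL check.
 */
void obj_put(struct obj *o)
{
	if (!o)
		return;
	assert(o->refcount > 0);
	if (--o->refcount == 0) {
		free(o);
		freed_count++;
	}
}
```

With a helper shaped like this, an exit label can call obj_put(tmp) whether or not tmp was ever allocated, which is the simplification requested in the review.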

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Jan Kara @ 2026-06-10 10:01 UTC
  To: Xiang Mei
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Jan Kara, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <20260609010739.2278172-1-xmei5@asu.edu>

What does the 'net' in [PATCH net v2] mean?

On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> inline buffer for a crafted or corrupted inline directory, triggering a
> slab-out-of-bounds read during getdents64():
> 
>   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
>   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
>    ...
>    kasan_report (mm/kasan/report.c:595)
>    filldir64 (fs/readdir.c:371)
>    iterate_dir (fs/readdir.c:110)
>    ...
> 
> The payload is copied into a buffer of exactly inline_size bytes:
> 
> 	dir_buf = kmalloc(inline_size, GFP_NOFS);
> 
> but iteration runs in a logical position space extra_offset bytes larger
> than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> "." and ".." entries land at the offsets they would have in a block-based
> directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> the loop bounds and the ext4_check_dir_entry() length argument are all
> expressed in the larger extra_size. Two reachable sites dereference a
> dirent before confirming its physical offset is inside the allocation:
> 
> In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> running up to extra_offset bytes past the allocation before its length
> check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> that only bounds the top of the buffer is bypassed by underflow and de is
> formed before dir_buf.
> 
> In the cookie-rescan loop, entered when i_version changed since the last
> readdir(2), the walk restarts from the beginning with i bounded by
> extra_size, so as i approaches extra_size the unconditional read of
> de->rec_len runs past the allocation before any validation.
> 
> Both are the same defect, logical extra_size space versus the physical
> inline_size buffer. In each loop, reject a dirent whose header would not
> fit within inline_size before forming de, and in the main loop also reject a
> position that underflows below extra_offset. Validate the main-loop entry
> against inline_size rather than extra_size. Entries that legitimately fill
> the inline data still pass.
> 
> Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Xiang Mei <xmei5@asu.edu>

Thanks for the analysis and the patch. See some suggestions for improvement
below:

> @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
>  			continue;
>  		}
>  
> +		/*
> +		 * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> +		 * header must fit within inline_size.  ctx->pos is a signed,
> +		 * lseek()-controlled loff_t: check the lower bound first, or
> +		 * ctx->pos < extra_offset underflows and points de before dir_buf.
> +		 */
> +		if (ctx->pos < extra_offset ||
> +		    ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> +		    inline_size)
> +			goto out;

So I don't think this is really possible. ctx->pos isn't fully
user-controlled. When you use lseek() to modify ctx->pos, ext4_dir_llseek()
sets info->cookie to an invalid value, so the next time we enter
ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
it to 0, dotdot_offset, or some value greater than extra_size.

>  		de = (struct ext4_dir_entry_2 *)
>  			(dir_buf + ctx->pos - extra_offset);
>  		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> -					 extra_size, ctx->pos))
> +					 inline_size, ctx->pos))
>  			goto out;
>  		if (le32_to_cpu(de->inode)) {
>  			if (!dir_emit(ctx, de->name, de->name_len,

Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
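
The bounds arithmetic discussed in the first hunk above can be sketched in user space. This is only an illustration of the two comparisons, assuming positions and sizes are plain 64-bit integers. header_fits() and its parameter names are hypothetical, and the sketch says nothing about whether the lower-bound case is actually reachable in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when a record header of hdr_len bytes at logical position
 * pos lies entirely inside a physical buffer of inline_size bytes. The
 * physical offset is pos - extra_offset. All names are illustrative.
 */
bool header_fits(int64_t pos, int64_t extra_offset,
		 int64_t inline_size, int64_t hdr_len)
{
	/* Preconditions assumed by this sketch. */
	assert(extra_offset >= 0 && inline_size >= 0 && hdr_len >= 0);

	/* Lower bound first: otherwise pos - extra_offset would be negative. */
	if (pos < extra_offset)
		return false;

	/*
	 * Upper bound, written as a subtraction on the right-hand side so
	 * that a very large pos cannot overflow the left-hand side.
	 */
	return pos - extra_offset <= inline_size - hdr_len;
}
```

With extra_offset = 8, inline_size = 60 and hdr_len = 12 (values picked purely for illustration), header_fits(56, 8, 60, 12) is true because the header ends exactly at byte 60, while header_fits(57, 8, 60, 12) and header_fits(4, 8, 60, 12) are both false.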

* [syzbot ci] Re: ext4: move inline data cleanup to ext4_writepages to fix deadlock
From: syzbot ci @ 2026-06-10  8:06 UTC
  To: adilger.kernel, daeho.jeong, jack, libaokun, linux-ext4,
	linux-kernel, ojaswin, ritesh.list, tytso, yi.zhang, yun.zhou
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260609154505.2104659-1-yun.zhou@windriver.com>

syzbot ci has tested the following series

[v1] ext4: move inline data cleanup to ext4_writepages to fix deadlock
https://lore.kernel.org/all/20260609154505.2104659-1-yun.zhou@windriver.com
* [PATCH] ext4: move inline data cleanup to ext4_writepages to fix deadlock

and found the following issue:
kernel BUG in ext4_writepages

Full report is available here:
https://ci.syzbot.org/series/1ede6029-df2a-4e08-bffc-05540c1f4934

***

kernel BUG in ext4_writepages

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      2d3090a8aeb596a26935db0955d46c9a5db5c6ce
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/63ee0324-2d17-4b32-aca2-c6230ff64be6/config
syz repro: https://ci.syzbot.org/findings/676a447c-ea73-43ea-9949-054dac1961e5/syz_repro

EXT4-fs warning (device loop2): ext4_expand_extra_isize_ea:2860: Unable to expand inode 15. Delete some EAs or run e2fsck.
------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:3047!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 5875 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:ext4_writepages+0x622/0x630 fs/ext4/inode.c:3046
Code: ff e9 61 fc ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c de fc ff ff 4c 89 f7 e8 f9 2f a8 ff e9 d1 fc ff ff e8 ef d7 3c ff 90 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900034df2e0 EFLAGS: 00010293
RAX: ffffffff8288dfb1 RBX: 1ffff9200069be60 RCX: ffff888110555940
RDX: 0000000000000000 RSI: 0000004000000000 RDI: 0000000000000000
RBP: ffffc900034df410 R08: ffff8881b48c2f0f R09: 1ffff110369185e1
R10: dffffc0000000000 R11: ffffed10369185e2 R12: dffffc0000000000
R13: 0000004000000000 R14: 0000004610000000 R15: 1ffff11020c6fcc5
FS:  00007f033e9346c0(0000) GS:ffff8882a92a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005556842f3058 CR3: 000000016d5c0000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 do_writepages+0x32e/0x550 mm/page-writeback.c:2571
 __writeback_single_inode+0x133/0x10e0 fs/fs-writeback.c:1764
 writeback_single_inode+0x4ac/0xdc0 fs/fs-writeback.c:1883
 write_inode_now+0x1c2/0x290 fs/fs-writeback.c:2974
 iput_final fs/inode.c:1950 [inline]
 iput+0x8c1/0xe80 fs/inode.c:2009
 ext4_orphan_cleanup+0xc38/0x1470 fs/ext4/orphan.c:472
 __ext4_fill_super fs/ext4/super.c:5701 [inline]
 ext4_fill_super+0x5a19/0x6330 fs/ext4/super.c:5824
 get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1694
 vfs_get_tree+0x92/0x2a0 fs/super.c:1754
 fc_mount fs/namespace.c:1193 [inline]
 do_new_mount_fc fs/namespace.c:3758 [inline]
 do_new_mount+0x341/0xd30 fs/namespace.c:3834
 do_mount fs/namespace.c:4167 [inline]
 __do_sys_mount fs/namespace.c:4383 [inline]
 __se_sys_mount+0x31d/0x420 fs/namespace.c:4360
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f033d99e0ca
Code: 48 c7 c2 e8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f033e933e58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f033e933ee0 RCX: 00007f033d99e0ca
RDX: 0000200000000040 RSI: 00002000000016c0 RDI: 00007f033e933ea0
RBP: 0000200000000040 R08: 00007f033e933ee0 R09: 000000000000840e
R10: 000000000000840e R11: 0000000000000246 R12: 00002000000016c0
R13: 00007f033e933ea0 R14: 000000000000042f R15: 0000200000000080
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:ext4_writepages+0x622/0x630 fs/ext4/inode.c:3046
Code: ff e9 61 fc ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c de fc ff ff 4c 89 f7 e8 f9 2f a8 ff e9 d1 fc ff ff e8 ef d7 3c ff 90 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
RSP: 0018:ffffc900034df2e0 EFLAGS: 00010293
RAX: ffffffff8288dfb1 RBX: 1ffff9200069be60 RCX: ffff888110555940
RDX: 0000000000000000 RSI: 0000004000000000 RDI: 0000000000000000
RBP: ffffc900034df410 R08: ffff8881b48c2f0f R09: 1ffff110369185e1
R10: dffffc0000000000 R11: ffffed10369185e2 R12: dffffc0000000000
R13: 0000004000000000 R14: 0000004610000000 R15: 1ffff11020c6fcc5
FS:  00007f033e9346c0(0000) GS:ffff8882a92a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005556842f3058 CR3: 000000016d5c0000 CR4: 00000000000006f0


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

* Re: [PATCH] ext4: fix circular lock dependency in ext4_ext_migrate
From: Zhou, Yun @ 2026-06-10  7:04 UTC
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	ebiggers, linux-ext4, linux-kernel
In-Reply-To: <lr2gyeoay4eai2nujk3siaq7wnqwg3t46an6sipqkmhxarvcrb@tqxhmnstmwnv>



On 6/9/26 20:05, Jan Kara wrote:
> Looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Just one nit below:
>
>> @@ -591,9 +592,10 @@ int ext4_ext_migrate(struct inode *inode)
>>        ext4_journal_stop(handle);
>>   out_tmp_inode:
>>        unlock_new_inode(tmp_inode);
>> -     iput(tmp_inode);
>>   out_unlock:
>>        ext4_writepages_up_write(inode->i_sb, alloc_ctx);
>> +     if (tmp_inode)
>> +             iput(tmp_inode);
> iput(NULL) is properly handled so you don't need the if (tmp_inode) check
> here.
Hi Jan,

Thank you for your careful review. Should I remove this redundant check 
in v2?

BR,
Yun

* Re: [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Gao Xiang @ 2026-06-10  6:55 UTC
  To: Christian Brauner
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <20260603-nieder-ausdehnen-siebdruck-aa96f40ebec6@brauner>

Hi Christian,

On 2026/6/3 21:42, Christian Brauner wrote:
>> May I ask if it's an urgent 7.2 work? If not, I could
> 
> No no, it's way too late for that this cycle.
> 
>> make a preparation patch for the upcoming 7.2 cycle
>> to handle erofs_map_dev() failure here so you don't
>> need to bother with this in this patchset.
> 
> Sounds good. I take it you can just do this yourself without me.
> 
>> I will seek more time to resolve the recent todos
> 
> Thanks!
> 
>> yet always intercepted by other unrelated stuffs.
> 
> :)

I removed the .shutdown() and .remove_bdev() implementations since I
don't think they are necessary for immutable filesystems, but I would
like to know your thoughts too. My own comments are documented in the
commit message below:


 From 933f6c6f2e704116d9a15815c880196bec7b9ee3 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 2 Jun 2026 12:10:13 +0200
Subject: [PATCH] erofs: open via dedicated fs bdev helpers

Route opens through fs_bdev_file_open_by_path() so each external device
is registered against the correct superblock, and convert the matching
releases.

Gao Xiang: I think typical immutable filesystems don't need .shutdown()
and .remove_bdev() for the following reasons:

  - blk_mark_disk_dead() sets GD_DEAD in advance of fs_bdev_mark_dead()
    so that the following bios will fail immediately; block_device
    references are still valid so it seems overkill to handle dead
    blockdevs in the deep filesystem I/O submission path.

  - Immutable filesystems like EROFS don't have write paths and journals,
    so they don't need to block writes (i.e., new dirty pages), metadata
    changes, and abort journals.

  - The comment above loop_change_fd() documents a valid read-only use
    case we need to support anyway, but it calls disk_force_media_change(),
    which will call fs_bdev_mark_dead() later: we don't want loop_change_fd()
    to shut down the active filesystems and return -EIO unconditionally.

Currently I think the default behavior (shrink_dcache_sb + evict_inodes)
in fs_bdev_mark_dead() is enough for immutable filesystems; I have tried
to document this in the commit message for later reference.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
  fs/erofs/super.c | 35 +++++++++++++++++++++++------------
  1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 802add6652fd..def9cbfbc9d8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -153,8 +153,8 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
  	} else if (!sbi->devs->flatdev) {
  		file = erofs_is_fileio_mode(sbi) ?
  				filp_open(dif->path, O_RDONLY | O_LARGEFILE, 0) :
-				bdev_file_open_by_path(dif->path,
-						BLK_OPEN_READ, sb->s_type, NULL);
+				fs_bdev_file_open_by_path(dif->path,
+						BLK_OPEN_READ, sb->s_type, sb);
  		if (IS_ERR(file)) {
  			if (file == ERR_PTR(-ENOTBLK))
  				return -EINVAL;
@@ -843,11 +843,16 @@ static int erofs_fc_reconfigure(struct fs_context *fc)

  static int erofs_release_device_info(int id, void *ptr, void *data)
  {
+	struct super_block *sb = data;
  	struct erofs_device_info *dif = ptr;

  	fs_put_dax(dif->dax_dev, NULL);
-	if (dif->file)
-		fput(dif->file);
+	if (dif->file) {
+		if (S_ISBLK(file_inode(dif->file)->i_mode))
+			fs_bdev_file_release(dif->file, sb);
+		else
+			fput(dif->file);
+	}
  	erofs_fscache_unregister_cookie(dif->fscache);
  	dif->fscache = NULL;
  	kfree(dif->path);
@@ -855,18 +860,19 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
  	return 0;
  }

-static void erofs_free_dev_context(struct erofs_dev_context *devs)
+static void erofs_free_dev_context(struct erofs_dev_context *devs,
+				   struct super_block *sb)
  {
  	if (!devs)
  		return;
-	idr_for_each(&devs->tree, &erofs_release_device_info, NULL);
+	idr_for_each(&devs->tree, &erofs_release_device_info, sb);
  	idr_destroy(&devs->tree);
  	kfree(devs);
  }

-static void erofs_sb_free(struct erofs_sb_info *sbi)
+static void erofs_sb_free(struct erofs_sb_info *sbi, struct super_block *sb)
  {
-	erofs_free_dev_context(sbi->devs);
+	erofs_free_dev_context(sbi->devs, sb);
  	kfree(sbi->fsid);
  	kfree_sensitive(sbi->domain_id);
  	if (sbi->dif0.file)
@@ -879,8 +885,13 @@ static void erofs_fc_free(struct fs_context *fc)
  {
  	struct erofs_sb_info *sbi = fc->s_fs_info;

-	if (sbi) /* free here if an error occurs before transferring to sb */
-		erofs_sb_free(sbi);
+	/*
+	 * Freed here only if an error occurs before the sb is set up; at that
+	 * point no block-backed device has been claimed (that happens in
+	 * fill_super), so the NULL sb never reaches fs_bdev_file_release().
+	 */
+	if (sbi)
+		erofs_sb_free(sbi, NULL);
  }

  static const struct fs_context_operations erofs_context_ops = {
@@ -936,7 +947,7 @@ static void erofs_kill_sb(struct super_block *sb)
  	erofs_drop_internal_inodes(sbi);
  	fs_put_dax(sbi->dif0.dax_dev, NULL);
  	erofs_fscache_unregister_fs(sb);
-	erofs_sb_free(sbi);
+	erofs_sb_free(sbi, sb);
  	sb->s_fs_info = NULL;
  }

@@ -948,7 +959,7 @@ static void erofs_put_super(struct super_block *sb)
  	erofs_shrinker_unregister(sb);
  	erofs_xattr_prefixes_cleanup(sb);
  	erofs_drop_internal_inodes(sbi);
-	erofs_free_dev_context(sbi->devs);
+	erofs_free_dev_context(sbi->devs, sb);
  	sbi->devs = NULL;
  	erofs_fscache_unregister_fs(sb);
  }
--
2.43.5



* [PATCH v3] ext4: drop s_writepages_rwsem around inline data handling in writepages
From: Yun Zhou @ 2026-06-10  6:37 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260609154505.2104659-1-yun.zhou@windriver.com>

ext4_do_writepages() calls ext4_destroy_inline_data() which acquires
xattr_sem while s_writepages_rwsem is held (read).  This creates a
circular lock dependency:

  CPU0                               CPU1
  ----                               ----
  ext4_writepages()
    ext4_writepages_down_read()
      [holds s_writepages_rwsem]
                                     ext4_evict_inode()
                                       __ext4_mark_inode_dirty()
                                         ext4_expand_extra_isize_ea()
                                           ext4_xattr_block_set()
                                             [holds xattr_sem]
                                             iput(old_bh inode)
                                               write_inode_now()
                                                 ext4_writepages()
                                                   ext4_writepages_down_read()
                                                   [BLOCKED on s_writepages_rwsem]
    ext4_do_writepages()
      ext4_destroy_inline_data()
        down_write(xattr_sem)
        [BLOCKED on xattr_sem]

Fix by temporarily dropping s_writepages_rwsem for the entire inline
data handling block, including the journal handle start/stop.  The
rwsem must be dropped before ext4_journal_start() -- not between
journal_start and journal_stop -- to avoid a secondary deadlock with
ext4_change_inode_journal_flag() which takes rwsem (write) and then
calls jbd2_journal_lock_updates() waiting for active handles to stop.

This is safe because:

 - This code runs before any block mapping or IO submission, so no
   writepages state depends on the rwsem being held at this point.

 - Inline data destruction is a one-way format transition (once cleared,
   EXT4_INODE_INLINE_DATA is never set again).  The rwsem is
   re-acquired after journal_stop, ensuring format stability for the
   remainder of writepages.

 - The can_map flag identifies the ext4_writepages() path (holds rwsem)
   vs ext4_normal_submit_inode_data_buffers() (does not), so the
   drop/reacquire is skipped when the rwsem is not held.

Also check the return value of ext4_destroy_inline_data() to avoid
proceeding with an inconsistent inode format on failure.

Reported-by: syzbot+bb2455d02bda0b5701e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Drop s_writepages_rwsem before ext4_journal_start() and reacquire
    after ext4_journal_stop(), instead of dropping between journal_start
    and journal_stop as in v2.  This avoids two issues identified in v2
    review:
    - memalloc_nofs_restore() in ext4_writepages_up_read() would clear
      PF_MEMALLOC_NOFS while the jbd2 handle is active.
    - Reacquiring s_writepages_rwsem while holding a handle creates an
      ABBA deadlock with ext4_change_inode_journal_flag() which takes
      the rwsem (write) then calls jbd2_journal_lock_updates().

v2: Instead of moving inline data handling to ext4_writepages(),
    temporarily drop s_writepages_rwsem around ext4_destroy_inline_data()
    in ext4_do_writepages(). The move approach had a race where concurrent
    writes could create dirty pages with inline data after the early check,
    and unconditional destruction without dirty pages would lose data.

v1: Moved inline data cleanup from ext4_do_writepages() to
    ext4_writepages() before acquiring s_writepages_rwsem.

 fs/ext4/inode.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..cd7588a3fa45 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1694,6 +1694,9 @@ struct mpage_da_data {
 	struct writeback_control *wbc;
 	unsigned int can_map:1;	/* Can writepages call map blocks? */
 
+	/* Saved memalloc context from ext4_writepages_down_read() */
+	int alloc_ctx;
+
 	/* These are internal state of ext4_do_writepages() */
 	loff_t start_pos;	/* The start pos to write */
 	loff_t next_pos;	/* Current pos to examine */
@@ -2816,16 +2819,35 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	 * we'd better clear the inline data here.
 	 */
 	if (ext4_has_inline_data(inode)) {
-		/* Just inode will be modified... */
+		/*
+		 * Temporarily drop s_writepages_rwsem because
+		 * ext4_destroy_inline_data() acquires xattr_sem, which has
+		 * a higher lock ordering rank.  Holding both would create a
+		 * circular dependency with ext4_xattr_block_set() -> iput()
+		 * -> ext4_writepages() -> s_writepages_rwsem.
+		 *
+		 * Drop the rwsem before starting the journal handle to also
+		 * avoid a deadlock with ext4_change_inode_journal_flag(),
+		 * which takes rwsem (write) then jbd2_journal_lock_updates().
+		 */
+		if (mpd->can_map)
+			ext4_writepages_up_read(inode->i_sb, mpd->alloc_ctx);
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
 		if (IS_ERR(handle)) {
+			if (mpd->can_map)
+				mpd->alloc_ctx =
+					ext4_writepages_down_read(inode->i_sb);
 			ret = PTR_ERR(handle);
 			goto out_writepages;
 		}
 		BUG_ON(ext4_test_inode_state(inode,
 				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
+		ret = ext4_destroy_inline_data(handle, inode);
 		ext4_journal_stop(handle);
+		if (mpd->can_map)
+			mpd->alloc_ctx = ext4_writepages_down_read(inode->i_sb);
+		if (ret)
+			goto out_writepages;
 	}
 
 	/*
@@ -3032,13 +3054,12 @@ static int ext4_writepages(struct address_space *mapping,
 		.can_map = 1,
 	};
 	int ret;
-	int alloc_ctx;
 
 	ret = ext4_emergency_state(sb);
 	if (unlikely(ret))
 		return ret;
 
-	alloc_ctx = ext4_writepages_down_read(sb);
+	mpd.alloc_ctx = ext4_writepages_down_read(sb);
 	ret = ext4_do_writepages(&mpd);
 	/*
 	 * For data=journal writeback we could have come across pages marked
@@ -3047,7 +3068,7 @@ static int ext4_writepages(struct address_space *mapping,
 	 */
 	if (!ret && mpd.journalled_more_data)
 		ret = ext4_do_writepages(&mpd);
-	ext4_writepages_up_read(sb, alloc_ctx);
+	ext4_writepages_up_read(sb, mpd.alloc_ctx);
 
 	return ret;
 }
-- 
2.43.0


* [PATCH v2] ext4: drop s_writepages_rwsem around ext4_destroy_inline_data
From: Yun Zhou @ 2026-06-10  5:08 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260609154505.2104659-1-yun.zhou@windriver.com>

ext4_do_writepages() calls ext4_destroy_inline_data() which acquires
xattr_sem while s_writepages_rwsem is held (read).  This creates a
circular lock dependency:

  CPU0                               CPU1
  ----                               ----
  ext4_writepages()
    ext4_writepages_down_read()
      [holds s_writepages_rwsem]
                                     ext4_evict_inode()
                                       __ext4_mark_inode_dirty()
                                         ext4_expand_extra_isize_ea()
                                           ext4_xattr_block_set()
                                             [holds xattr_sem]
                                             iput(old_bh inode)
                                               write_inode_now()
                                                 ext4_writepages()
                                                   ext4_writepages_down_read()
                                                   [BLOCKED on s_writepages_rwsem]
    ext4_do_writepages()
      ext4_destroy_inline_data()
        down_write(xattr_sem)
        [BLOCKED on xattr_sem]

Fix by temporarily dropping s_writepages_rwsem around the call to
ext4_destroy_inline_data().  This is safe because:

 - This code runs before any block mapping or IO submission, so no
   writepages state depends on the rwsem being held at this point.

 - Inline data destruction is a one-way format transition (once cleared,
   EXT4_INODE_INLINE_DATA is never set again).  The rwsem is
   re-acquired immediately after, ensuring format stability for the
   remainder of writepages.

 - The can_map flag naturally identifies the ext4_writepages() path
   (holds rwsem) vs ext4_normal_submit_inode_data_buffers() (does not),
   so the drop/reacquire is skipped when the rwsem is not held.

Also check the return value of ext4_destroy_inline_data() -- previously
ignored, a failure would leave inline data intact while writepages
proceeds assuming block-mapped layout.

Reported-by: syzbot+bb2455d02bda0b5701e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2:
 - Instead of moving inline data handling to ext4_writepages(),
   temporarily drop s_writepages_rwsem around ext4_destroy_inline_data()
   in ext4_do_writepages(). The move approach had a race where concurrent
   writes could create dirty pages with inline data after the early check,
   and unconditional destruction without dirty pages would lose data.

 fs/ext4/inode.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..7ec16adf4685 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1694,6 +1694,9 @@ struct mpage_da_data {
 	struct writeback_control *wbc;
 	unsigned int can_map:1;	/* Can writepages call map blocks? */
 
+	/* Saved memalloc context from ext4_writepages_down_read() */
+	int alloc_ctx;
+
 	/* These are internal state of ext4_do_writepages() */
 	loff_t start_pos;	/* The start pos to write */
 	loff_t next_pos;	/* Current pos to examine */
@@ -2824,8 +2827,21 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 		}
 		BUG_ON(ext4_test_inode_state(inode,
 				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
+		/*
+		 * Temporarily drop s_writepages_rwsem because
+		 * ext4_destroy_inline_data() acquires xattr_sem, which has
+		 * a higher lock ordering rank.  Holding both would create a
+		 * circular dependency with ext4_xattr_block_set() -> iput()
+		 * -> ext4_writepages() -> s_writepages_rwsem.
+		 */
+		if (mpd->can_map)
+			ext4_writepages_up_read(inode->i_sb, mpd->alloc_ctx);
+		ret = ext4_destroy_inline_data(handle, inode);
+		if (mpd->can_map)
+			mpd->alloc_ctx = ext4_writepages_down_read(inode->i_sb);
 		ext4_journal_stop(handle);
+		if (ret)
+			goto out_writepages;
 	}
 
 	/*
@@ -3032,13 +3048,12 @@ static int ext4_writepages(struct address_space *mapping,
 		.can_map = 1,
 	};
 	int ret;
-	int alloc_ctx;
 
 	ret = ext4_emergency_state(sb);
 	if (unlikely(ret))
 		return ret;
 
-	alloc_ctx = ext4_writepages_down_read(sb);
+	mpd.alloc_ctx = ext4_writepages_down_read(sb);
 	ret = ext4_do_writepages(&mpd);
 	/*
 	 * For data=journal writeback we could have come across pages marked
@@ -3047,7 +3062,7 @@ static int ext4_writepages(struct address_space *mapping,
 	 */
 	if (!ret && mpd.journalled_more_data)
 		ret = ext4_do_writepages(&mpd);
-	ext4_writepages_up_read(sb, alloc_ctx);
+	ext4_writepages_up_read(sb, mpd.alloc_ctx);
 
 	return ret;
 }
-- 
2.43.0


* Re: [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-10  3:16 UTC
  To: Jan Kara
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel,
	syzbot+0c89d865531d053abb2d
In-Reply-To: <w7zmtkoa4ieb676gkl6m2ax5hp76dxr2rhkfzgqvlydvw4hpfr@hixijfpumliv>

Hi Jan,

Thank you for the review and for the Reviewed-by tag!

Best regards,
Aditya Prakash Srivastava

* [syzbot] [ext4?] KASAN: use-after-free Read in ext4_xattr_list_entries (2)
From: syzbot @ 2026-06-09 21:03 UTC
  To: adilger.kernel, jack, libaokun, linux-ext4, linux-kernel, ojaswin,
	ritesh.list, syzkaller-bugs, tytso, yi.zhang

Hello,

syzbot found the following issue on:

HEAD commit:    8e65320d91cd Merge tag 'drm-fixes-2026-06-06' of https://g..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=106aabec580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b4166e8ea5fbf7e3
dashboard link: https://syzkaller.appspot.com/bug?extid=3fbf2337de43f5581aec
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-8e65320d.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/0bee42e3c28b/vmlinux-8e65320d.xz
kernel image: https://storage.googleapis.com/syzbot-assets/57e4c1a3c321/bzImage-8e65320d.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3fbf2337de43f5581aec@syzkaller.appspotmail.com

loop0: detected capacity change from 0 to 2048
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
cdc_ether 5-1:1.0: probe with driver cdc_ether failed with error -22
loop0: detected capacity change from 2048 to 64
EXT4-fs error (device loop0): xattr_find_entry:337: inode #15: comm syz.0.0: corrupted xattr entries
==================================================================
BUG: KASAN: use-after-free in ext4_xattr_list_entries+0x302/0x3d0 fs/ext4/xattr.c:724
Read of size 4 at addr ffff8880568de014 by task syz.0.0/5342

CPU: 0 UID: 0 PID: 5342 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_xattr_list_entries+0x302/0x3d0 fs/ext4/xattr.c:724
 ext4_xattr_ibody_list fs/ext4/xattr.c:793 [inline]
 ext4_listxattr+0x221/0x670 fs/ext4/xattr.c:818
 vfs_listxattr fs/xattr.c:511 [inline]
 listxattr+0x112/0x2a0 fs/xattr.c:933
 filename_listxattr fs/xattr.c:966 [inline]
 path_listxattrat+0x1a3/0x3f0 fs/xattr.c:993
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f9709f9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f970af29fe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000c3
RAX: ffffffffffffffda RBX: 00007f970a215fa0 RCX: 00007f9709f9ce59
RDX: 000000000000002d RSI: 0000200000000100 RDI: 0000200000000140
RBP: 00007f970a032d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f970a216038 R14: 00007f970a215fa0 R15: 00007fff4c5d5648
 </TASK>

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x568de
flags: 0x4fff00000000000(node=1|zone=1|lastcpupid=0x7ff)
raw: 04fff00000000000 ffffea00015a37c8 ffffea000141f208 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner info is not present (never set?)

Memory state around the buggy address:
 ffff8880568ddf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8880568ddf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8880568de000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                         ^
 ffff8880568de080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8880568de100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* [syzbot] Monthly ext4 report (Jun 2026)
From: syzbot @ 2026-06-09 20:32 UTC (permalink / raw)
  To: linux-ext4, linux-kernel, syzkaller-bugs

Hello ext4 maintainers/developers,

This is a 31-day syzbot report for the ext4 subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/ext4

During the period, 2 new issues were detected and 0 were fixed.
In total, 45 issues are still open and 175 have already been fixed.
There are also 8 low-priority issues.

Some of the still happening issues:

Ref  Crashes Repro Title
<1>  10098   Yes   possible deadlock in ext4_writepages (2)
                   https://syzkaller.appspot.com/bug?extid=eb5b4ef634a018917f3c
<2>  7580    Yes   KASAN: out-of-bounds Read in ext4_xattr_set_entry
                   https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1
<3>  3212    Yes   kernel BUG in ext4_do_writepages
                   https://syzkaller.appspot.com/bug?extid=d1da16f03614058fdc48
<4>  481     Yes   possible deadlock in ext4_evict_inode (5)
                   https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
<5>  406     Yes   possible deadlock in wait_transaction_locked (3)
                   https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
<6>  139     Yes   KMSAN: uninit-value in fscrypt_crypt_data_unit
                   https://syzkaller.appspot.com/bug?extid=7add5c56bc2a14145d20
<7>  20      No    possible deadlock in evict (4)
                   https://syzkaller.appspot.com/bug?extid=a30a00d3e694e4fa1315
<8>  10      No    WARNING in ext4_write_inode (3)
                   https://syzkaller.appspot.com/bug?extid=070d9738dbe6a10fadc8
<9>  8       Yes   INFO: task hung in block_read_full_folio (3)
                   https://syzkaller.appspot.com/bug?extid=03afbb29537f0336b7ad
<10> 2951    Yes   INFO: task hung in sync_inodes_sb (5)
                   https://syzkaller.appspot.com/bug?extid=30476ec1b6dc84471133

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders

To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem

You may send multiple commands in a single email message.

^ permalink raw reply

* [PATCH 5.10/5.15] ext4: validate p_idx bounds in ext4_ext_correct_indexes
From: Alexey Panov @ 2026-06-09 16:44 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Alexey Panov, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel, Baokun Li, Jan Kara, Ojaswin Mujoo,
	Ritesh Harjani (IBM), Zhang Yi, lvc-project,
	syzbot+04c4e65cab786a2e5b7e, Tejas Bharambe, stable

From: Tejas Bharambe <tejas.bharambe@outlook.com>

commit 2acb5c12ebd860f30e4faf67e6cc8c44ddfe5fe8 upstream.

ext4_ext_correct_indexes() walks up the extent tree correcting
index entries when the first extent in a leaf is modified. Before
accessing path[k].p_idx->ei_block, there is no validation that
p_idx falls within the valid range of index entries for that
level.

If the on-disk extent header contains a corrupted or crafted
eh_entries value, p_idx can point past the end of the allocated
buffer, causing a slab-out-of-bounds read.

Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at
both access sites: before the while loop and inside it. Return
-EFSCORRUPTED if the index pointer is out of range, consistent
with how other bounds violations are handled in the ext4 extent
tree code.
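
The check described above can be modelled in userspace roughly as follows. This is only a sketch: struct node_hdr, struct idx_entry, last_index(), set_index_block() and SKETCH_EFSCORRUPTED are hypothetical stand-ins, not the kernel's extent structures or error code, and the capacity check is an addition that keeps the sketch free of out-of-range pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error value for this sketch; not the kernel's definition. */
#define SKETCH_EFSCORRUPTED 117
#define SKETCH_CAPACITY 8

/* Simplified stand-ins for an index entry and its node header. */
struct idx_entry {
	uint32_t block;
};

struct node_hdr {
	uint16_t entries;                      /* untrusted count of valid entries */
	struct idx_entry idx[SKETCH_CAPACITY]; /* backing storage for the entries */
};

/* Last valid entry; plays the role that EXT_LAST_INDEX() has in the patch. */
static struct idx_entry *last_index(struct node_hdr *hdr)
{
	return hdr->idx + hdr->entries - 1;
}

/* Write 'border' through p_idx only if p_idx is within the valid range. */
static int set_index_block(struct node_hdr *hdr, struct idx_entry *p_idx,
			   uint32_t border)
{
	assert(hdr != NULL && p_idx != NULL);

	/* Sketch-only sanity check so last_index() stays inside idx[]. */
	if (hdr->entries == 0 || hdr->entries > SKETCH_CAPACITY)
		return -SKETCH_EFSCORRUPTED;

	/* The check the patch adds: reject a pointer past the last entry. */
	if (p_idx > last_index(hdr))
		return -SKETCH_EFSCORRUPTED;

	p_idx->block = border;
	return 0;
}
```

The point of doing the comparison before the store is that a rejected pointer is never dereferenced, so the out-of-range entry stays untouched.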

Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e
Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com>
Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
[ Alexey: Adapt goto clean to break because the clean error path is not
  present in linux-5.10.y and linux-5.15.y. ]
Signed-off-by: Alexey Panov <apanov@astralinux.ru>
---
Backport fix for CVE-2026-31449
 fs/ext4/extents.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 80b7783c65b4..e6dbb2dfb331 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1736,6 +1736,13 @@ static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
 	err = ext4_ext_get_access(handle, inode, path + k);
 	if (err)
 		return err;
+	if (unlikely(path[k].p_idx > EXT_LAST_INDEX(path[k].p_hdr))) {
+		EXT4_ERROR_INODE(inode,
+				 "path[%d].p_idx %p > EXT_LAST_INDEX %p",
+				 k, path[k].p_idx,
+				 EXT_LAST_INDEX(path[k].p_hdr));
+		return -EFSCORRUPTED;
+	}
 	path[k].p_idx->ei_block = border;
 	err = ext4_ext_dirty(handle, inode, path + k);
 	if (err)
@@ -1748,6 +1755,14 @@ static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
 		err = ext4_ext_get_access(handle, inode, path + k);
 		if (err)
 			break;
+		if (unlikely(path[k].p_idx > EXT_LAST_INDEX(path[k].p_hdr))) {
+			EXT4_ERROR_INODE(inode,
+					 "path[%d].p_idx %p > EXT_LAST_INDEX %p",
+					 k, path[k].p_idx,
+					 EXT_LAST_INDEX(path[k].p_hdr));
+			err = -EFSCORRUPTED;
+			break;
+		}
 		path[k].p_idx->ei_block = border;
 		err = ext4_ext_dirty(handle, inode, path + k);
 		if (err)
-- 
2.47.3

^ permalink raw reply related

* [PATCH] ext4: move inline data cleanup to ext4_writepages to fix deadlock
From: Yun Zhou @ 2026-06-09 15:45 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, daeho.jeong
  Cc: linux-ext4, linux-kernel, yun.zhou

ext4_do_writepages() calls ext4_destroy_inline_data() which acquires
xattr_sem while s_writepages_rwsem is held (read).  This creates a
circular lock dependency with the xattr writeback path:

  CPU0                               CPU1
  ----                               ----
  ext4_writepages()
    ext4_writepages_down_read()
      [holds s_writepages_rwsem]
                                     ext4_evict_inode()
                                       __ext4_mark_inode_dirty()
                                         ext4_expand_extra_isize_ea()
                                           ext4_xattr_block_set()
                                             [holds xattr_sem]
                                             iput(old_bh inode)
                                               write_inode_now()
                                                 ext4_writepages()
                                                   ext4_writepages_down_read()
                                                   [BLOCKED on s_writepages_rwsem]
    ext4_do_writepages()
      ext4_destroy_inline_data()
        down_write(xattr_sem)
        [BLOCKED on xattr_sem]

Move inline data destruction from ext4_do_writepages() into
ext4_writepages(), before acquiring s_writepages_rwsem.
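
To illustrate why the reordering removes the cycle, here is a small userspace sketch of a two-lock order tracker, loosely in the spirit of what lockdep reports. It is not kernel code: struct order_tracker and the three path functions are hypothetical names, and no real locking takes place, only the acquisition order is recorded.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { XATTR_SEM, WRITEPAGES_RWSEM, NR_LOCKS };

struct order_tracker {
	bool held[NR_LOCKS];
	/* order[a][b] is true once b has been taken while a was held */
	bool order[NR_LOCKS][NR_LOCKS];
};

/* Record taking 'id'; return false if this inverts a previously seen order. */
static bool tracker_acquire(struct order_tracker *t, enum lock_id id)
{
	bool ok = true;

	assert(id < NR_LOCKS);
	for (int h = 0; h < NR_LOCKS; h++) {
		if (!t->held[h] || h == (int)id)
			continue;
		if (t->order[id][h])
			ok = false;	/* seen id -> h before, now h -> id */
		t->order[h][id] = true;
	}
	t->held[id] = true;
	return ok;
}

static void tracker_release(struct order_tracker *t, enum lock_id id)
{
	t->held[id] = false;
}

/* xattr path: xattr_sem is held when writeback takes the rwsem. */
static bool xattr_path(struct order_tracker *t)
{
	bool ok = tracker_acquire(t, XATTR_SEM);

	ok = tracker_acquire(t, WRITEPAGES_RWSEM) && ok;
	tracker_release(t, WRITEPAGES_RWSEM);
	tracker_release(t, XATTR_SEM);
	return ok;
}

/* Old writepages: the rwsem is held while inline data is destroyed. */
static bool writepages_old(struct order_tracker *t)
{
	bool ok = tracker_acquire(t, WRITEPAGES_RWSEM);

	ok = tracker_acquire(t, XATTR_SEM) && ok;
	tracker_release(t, XATTR_SEM);
	tracker_release(t, WRITEPAGES_RWSEM);
	return ok;
}

/* New writepages: inline data is destroyed before the rwsem is taken. */
static bool writepages_new(struct order_tracker *t)
{
	bool ok = tracker_acquire(t, XATTR_SEM);

	tracker_release(t, XATTR_SEM);
	ok = tracker_acquire(t, WRITEPAGES_RWSEM) && ok;
	tracker_release(t, WRITEPAGES_RWSEM);
	return ok;
}
```

In this model, running xattr_path() followed by writepages_old() reports an inversion, while xattr_path() followed by writepages_new() does not, because xattr_sem is never acquired with the rwsem held.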

This is safe because the other caller of ext4_do_writepages()
(ext4_normal_submit_inode_data_buffers, invoked by jbd2 during commit)
can never encounter inline data: jbd2 only tracks inodes with
block-mapped dirty ranges registered via ext4_jbd2_inode_add_write(),
and all such registration paths either explicitly bail out when inline
data is present (ext4_journalled_write_end) or are logically
unreachable for inline data inodes (ext4_map_blocks requires block
allocation, ext4_block_zero_eof requires existing blocks).

Reported-by: syzbot+bb2455d02bda0b5701e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 47 +++++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..0c7461ab4fd0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2810,24 +2810,6 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	if (unlikely(ret))
 		goto out_writepages;
 
-	/*
-	 * If we have inline data and arrive here, it means that
-	 * we will soon create the block for the 1st page, so
-	 * we'd better clear the inline data here.
-	 */
-	if (ext4_has_inline_data(inode)) {
-		/* Just inode will be modified... */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out_writepages;
-		}
-		BUG_ON(ext4_test_inode_state(inode,
-				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
-		ext4_journal_stop(handle);
-	}
-
 	/*
 	 * data=journal mode does not do delalloc so we just need to writeout /
 	 * journal already mapped buffers. On the other hand we need to commit
@@ -3038,6 +3020,35 @@ static int ext4_writepages(struct address_space *mapping,
 	if (unlikely(ret))
 		return ret;
 
+	/*
+	 * Clearing inline data acquires xattr_sem, which ranks above
+	 * s_writepages_rwsem.  Do it here before taking the rwsem to avoid
+	 * a circular dependency:
+	 *   ext4_writepages (s_writepages_rwsem) -> ext4_destroy_inline_data
+	 *     (xattr_sem)
+	 *   ext4_xattr_block_set (xattr_sem) -> iput -> ext4_writepages
+	 *     (s_writepages_rwsem)
+	 *
+	 * This is only needed in the ext4_writepages() path.  The other
+	 * caller of ext4_do_writepages() -- ext4_normal_submit_inode_data_buffers
+	 * (jbd2 commit callback) -- cannot encounter inline data because jbd2
+	 * only tracks inodes with block-mapped dirty ranges registered via
+	 * ext4_jbd2_inode_add_write(), and all such callers either bail out
+	 * for inline data inodes (e.g. ext4_journalled_write_end) or are
+	 * unreachable for them (ext4_map_blocks, ext4_block_zero_eof).
+	 */
+	if (ext4_has_inline_data(mapping->host)) {
+		handle_t *handle;
+
+		handle = ext4_journal_start(mapping->host, EXT4_HT_INODE, 1);
+		if (IS_ERR(handle))
+			return PTR_ERR(handle);
+		BUG_ON(ext4_test_inode_state(mapping->host,
+				EXT4_STATE_MAY_INLINE_DATA));
+		ext4_destroy_inline_data(handle, mapping->host);
+		ext4_journal_stop(handle);
+	}
+
 	alloc_ctx = ext4_writepages_down_read(sb);
 	ret = ext4_do_writepages(&mpd);
 	/*
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v6 01/11] fstests: add _loop_image_create_clone() helper
From: Darrick J. Wong @ 2026-06-09 14:37 UTC (permalink / raw)
  To: Anand Suveer Jain
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <9c0989d8-202f-42ab-9347-df082c25aa72@kernel.org>

On Mon, Jun 08, 2026 at 10:39:04PM +0800, Anand Suveer Jain wrote:
> On 29/5/26 12:27, Darrick J. Wong wrote:
> > On Thu, May 28, 2026 at 12:05:32PM +0800, Anand Jain wrote:
> > > Introduce _loop_image_create_clone(), which runs mkfs on an image file,
> > > clones it to a second image file, and attaches a loop device to each, and
> > > _loop_image_destroy(), which tears both down again.
> > > 
> > > Signed-off-by: Anand Jain <asj@kernel.org>
> > > ---
> > >  common/rc | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 63 insertions(+)
> > > 
> > > diff --git a/common/rc b/common/rc
> > > index 79189e7e6e94..d7e3e0bdfb1e 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -1520,6 +1520,69 @@ _scratch_resvblks()
> > >  	esac
> > >  }
> > > +# Create a small loop image, run an optional tuning function ($2) on it,
> > > +# clone it, and attach both to loop devices, returned in ($1).
> > > +# Args:
> > > +#   $1: Nameref to return the array of allocated loop devices [base, clone].
> > > +#   $2: Optional callback function to tune the base filesystem before cloning.
> > > +_loop_image_create_clone()
> > > +{
> > > +	local -n _ret=$1
> > 
> > That switch   ^^ is very clever.  I always wondered how one did indirect
> > variables in bash.
> > 
> > > +	local pre_clone_tune_func="$2"
> > > +	local img_file=$TEST_DIR/${seq}.img
> > > +	local img_file_clone=$TEST_DIR/${seq}_clone.img
> > > +	local size=$(_small_fs_size_mb 128) # Smallest possible
> > > +	local loop_devs
> > > +
> > > +	# Since we copy the block device image, we keep its size small.
> > > +	_require_fs_space $TEST_DIR $((size * 1024))
> > > +
> > > +	_create_file_sized $((size * 1024 * 1024)) $img_file ||
> > > +				_fail "Failed: Create $img_file $size"
> > > +
> > > +	loop_devs=$(_create_loop_device $img_file)
> > > +	_ret=($loop_devs)
> > 
> > Should this check that a loopdev actually got created?
> > 
> 
> Hmm, _create_loop_device() already calls _fail if creation fails,
> so there is no need to duplicate the check, right?

Oh right.  Question withdrawn.

> > > +	case $FSTYP in
> > > +	xfs)
> > > +		_mkfs_dev "-s size=4096" ${loop_devs[0]}
> > > +		;;
> > > +	btrfs)
> > > +		_mkfs_dev ${loop_devs[0]}
> > > +		;;
> > > +	*)
> > > +		_mkfs_dev ${loop_devs[0]}
> > > +		;;
> > > +	esac
> > > +
> > > +	# Only execute if the function argument is not empty
> > > +	if [ -n "$pre_clone_tune_func" ]; then
> > > +		$pre_clone_tune_func ${loop_devs[0]}
> > > +	fi
> > > +
> > > +	sync ${loop_devs[0]}
> > > +	cp $img_file $img_file_clone
> > > +
> 
> 
> > > +	loop_devs="$loop_devs $(_create_loop_device $img_file_clone)"
> > 
> > 	local lodev="$(_create_loop_device ...)"
> > 
> > 	test -z "$lodev" && _fail "second loopdev not created"
> > 	_ret+=("$lodev")
> > 
> > ?
> 
> If the second `_create_loop_device()` happens to fail, it will
> already have called `_fail`, so "second loopdev..." won't be
> used at all.

<nod> Both comments withdrawn :)

--D

> 
> Thanks, Anand
> 
> 
> 
> > > +
> > > +	_ret=($loop_devs)
> > > +}
> > > +
> > > +# Teardown loop devices and delete their underlying backing image files.
> > > +# Accepts a list of loop device paths (e.g., /dev/loop0 /dev/loop1).
> > > +_loop_image_destroy()
> > > +{
> > > +	for d in "$@"; do
> > > +		# Retrieve the path of the backing file
> > > +		local f=$(losetup --noheadings --output BACK-FILE $d)
> > > +
> > > +		# Detach the loop device from the backing file
> > > +		_destroy_loop_device "$d"
> > > +
> > > +		# Clean up the backing disk image file
> > > +		[ -n "$f" ] && rm -f "$f"
> > > +	done
> > > +}
> > >  # Repair scratch filesystem.  Returns 0 if the FS is good to go (either no
> > >  # errors found or errors were fixed) and nonzero otherwise; also spits out
> > > -- 
> > > 2.43.0
> > > 
> > > 
> 
> 

^ permalink raw reply

* Re: [PATCH v4] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-09 13:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, sashiko-reviews, linux-ext4,
	linux-kernel, syzbot+0c89d865531d053abb2d
In-Reply-To: <o3k4wongcbuacu4rjsb7h2utzsrhpnun55vzdnp46imnlbn5x6@matvyu6j2xhc>

Hi Jan,

Thank you very much for the incredibly detailed review and the design
insights!

I completely agree with your suggestion. Rushing too many fixes for
complex, concurrent race conditions into a single patch makes the
code harder to review and risks introducing subtle regressions.

Let's go ahead with the simple and straightforward v1 patch (which
has your Reviewed-by) to fix the original syzbot crash for now.

I will take your excellent suggestion to use fsdata for state
communication between write_begin and write_end, and I will work on
formulating a separate, cleaner patch series in the future to
address the remaining concurrent locking races you mentioned.

I will withdraw this v4 thread for now.

Thanks again for your guidance!

Best regards,
Aditya Prakash Srivastava

^ permalink raw reply

* Re: [PATCH v4] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Jan Kara @ 2026-06-09 12:46 UTC (permalink / raw)
  To: Aditya Prakash Srivastava
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, sashiko-reviews,
	linux-ext4, linux-kernel, syzbot+0c89d865531d053abb2d
In-Reply-To: <20260609062005.1702-1-aditya.ansh182@gmail.com>

On Tue 09-06-26 06:20:05, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
> 
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
> 
> Additionally, two separate TOCTOU race conditions exist due to concurrent
> ext4_page_mkwrite() execution:
> 1) A concurrent ext4_page_mkwrite() can execute ext4_convert_inline_data()
> between write_begin and write_end, clearing the inline flags. Since block
> buffers were not allocated in write_begin, this results in a NULL pointer
> dereference in the write_end fallback paths because folio_buffers(folio) is
> NULL.
> 2) If ext4_convert_inline_data() clears the flags exactly after the inline
> flags checks pass in write_end, but before ext4_write_inline_data_end()
> acquires the xattr semaphore, the subsequent check will hit a panic via
> BUG_ON(!ext4_has_inline_data(inode)).

Yes, locking of inline data writes is broken (and difficult to fix). Your
v1 patch was actually a simple and obvious improvement of the situation.
These additional fixes belong in separate patches.

> Fix these issues completely by:
> 1) Having write_end functions (ext4_write_end(),
> ext4_journalled_write_end(), and ext4_da_do_write_end()) return 0
> (VFS retry) if they fall through to the block fallback path and detect
> that folio_buffers(folio) is NULL, after safely stopping any active
> journal handle (protecting against a NULL handle panic in
> ext4_put_nojournal()).
> 2) Replacing BUG_ON(!ext4_has_inline_data(inode)) inside
> ext4_write_inline_data_end() with a graceful error path. If the inline flag
> is cleared after locking the xattr, we unlock the xattr, release the iloc,
> unlock/put the folio, stop the journal, and return 0 to trigger a retry.
> 
> Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
> Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> ---
> v4:
>   - Address critical TOCTOU race condition (reported by Sashiko AI review):
>     * Scenario: A buffered write holds the folio lock and evaluates the inline
>       flags checks in write_end to true. Before it enters or locks the xattr_sem
>       in ext4_write_inline_data_end(), a concurrent memory-mapped page fault
>       (ext4_page_mkwrite()) converts the inline data to an extent. This page fault
>       bypasses the folio lock (since ext4_convert_inline_data() runs lockless),
>       acquires the xattr_sem, and clears the inline flags. When the buffered write
>       resumes and enters ext4_write_inline_data_end(), it acquires the xattr_sem
>       and immediately triggers BUG_ON(!ext4_has_inline_data(inode)) causing a
>       kernel panic.
>     * Fix: Replace the BUG_ON() with a graceful error-handling retry path that
>       releases all resources (locks/buffers/folios/journals) and returns 0.
> v3:
>   - Fix journal handle leak and NULL handle crash (reported by Sashiko AI review):
>     * Scenario 1 (leak): During a delayed allocation write (ext4_da_write_begin),
>       inline data was prepared and a transaction handle started. If a concurrent
>       page fault converts the inline data before write_end, ext4_da_write_end()
>       falls through to ext4_da_do_write_end(). If the fallback check for
>       !folio_buffers(folio) returns 0 to retry without calling ext4_journal_stop(),
>       the transaction handle is leaked open-ended, eventually hanging the filesystem.
>     * Scenario 2 (crash): If we blindly call ext4_journal_stop() on a NULL handle
>       (e.g., when no transaction was started because we never took the inline path),
>       __ext4_journal_stop() delegates to ext4_put_nojournal(NULL) which triggers
>       BUG_ON(ref_cnt == 0), panicking the kernel.
>     * Fix: Retrieve the active handle in ext4_da_do_write_end() and stop it
>       if non-NULL. Also explicitly check "if (handle)" before calling
>       ext4_journal_stop() in ext4_write_end() and ext4_journalled_write_end().
> v2:
>   - Address TOCTOU race condition (reported by Sashiko AI review):
>     * Scenario: A concurrent ext4_page_mkwrite() converts inline data to extents
>       and clears the flags between ext4_write_begin() and write_end(). The
>       write_end function falls through to the block fallback path. Since block
>       buffers were not allocated in write_begin (because it took the inline path),
>       folio_buffers(folio) is NULL, causing a NULL pointer dereference in
>       ext4_journalled_zero_new_buffers() or ext4_walk_page_buffers(), or silent
>       data loss in the standard write path.
>     * Fix: Have the write_end functions return 0 if folio_buffers(folio) is NULL,
>       triggering a safe VFS-level retry. On the next write attempt, the inline
>       flags will be detected as cleared, and blocks/buffers will be properly allocated.
>  fs/ext4/inline.c |  9 ++++++++-
>  fs/ext4/inode.c  | 24 ++++++++++++++++++++++--
>  2 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..161136e84661 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -812,7 +812,14 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
>  			goto out;
>  		}
>  		ext4_write_lock_xattr(inode, &no_expand);
> -		BUG_ON(!ext4_has_inline_data(inode));
> +		if (unlikely(!ext4_has_inline_data(inode))) {
> +			ext4_write_unlock_xattr(inode, &no_expand);
> +			brelse(iloc.bh);
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			ext4_journal_stop(handle);
> +			return 0;
> +		}

This deserves a comment before the 'if' that we could have raced with
ext4_page_mkwrite() converting the inode and so we just retry the whole
write.

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..bc2688e03c19 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1455,6 +1455,14 @@ static int ext4_write_end(const struct kiocb *iocb,
>  		return ext4_write_inline_data_end(inode, pos, len, copied,
>  						  folio);
>  
> +	if (unlikely(!folio_buffers(folio))) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		if (handle)
> +			ext4_journal_stop(handle);
> +		return 0;
> +	}
> +

Ouch, this is a crude hack. I think a much cleaner solution would be for
ext4_write_begin() to record in fsdata the state in which it prepared the
inode (inline, extent based) - we already use that mechanism to communicate
some state for delayed allocations. Then the ->write_end handler will use
fsdata (not inode state) to determine what function to call. IMHO the code
will be much more obvious that way.
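
A very rough userspace sketch of that idea, just to show the shape of the dispatch. Everything here is hypothetical (struct fake_inode, enum prep_state, sketch_write_begin(), sketch_write_end()); it is not the ext4 code and it says nothing about what the inline path itself has to do when a conversion raced in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* How the begin step prepared the write; hypothetical names. */
enum prep_state { PREP_BLOCKS = 0, PREP_INLINE = 1 };

struct fake_inode {
	bool has_inline_data;	/* may be flipped concurrently in the real code */
};

/* Begin step: decide once and remember the decision in *fsdata. */
static int sketch_write_begin(struct fake_inode *inode, void **fsdata)
{
	enum prep_state st;

	assert(inode != NULL && fsdata != NULL);
	st = inode->has_inline_data ? PREP_INLINE : PREP_BLOCKS;
	*fsdata = (void *)(uintptr_t)st;
	return 0;
}

/* End step: pick the handler from fsdata, not from current inode state. */
static enum prep_state sketch_write_end(struct fake_inode *inode, void *fsdata)
{
	(void)inode;	/* deliberately not consulted for the dispatch */
	return (enum prep_state)(uintptr_t)fsdata;
}
```

With this arrangement the end step always matches whatever the begin step prepared, even if the inode flag changed in the meantime.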

> @@ -3231,7 +3248,10 @@ static int ext4_da_do_write_end(struct address_space *mapping,
>  	if (unlikely(!folio_buffers(folio))) {
>  		folio_unlock(folio);
>  		folio_put(folio);
> -		return -EIO;
> +		handle = ext4_journal_current_handle();
> +		if (handle)
> +			ext4_journal_stop(handle);
> +		return 0;
>  	}
>  	/*
>  	 * block_write_end() will mark the inode as dirty with I_DIRTY_PAGES

Huh, what is this about? It definitely looks very suspicious...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: fix circular lock dependency in ext4_ext_migrate
From: Jan Kara @ 2026-06-09 12:05 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260609084007.3432061-1-yun.zhou@windriver.com>

On Tue 09-06-26 16:40:07, Yun Zhou wrote:
> Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
> circular lock dependency between s_writepages_rwsem and sb_internal
> (freeze protection).
> 
> The deadlock scenario:
> 
>   CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
>   ----                           ----
>   ext4_ext_migrate()
>     ext4_writepages_down_write()
>       s_writepages_rwsem (write)
>                                  ext4_evict_inode()
>                                    sb_start_intwrite()   [sb_internal]
>                                    ...
>                                      ext4_writepages()
>                                        s_writepages_rwsem (read) [BLOCKED]
>     iput(tmp_inode)
>       ext4_evict_inode()
>         sb_start_intwrite()         [BLOCKED]
> 
> The tmp_inode is a temporary inode with nlink=0 created solely for
> building the extent tree.  Its eviction does not require
> s_writepages_rwsem protection, so deferring iput() until after
> releasing the rwsem is safe.
> 
> Reported-by: syzbot+f0b58a1f5075a90dd9a5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=f0b58a1f5075a90dd9a5
> Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one nit below:

> @@ -591,9 +592,10 @@ int ext4_ext_migrate(struct inode *inode)
>  	ext4_journal_stop(handle);
>  out_tmp_inode:
>  	unlock_new_inode(tmp_inode);
> -	iput(tmp_inode);
>  out_unlock:
>  	ext4_writepages_up_write(inode->i_sb, alloc_ctx);
> +	if (tmp_inode)
> +		iput(tmp_inode);

iput(NULL) is properly handled so you don't need the if (tmp_inode) check
here.
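
For illustration, the pattern is the same as with free(NULL): the release helper itself tolerates NULL, so error paths can call it unconditionally. A minimal userspace sketch, with struct obj and obj_put() as hypothetical stand-ins for the inode and iput():

```c
#include <assert.h>
#include <stddef.h>

struct obj {
	int refcount;
	int evicted;	/* set once the last reference is dropped */
};

/* Drop one reference; a NULL pointer is silently ignored. */
static void obj_put(struct obj *o)
{
	if (!o)
		return;
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->evicted = 1;
}
```

So a shared exit label can simply call the put helper whether or not the object was ever allocated.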

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Jan Kara @ 2026-06-09 11:17 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, dmonakhov, linux-ext4, linux-kernel
In-Reply-To: <20260608152521.1292656-1-yun.zhou@windriver.com>

On Mon 08-06-26 23:25:21, Yun Zhou wrote:
> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> belong to the same superblock as the original file.  Currently, this
> validation is performed inside ext4_move_extents() by
> mext_check_validity(), but only after lock_two_nondirectories() has
> already acquired the inode locks.  When the donor fd refers to a file
> on a different filesystem (e.g., overlayfs), this late validation
> creates a circular lock dependency:
> 
>   CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
>   ----                              ----
>   inode_lock(ovl_inode)
>                                     mnt_want_write_file(filp)
>                                       sb_start_write(ext4_sb)   [sb_writers]
>     backing_file_write_iter()
>       vfs_iter_write(real_file)
>         file_start_write(real_file)
>           sb_start_write(ext4_sb)   [blocked by freeze]
>                                     lock_two_nondirectories()
>                                       inode_lock(ovl_inode)     [blocked]
> 
> With a concurrent freeze operation holding sb_writers write side, this
> forms a deadlock cycle: CPU0 waits for freeze to complete, freeze waits
> for CPU1's sb_writers reader to exit, CPU1 waits for CPU0's inode lock.
> 
> Since EXT4_IOC_MOVE_EXT exchanges physical extents between two files,
> it fundamentally requires both files to reside on the same ext4
> filesystem.  Moving the superblock check before any lock acquisition
> is both semantically correct and eliminates the circular dependency
> by ensuring that cross-filesystem donor fds are rejected before
> sb_writers or inode locks are taken.
> 
> Fixes: fcf6b1b729bc ("ext4: refactor ext4_move_extents code base")
> Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Good catch. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ioctl.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 1d0c3d4bdf47..f7cd419a3218 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -1650,6 +1650,9 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		if (!(fd_file(donor)->f_mode & FMODE_WRITE))
>  			return -EBADF;
>  
> +		if (file_inode(filp)->i_sb != file_inode(fd_file(donor))->i_sb)
> +			return -EXDEV;
> +
>  		err = mnt_want_write_file(filp);
>  		if (err)
>  			return err;
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 0/4] show orphan file inode detail info
From: Jan Kara @ 2026-06-09 11:13 UTC (permalink / raw)
  To: yebin; +Cc: Jan Kara, tytso, adilger.kernel, linux-ext4
In-Reply-To: <6A26AB14.8050508@huaweicloud.com>

On Mon 08-06-26 19:44:20, yebin wrote:
> On 2026/4/16 1:59, Jan Kara wrote:
> > On Wed 15-04-26 18:55:01, Ye Bin wrote:
> > > From: Ye Bin <yebin10@huawei.com>
> > > 
> > > Diffs v2 vs v1:
> > > (1) Fix sashiko review issues:
> > > https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
> > > (2) Change "orphan_list" file mode from 0444 to 0400;
> > > (3) The display format of the "orphan_list" file is modified according
> > >      to Andreas' suggestions.
> > > Fault injection tests have been conducted to address the issues raised
> > > in the sashiko review. There is no UAF issue in the ext4_seq_orphan_release()
> > > function. The reason for this has already been explained in the code comments.
> > > In addition to the fault injection tests, we also performed a stress test by
> > > observing the /proc/fs/ext4/XX/orphan_list and the concurrent processes of
> > > adding and removing orphan nodes, and no issues were found so far.
> > > 
> > > 
> > > In actual production environments, the issue of inconsistency between
> > > df and du is frequently encountered. In many cases, the cause of the
> > > problem can be identified through the use of lsof. However, when
> > > overlayfs is combined with project quota configuration, the issue becomes
> > > more complex and troublesome to diagnose. First, to determine the project
> > > ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
> > > then retrieve file information through `debugfs`. However, the file names
> > > cannot always be obtained, and it is often unclear which files they are.
> > > To identify which files these are, one would need to use crash for online
> > > debugging or use kprobe to gather information incrementally. However, some
> > > customers in production environments do not agree to upload any tools, and
> > > online debugging might impact the business. There are also scenarios where
> > > files are opened in kernel mode, which do not generate file descriptors(fds),
> > > making it impossible to identify which files were deleted but still have
> > > references through lsof. This patchset adds a procfs interface to query
> > > information about orphaned nodes, which can assist in the analysis and
> > > localization of such issues.
> > 
> > Ye, did you read my comments to the v1 of the patchset [1]? I didn't see
> > any reply from you. I don't think this is a good way how to expose orphan
> > information for a filesystem for reasons I've outlined in that email.
> > 
> 
> Hi Jan
> 
> I thought about how to prevent resource exhaustion caused by making too many
> FDs in a single application. My idea is that IOCTL should only obtain one FD
> at a time, and the next time it should start obtaining orphan nodes from the
> inode after the previous one. Each time an fd is obtained, the previous fd
> should be closed. I expect that after traversing all the fds from the beginning,
> they will all be closed and there will be no need for user space to close them
> manually. I wonder if this approach is feasible? Or do you have any good
> suggestions?

Hum, I think you've misunderstood my suggestion in [1]. What I suggested
is:

1) Provide an ioctl GET_ORPHAN_FILES that returns one "virtual" fd that
tracks the state of iteration over the orphan entries of a superblock.

2) Reading from this fd returns file *handles* (as struct file_handle)
describing the orphan inodes. A struct file_handle does not occupy any
kernel resources. It is essentially just a filesystem agnostic container
for the inode number and inode generation number. Userspace can then use
the open_by_handle_at() syscall to convert a struct file_handle into a
normal file descriptor, but that is up to userspace and what it wants the
orphan information for.
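
To illustrate the userspace side of step 2, here is a minimal sketch. The
GET_ORPHAN_FILES ioctl does not exist yet, so the sketch obtains a handle
with name_to_handle_at(2) instead of reading it from the proposed fd. The
helper names handle_for_path() and open_handle() are made up for this
example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Encode a path into a struct file_handle. This stands in for reading a
 * handle from the proposed (not yet existing) orphan iteration fd.
 * Returns a malloc()ed handle, or NULL on failure.
 */
struct file_handle *handle_for_path(const char *path, int *mount_id)
{
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

	if (!fh)
		return NULL;
	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) < 0) {
		free(fh);
		return NULL;
	}
	return fh;
}

/*
 * Convert a handle into a normal file descriptor. mount_fd must refer to
 * an object on the filesystem the handle came from. The call needs
 * CAP_DAC_READ_SEARCH, so it fails for unprivileged callers.
 */
int open_handle(int mount_fd, struct file_handle *fh)
{
	return open_by_handle_at(mount_fd, fh, O_RDONLY);
}
```

The handle itself is only a small blob of bytes in user memory, so a tool
can collect many of them and open only the inodes it actually cares about.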

Is the design clearer now?

								Honza

[1] https://lore.kernel.org/all/n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq/
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Zhang Yi @ 2026-06-09 11:12 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <d3afb165-49f7-4c7a-b110-95d6a94a8845@linux.alibaba.com>

On 6/9/2026 5:43 PM, Baokun Li wrote:
> On 2026/6/9 10:48, Zhang Yi wrote:
>> Hi, Baokun!
> 
> Hi Yi,
> 
> Thank you for your review!
> 
>> On 6/8/2026 7:11 PM, Baokun Li wrote:
>>> The mke2fs man page documents:
>>>
>>>   Valid cluster-size values are from 2048 to 256M bytes per cluster.
>> Hmm, I checked the mke2fs(8)[1] and didn't find this sentence, instead,
>> it said:
>>
>>   Valid cluster-size values range from 2 to 32768 times the filesystem
>>   block size and must be a power of 2.
> 
> Oh, I was indeed looking at a slightly older man page. It was changed to
> the current description in commit 87cbb381f2e2 ("e2fsprogs:
> misc/mke2fs.8.in: Correct valid cluster-size values").
> 
> That commit assumes the number of blocks in a cluster cannot exceed the
> maximum number of blocks that a single extent can hold
> (EXT_INIT_MAX_LEN = 32768). I believe this is wrong — a single cluster
> can be covered by multiple extents. So we should fix the bug in
> ext_falloc_helper() rather than working around it by adding a
> restriction to the documentation.
> 
> The root cause is in ext_falloc_helper():
> 
>     max_uninit_len = EXT_UNINIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);
>     max_init_len = EXT_INIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);
> 
> When cluster_ratio >= EXT_INIT_MAX_LEN (32768), the `& ~CLUSTER_MASK`
> operation zeroes out both values, causing ext2fs_new_range() to receive
> len=0 and return EXT2_ET_INVALID_ARGUMENT.
> 
> The kernel handles this correctly — ext4_ext_map_blocks() simply
> truncates m_len to EXT_INIT_MAX_LEN and lets the caller loop.  On
> subsequent iterations, get_implied_cluster_alloc() detects that the
> requested block falls within an already-allocated cluster by examining
> adjacent extents, and returns the physical address without allocating a
> new cluster.
> 
> I'll send a patch to fix ext_falloc_helper() so that when
> cluster_ratio >= EXT_INIT_MAX_LEN, the allocation loop can create
> multiple extents within the same cluster — skipping ext2fs_new_range()
> and claim_range() for mid-cluster iterations where the physical blocks
> are already claimed.
> 
>>
>> However, the implementation of mkfs does not seem to respect this
>> constraint and instead directly uses static macros to limit the
>> user-specified cluster size.
>>
>>   #define EXT2_MIN_BLOCK_LOG_SIZE         10      /* 1024 */
>>   #define EXT2_MIN_CLUSTER_LOG_SIZE       EXT2_MIN_BLOCK_LOG_SIZE
>>   #define EXT2_MAX_CLUSTER_LOG_SIZE       29      /* 512MB  */
> 
> I was going to set EXT2_MAX_CLUSTER_LOG_SIZE to 28 to align with the
> kernel's EXT4_MAX_CLUSTER_LOG_SIZE.
> 
>>
>> This is confusing, or perhaps I missed something. If I understand
>> correctly, users can now format an image with a maximum cluster size of
>> 512 MB (I tried it, and it worked).
> 
> 
> Yes, a 512M cluster size also works correctly on the current kernel.
> The upper limit on cluster size here is to avoid the 32-bit overflow issue
> that Sashiko mentioned — theoretically the maximum could be 512M.
> 
> It's just that the old mke2fs documentation originally specified 256M as
> the intended maximum, so both the kernel and e2fsprogs target 256M as the
> upper limit. If anyone prefers 512M, I'm happy to change it.
> 
>>  If the kernel's
>> EXT4_MAX_CLUSTER_LOG_SIZE is limited to 28, this would cause such
>> existing images to become unmountable, even though I don't think images
>> with such a large cluster size actually exist in practice. Therefore,
>> I'm not sure this is safe.
>>
>> [1] https://man7.org/linux/man-pages/man8/mke2fs.8.html
>>
> 
> I personally don't think such images are likely to exist in practice,
> since formatting with the default 4K block size would already fail.
> 
> 

FYI, I tried the following parameters:

 # mkfs.ext4 -F -C 512M -O bigalloc,^has_journal /dev/sda
 mke2fs 1.47.0 (5-Feb-2023)

 Warning: bigalloc file systems with a cluster size greater than
 16 times the block size is considered experimental
 /dev/sda contains a ext4 file system
         created on Tue Jun  9 07:01:10 2026
 Creating filesystem with 39321600 4k blocks and 304 inodes

 Allocating group tables: done
 Writing inode tables: done
 Writing superblocks and filesystem accounting information: done

After this patch, the kernel refuses to mount this filesystem and complains:

 EXT4-fs (sda): Invalid log cluster size: 19

Thanks,
Yi.


^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Jan Kara @ 2026-06-09 11:04 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-3-libaokun@linux.alibaba.com>

On Mon 08-06-26 19:11:49, Baokun Li wrote:
> The mke2fs man page documents:
> 
>   Valid cluster-size values are from 2048 to 256M bytes per cluster.
> 
> but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
> filesystem images to specify cluster sizes up to 1GB.
> 
> On 32-bit systems with bigalloc enabled, the consistency check in
> ext4_handle_clustersize():
> 
>   s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)
> 
> can overflow when the cluster ratio is large enough. Since
> s_blocks_per_group is not range-checked in the bigalloc path, the
> wrapped product can pass the consistency check, leading to inconsistent
> group geometry and potential out-of-bounds block allocation.
> 
> Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
> limit. With this cap, the maximum product is:
> 
>   (blocksize * 8) * (256M / blocksize) = 2^31
> 
> which fits safely in a 32-bit unsigned long for all block sizes.
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

I guess I'll leave it up to Ted to decide whether 28 or 29 is the right
value. Both are fine with me if we are consistent with them :)
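
For reference, the worst-case product from the changelog can be worked out
with a small sketch. It assumes clusters per group sits at its cap of
blocksize * 8, and the helper names max_blocks_per_group() and
fits_in_u32() are made up for this example. The math is done in 64 bits so
the overflow is visible instead of wrapping.

```c
#include <stdint.h>

/*
 * Largest possible blocks-per-group product for a given block size and
 * cluster size (both as log2 of bytes):
 *   (blocksize * 8) clusters per group * (clustersize / blocksize)
 * Requires cluster_log >= block_log.
 */
uint64_t max_blocks_per_group(unsigned int block_log, unsigned int cluster_log)
{
	uint64_t max_clusters = (UINT64_C(1) << block_log) * 8;
	uint64_t ratio = UINT64_C(1) << (cluster_log - block_log);

	return max_clusters * ratio;
}

/* Non-zero if v is representable in a 32-bit unsigned long. */
int fits_in_u32(uint64_t v)
{
	return v <= UINT32_MAX;
}
```

The block size cancels out, so a cluster log size of 28 gives 2^31 for
every block size, and each step above 28 doubles that.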

								Honza

> ---
>  fs/ext4/ext4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..11e41a864db8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -334,7 +334,7 @@ struct ext4_io_submit {
>  #define	EXT4_MAX_BLOCK_SIZE		65536
>  #define EXT4_MIN_BLOCK_LOG_SIZE		10
>  #define EXT4_MAX_BLOCK_LOG_SIZE		16
> -#define EXT4_MAX_CLUSTER_LOG_SIZE	30
> +#define EXT4_MAX_CLUSTER_LOG_SIZE	28
>  #ifdef __KERNEL__
>  # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
>  #else
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Jan Kara @ 2026-06-09 11:00 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-4-libaokun@linux.alibaba.com>

On Mon 08-06-26 19:11:50, Baokun Li wrote:
> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
> division that computes s_itb_per_group truncates, reserving fewer blocks
> for the inode table than needed.
> 
> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
> compute a block offset beyond the inode table, reading unrelated data as
> an inode structure.
> 
> Add the missing divisibility check alongside the existing validation in
> ext4_block_group_meta_init().
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>
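
A small sketch of the truncation the changelog describes, with made-up
helper names and example numbers (4K blocks with 256-byte inodes, so 16
inodes per block). It is not the kernel code, just the arithmetic.

```c
/* Inode table blocks reserved per group, using truncating division. */
unsigned long itb_reserved(unsigned long inodes_per_group,
			   unsigned long inodes_per_block)
{
	return inodes_per_group / inodes_per_block;
}

/* Inode table blocks actually needed to hold every inode in the group. */
unsigned long itb_needed(unsigned long inodes_per_group,
			 unsigned long inodes_per_block)
{
	return (inodes_per_group + inodes_per_block - 1) / inodes_per_block;
}
```

With 8200 inodes per group, the value is a multiple of 8 and so passes the
"& 7" check, but 8200 / 16 reserves 512 blocks while 513 are needed. The
new "%" check rejects exactly this case.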

								Honza

> ---
>  fs/ext4/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ddcb4a8d4db..5ec9e1ef00c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>  	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> -	    sbi->s_inodes_per_group & 7) {
> +	    sbi->s_inodes_per_group & 7 ||
> +	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 1/3] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Jan Kara @ 2026-06-09 10:57 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-2-libaokun@linux.alibaba.com>

On Mon 08-06-26 19:11:48, Baokun Li wrote:
> The block and inode bitmap checksums are computed over a whole number of
> bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
> ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
> length passed to ext4_chksum().
> 
> If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
> trailing fractional bits are excluded from the checksum.  Those bits are
> then unprotected, and any incremental csum update path that assumes a
> byte-aligned bitmap can compute a checksum inconsistent with the full
> recalculation, corrupting the on-disk bitmap checksum.
> 
> Reject such filesystems at mount time by adding the missing " & 7"
> alignment checks alongside the existing range validation.
> 
> Suggested-by: Theodore Ts'o <tytso@mit.edu>
> Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>
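
To make the unprotected tail concrete, here is a minimal sketch of the
length computation from the changelog. The helper names are made up and
the per-group counts are example values.

```c
/* Bytes of bitmap covered by the checksum: bits per group divided by 8. */
unsigned int csum_len_bytes(unsigned int bits_per_group)
{
	return bits_per_group >> 3;
}

/* Trailing bitmap bits that fall outside the checksummed bytes. */
unsigned int uncovered_bits(unsigned int bits_per_group)
{
	return bits_per_group & 7;
}
```

For 8192 bits the checksum covers 1024 bytes and nothing is left over. For
8197 bits it still covers 1024 bytes, and the last 5 bits are not
protected.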

								Honza

> ---
>  fs/ext4/super.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..3ddcb4a8d4db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  		sbi->s_cluster_bits = 0;
>  	}
>  	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
> -	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
> +	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_clusters_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
>  			 sbi->s_clusters_per_group);
>  		return -EINVAL;
>  	}
> @@ -5304,8 +5305,9 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  		return -EINVAL;
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
> -	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
> +	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_inodes_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
>  	}
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Jan Kara @ 2026-06-09 10:52 UTC (permalink / raw)
  To: Aditya Prakash Srivastava
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel,
	syzbot+0c89d865531d053abb2d
In-Reply-To: <20260608065227.3018-1-aditya.ansh182@gmail.com>

On Mon 08-06-26 06:52:27, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
> 
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
> 
> Fix this by ensuring that ext4_journalled_write_end() only calls
> ext4_write_inline_data_end() if the EXT4_STATE_MAY_INLINE_DATA flag is
> set, mirroring the behavior of ext4_write_end() and ext4_da_write_end().
> 
> Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
> Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..4fce9ec176f8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1560,7 +1560,8 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
>  
>  	BUG_ON(!ext4_handle_valid(handle));
>  
> -	if (ext4_has_inline_data(inode))
> +	if (ext4_has_inline_data(inode) &&
> +	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
>  		return ext4_write_inline_data_end(inode, pos, len, copied,
>  						  folio);
>  
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Jan Kara @ 2026-06-09 10:43 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, Sashiko
In-Reply-To: <20260608061112.392391-1-libaokun@linux.alibaba.com>

On Mon 08-06-26 14:11:12, Baokun Li wrote:
> The block and inode bitmap checksums are computed over a whole number of
> bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
> ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
> length passed to ext4_chksum().
> 
> If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
> trailing fractional bits are excluded from the checksum.  Those bits are
> then unprotected, and any incremental csum update path that assumes a
> byte-aligned bitmap can compute a checksum inconsistent with the full
> recalculation, corrupting the on-disk bitmap checksum.
> 
> Reject such filesystems at mount time by adding the missing " & 7"
> alignment checks alongside the existing range validation.
> 
> Suggested-by: Theodore Ts'o <tytso@mit.edu>
> Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..3daf4cdcf07e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  		sbi->s_cluster_bits = 0;
>  	}
>  	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
> -	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
> +	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_clusters_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
>  			 sbi->s_clusters_per_group);
>  		return -EINVAL;
>  	}
> @@ -5304,7 +5305,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  		return -EINVAL;
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
> -	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
> +	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_inodes_per_group & 7) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Baokun Li @ 2026-06-09  9:43 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <06543725-5f38-4cac-9fdd-8a72cfb28f84@huaweicloud.com>

On 2026/6/9 10:48, Zhang Yi wrote:
> Hi, Baokun!

Hi Yi,

Thank you for your review!

> On 6/8/2026 7:11 PM, Baokun Li wrote:
>> The mke2fs man page documents:
>>
>>   Valid cluster-size values are from 2048 to 256M bytes per cluster.
> Hmm, I checked the mke2fs(8)[1] and didn't find this sentence, instead,
> it said:
>
>   Valid cluster-size values range from 2 to 32768 times the filesystem
>   block size and must be a power of 2.

Oh, I was indeed looking at a slightly older man page. It was changed to
the current description in commit 87cbb381f2e2 ("e2fsprogs:
misc/mke2fs.8.in: Correct valid cluster-size values").

That commit assumes the number of blocks in a cluster cannot exceed the
maximum number of blocks that a single extent can hold
(EXT_INIT_MAX_LEN = 32768). I believe this is wrong — a single cluster
can be covered by multiple extents. So we should fix the bug in
ext_falloc_helper() rather than working around it by adding a
restriction to the documentation.

The root cause is in ext_falloc_helper():

    max_uninit_len = EXT_UNINIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);
    max_init_len = EXT_INIT_MAX_LEN & ~EXT2FS_CLUSTER_MASK(fs);

When cluster_ratio >= EXT_INIT_MAX_LEN (32768), the `& ~CLUSTER_MASK`
operation zeroes out both values, causing ext2fs_new_range() to receive
len=0 and return EXT2_ET_INVALID_ARGUMENT.
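
The masking can be reproduced in isolation with the sketch below. It
assumes EXT_INIT_MAX_LEN is 1 << 15, EXT_UNINIT_MAX_LEN is one less, and
the cluster mask is cluster_ratio - 1. These values are assumptions made
for the example and should be checked against the e2fsprogs headers. The
macro and function names here are made up.

```c
/* Assumed limits, mirroring the values discussed above. */
#define SKETCH_INIT_MAX_LEN	(1UL << 15)
#define SKETCH_UNINIT_MAX_LEN	(SKETCH_INIT_MAX_LEN - 1)

/*
 * Round a maximum extent length down to a multiple of the cluster ratio,
 * where the ratio is 1 << cluster_ratio_bits blocks per cluster.
 */
unsigned long aligned_max_len(unsigned long max_len,
			      unsigned int cluster_ratio_bits)
{
	unsigned long cluster_mask = (1UL << cluster_ratio_bits) - 1;

	return max_len & ~cluster_mask;
}
```

With a ratio of 16 (4 bits) the limits stay usable at 32768 and 32752.
Under these assumed values the unwritten limit already masks to zero at a
ratio of 32768 (15 bits), and both limits are zero for any larger ratio.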

The kernel handles this correctly — ext4_ext_map_blocks() simply
truncates m_len to EXT_INIT_MAX_LEN and lets the caller loop.  On
subsequent iterations, get_implied_cluster_alloc() detects that the
requested block falls within an already-allocated cluster by examining
adjacent extents, and returns the physical address without allocating a
new cluster.

I'll send a patch to fix ext_falloc_helper() so that when
cluster_ratio >= EXT_INIT_MAX_LEN, the allocation loop can create
multiple extents within the same cluster — skipping ext2fs_new_range()
and claim_range() for mid-cluster iterations where the physical blocks
are already claimed.

>
> However, the implementation of mkfs does not seem to respect this
> constraint and instead directly uses static macros to limit the
> user-specified cluster size.
>
>   #define EXT2_MIN_BLOCK_LOG_SIZE         10      /* 1024 */
>   #define EXT2_MIN_CLUSTER_LOG_SIZE       EXT2_MIN_BLOCK_LOG_SIZE
>   #define EXT2_MAX_CLUSTER_LOG_SIZE       29      /* 512MB  */

I was going to set EXT2_MAX_CLUSTER_LOG_SIZE to 28 to align with the
kernel's EXT4_MAX_CLUSTER_LOG_SIZE.

>
> This is confusing, or perhaps I missed something. If I understand
> correctly, users can now format an image with a maximum cluster size of
> 512 MB (I tried it, and it worked).

Yes, a 512M cluster size also works correctly on the current kernel.
The upper limit on cluster size here is to avoid the 32-bit overflow issue
that Sashiko mentioned — theoretically the maximum could be 512M.

It's just that the old mke2fs documentation originally specified 256M as
the intended maximum, so both the kernel and e2fsprogs target 256M as the
upper limit. If anyone prefers 512M, I'm happy to change it.

>  If the kernel's
> EXT4_MAX_CLUSTER_LOG_SIZE is limited to 28, this would cause such
> existing images to become unmountable, even though I don't think images
> with such a large cluster size actually exist in practice. Therefore,
> I'm not sure this is safe.
>
> [1] https://man7.org/linux/man-pages/man8/mke2fs.8.html
>

I personally don't think such images are likely to exist in practice,
since formatting with the default 4K block size would already fail.

Thanks,
Baokun

^ permalink raw reply
