Linux EXT4 FS development


* Re: [PATCH v5 1/3] ext4: skip extra isize expansion during mount to prevent deadlock
From: Jan Kara @ 2026-06-15 16:20 UTC
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-2-yun.zhou@windriver.com>

On Mon 15-06-26 19:53:15, Yun Zhou wrote:
> ext4_try_to_expand_extra_isize() is called from __ext4_mark_inode_dirty()
> while holding an active jbd2 handle.  During mount (!SB_ACTIVE), the
> expand path may move xattrs to external blocks and release ea_inodes via
> iput().  When !SB_ACTIVE, iput() calls write_inode_now() which acquires
> s_writepages_rwsem, creating a circular lock dependency:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> This can be triggered via:
> 
>   ext4_process_orphan() -> ext4_truncate() -> ext4_mark_inode_dirty()
>     -> ext4_try_to_expand_extra_isize()
> 
> or:
> 
>   ext4_evict_inode() -> ext4_mark_inode_dirty()
>     -> ext4_try_to_expand_extra_isize()
> 
> Skip expansion when !SB_ACTIVE.  This is a minor loss of functionality
> (extra isize won't grow for these inodes during mount), which e2fsck
> can resolve later if needed.
> 
> Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Yeah, this looks like a sensible compromise. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cd7588a3fa45..be1e3eaa4f23 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6479,6 +6479,16 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
>  	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
>  		return -EOVERFLOW;
>  
> +	/*
> +	 * Skip expansion during mount (!SB_ACTIVE).  Expanding extra isize
> +	 * may move xattrs to external blocks and release ea_inodes via iput.
> +	 * When !SB_ACTIVE, iput triggers write_inode_now() which acquires
> +	 * s_writepages_rwsem, causing a deadlock with the caller's active
> +	 * jbd2 handle (lock order: s_writepages_rwsem -> jbd2_handle).
> +	 */
> +	if (unlikely(!(inode->i_sb->s_flags & SB_ACTIVE)))
> +		return -EBUSY;
> +
>  	/*
>  	 * In nojournal mode, we can immediately attempt to expand
>  	 * the inode.  When journaled, we first need to obtain extra
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [GIT PULL] udf, isofs, ext2, quota fixes and cleanups for 7.2-rc1
From: Jan Kara @ 2026-06-15 14:51 UTC
  To: Linus Torvalds; +Cc: linux-fsdevel, linux-ext4

  Hello Linus,

  could you please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fs_for_v7.2-rc1

to get:
  * Assorted udf & isofs fixes for maliciously formatted devices
  * Cleanups to use kmalloc() instead of __get_free_page()
  * Removal of deprecated DAX code from ext2

Top of the tree is 5163e6ee1ea7. The full shortlog is:

Ashwin Gundarapu (1):
      ext2: Remove deprecated DAX support

Bryam Vargas (3):
      isofs: bound Rock Ridge symlink components to the SL record
      udf: validate sparing table length as an entry count, not a byte count
      udf: validate VAT header length against the VAT inode size

Danila Chernetsov (1):
      ext2: fix ignored return value of generic_write_sync()

Jan Kara (1):
      udf: validate VAT inode size for old VAT format

Michael Bommarito (1):
      udf: validate free block extents against the partition length

Mike Rapoport (Microsoft) (2):
      quota: allocate dquot_hash with kmalloc()
      isofs: replace __get_free_page() with kmalloc()

The diffstat is

 fs/ext2/ext2.h   |   4 --
 fs/ext2/file.c   | 125 ++++---------------------------------------------------
 fs/ext2/inode.c  |  60 ++------------------------
 fs/ext2/super.c  |  39 ++---------------
 fs/isofs/dir.c   |   5 ++-
 fs/isofs/rock.c  |  11 +++++
 fs/quota/dquot.c |  11 +++--
 fs/udf/balloc.c  |   5 ++-
 fs/udf/super.c   |  16 ++++++-
 9 files changed, 51 insertions(+), 225 deletions(-)

							Thanks
								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH v5 0/3] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Yun Zhou @ 2026-06-15 11:53 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

This series fixes a circular lock dependency reported by syzbot:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

The deadlock occurs when iput() on an ea_inode triggers write_inode_now()
while xattr_sem and a jbd2 handle are held.  The triggering path is
during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
write_inode_now() synchronously.
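
As a rough illustration only (not part of this series), the ordering
problem above can be modelled as a cycle in a directed graph of
"held -> taken" lock edges, in the spirit of what lockdep reports.  The
enum values and function names below are hypothetical and exist only
for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the locks in the report. */
enum lock_class {
	LOCK_S_WRITEPAGES_RWSEM,
	LOCK_JBD2_HANDLE,
	LOCK_XATTR_SEM,
	LOCK_CLASS_COUNT
};

/* edge[a][b] is true when lock b has been taken while holding lock a. */
struct lock_graph {
	bool edge[LOCK_CLASS_COUNT][LOCK_CLASS_COUNT];
};

static void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

static void lock_graph_add(struct lock_graph *g, enum lock_class held,
			   enum lock_class taken)
{
	g->edge[held][taken] = true;
}

/* Depth-first search: is 'target' reachable from 'from'? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *visited)
{
	assert(from >= 0 && from < LOCK_CLASS_COUNT);
	if (from == target)
		return true;
	if (visited[from])
		return false;
	visited[from] = true;
	for (int next = 0; next < LOCK_CLASS_COUNT; next++) {
		if (g->edge[from][next] && reaches(g, next, target, visited))
			return true;
	}
	return false;
}

/*
 * Taking 'taken' while holding 'held' closes a cycle if 'held' is
 * already reachable from 'taken' through previously recorded edges.
 */
static bool lock_graph_would_deadlock(const struct lock_graph *g,
				      enum lock_class held,
				      enum lock_class taken)
{
	bool visited[LOCK_CLASS_COUNT] = { false };

	return reaches(g, taken, held, visited);
}
```

With the two established edges recorded (s_writepages_rwsem ->
jbd2_handle and jbd2_handle -> xattr_sem), asking whether xattr_sem ->
s_writepages_rwsem may be added reports a cycle, which corresponds to
the iput() path described above.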

Patch 1 blocks the deadlock by skipping extra isize expansion when
!SB_ACTIVE -- this prevents the xattr manipulation path from being
entered during mount.

Patch 2 is a belt-and-suspenders semantic improvement: an inode under
eviction never needs extra isize expansion.

Patch 3 is a structural improvement: defer all ea_inode iput() calls
until after xattr_sem is released, reducing lock nesting depth and
preventing future code paths from reintroducing the lock ordering issue.

Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5

v5:
 - Split into 3 patches for easier review.
 - Add explicit !SB_ACTIVE early-return in ext4_try_to_expand_extra_isize()
   to block ALL mount-time paths (ext4_process_orphan -> ext4_truncate ->
   ext4_mark_inode_dirty), not just the eviction path. v4 only relied on
   EXT4_STATE_NO_EXPAND which doesn't cover orphan truncation.

v4:
 - Comprehensive rewrite of the deferred iput mechanism.
 - Thread ea_inode_array through ext4_expand_extra_isize_ea() and
   ext4_xattr_move_to_block() so ALL ea_inode iputs in the expand
   path are deferred, not just those in ext4_xattr_block_set().
 - Add NULL safety to ext4_expand_inode_array(): when ea_inode_array
   pointer is NULL, fall back to synchronous iput (for callers like
   ext4_initxattrs that only run with SB_ACTIVE).
 - Use __GFP_NOFAIL to guarantee deferred array growth, eliminating
   fallback to synchronous iput under locks.
 - Update ext4_xattr_ibody_set() and ext4_xattr_set_entry() signatures
   to accept ea_inode_array, converting ALL iput(ea_inode) calls.
 - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() before
   ext4_mark_inode_dirty().

v3:
 - Check ext4_expand_inode_array() return value; fallback to
   direct iput() on ENOMEM to prevent inode leak.
 - Make ext4_xattr_set_handle() take an optional ea_inode_array
   output parameter so callers can free after ext4_journal_stop(),
   avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
 - Pass ea_inode_array directly to ext4_xattr_release_block()
   instead of using a local array freed under xattr_sem.
 - Move ext4_xattr_inode_array_free() after ext4_journal_stop()

v2:
 - Defer iput() in ext4_xattr_block_set() via ea_inode_array,
   freed after xattr_sem is released. Fixes the root cause.

v1:
 - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
   on inodes being deleted. Only fixes the syzbot reproducer, not
   the underlying lock ordering violation.

Yun Zhou (3):
  ext4: skip extra isize expansion during mount to prevent deadlock
  ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
  ext4: defer iput() on ea_inodes to reduce lock holding scope

 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 +-
 fs/ext4/inline.c         |  8 ++--
 fs/ext4/inode.c          | 26 +++++++++--
 fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
 fs/ext4/xattr.h          | 10 +++--
 fs/ext4/xattr_security.c |  3 +-
 7 files changed, 95 insertions(+), 51 deletions(-)

-- 
2.43.0



* [PATCH v5 1/3] ext4: skip extra isize expansion during mount to prevent deadlock
From: Yun Zhou @ 2026-06-15 11:53 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-1-yun.zhou@windriver.com>

ext4_try_to_expand_extra_isize() is called from __ext4_mark_inode_dirty()
while holding an active jbd2 handle.  During mount (!SB_ACTIVE), the
expand path may move xattrs to external blocks and release ea_inodes via
iput().  When !SB_ACTIVE, iput() calls write_inode_now() which acquires
s_writepages_rwsem, creating a circular lock dependency:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

This can be triggered via:

  ext4_process_orphan() -> ext4_truncate() -> ext4_mark_inode_dirty()
    -> ext4_try_to_expand_extra_isize()

or:

  ext4_evict_inode() -> ext4_mark_inode_dirty()
    -> ext4_try_to_expand_extra_isize()

Skip expansion when !SB_ACTIVE.  This is a minor loss of functionality
(extra isize won't grow for these inodes during mount), which e2fsck
can resolve later if needed.
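
For readers who want the control flow in isolation, here is a minimal
userspace sketch of the early-return order described above.  It is not
the kernel code: struct model_sb, struct model_inode and
model_try_expand() are hypothetical stand-ins, and only the two error
codes mirror the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the superblock: 'active' models SB_ACTIVE. */
struct model_sb {
	bool active;
};

/* Hypothetical stand-in for the inode state relevant to expansion. */
struct model_inode {
	struct model_sb *sb;
	bool no_expand;		/* models EXT4_STATE_NO_EXPAND */
	int extra_isize;
};

/*
 * Returns 0 on success, -EOVERFLOW if expansion is disabled for this
 * inode, or -EBUSY while the filesystem is still being mounted.
 */
static int model_try_expand(struct model_inode *inode, int new_extra_isize)
{
	assert(inode && inode->sb);

	if (inode->no_expand)
		return -EOVERFLOW;

	/* Mount in progress: skip before any xattr work can start. */
	if (!inode->sb->active)
		return -EBUSY;

	if (new_extra_isize > inode->extra_isize)
		inode->extra_isize = new_extra_isize;
	return 0;
}
```

The point of the ordering is that the guard sits before any work that
could release an ea_inode, so the skipped inode is left exactly as it
was.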

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..be1e3eaa4f23 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6479,6 +6479,16 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
 
+	/*
+	 * Skip expansion during mount (!SB_ACTIVE).  Expanding extra isize
+	 * may move xattrs to external blocks and release ea_inodes via iput.
+	 * When !SB_ACTIVE, iput triggers write_inode_now() which acquires
+	 * s_writepages_rwsem, causing a deadlock with the caller's active
+	 * jbd2 handle (lock order: s_writepages_rwsem -> jbd2_handle).
+	 */
+	if (unlikely(!(inode->i_sb->s_flags & SB_ACTIVE)))
+		return -EBUSY;
+
 	/*
 	 * In nojournal mode, we can immediately attempt to expand
 	 * the inode.  When journaled, we first need to obtain extra
-- 
2.43.0



* [PATCH v5 3/3] ext4: defer iput() on ea_inodes to reduce lock holding scope
From: Yun Zhou @ 2026-06-15 11:53 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-1-yun.zhou@windriver.com>

ext4_xattr_block_set(), ext4_xattr_ibody_set() and ext4_xattr_set_entry()
call iput() on ea_inodes while xattr_sem and a jbd2 handle are held.
This creates an unnecessarily wide lock scope and, without the preceding
!SB_ACTIVE guard, could trigger write_inode_now() -> s_writepages_rwsem,
creating a circular lock dependency.

Structurally eliminate iput-under-xattr_sem by collecting ea_inodes into
a deferred-iput array and releasing them after locks are dropped:

1. Convert all iput(ea_inode) calls within ext4_xattr_set_entry(),
   ext4_xattr_ibody_set(), and ext4_xattr_block_set() to deferred iput
   via ext4_expand_inode_array().  Thread the ea_inode_array through
   ext4_xattr_set_handle(), ext4_xattr_move_to_block(), and
   ext4_expand_extra_isize_ea() as an output parameter.

2. Callers that own the journal handle (ext4_xattr_set,
   ext4_expand_extra_isize) free the array AFTER ext4_journal_stop().
   For ext4_try_to_expand_extra_isize (called from __ext4_mark_inode_dirty
   with the caller's handle), free after xattr_sem release -- safe
   because the preceding patches block the !SB_ACTIVE path.

3. Callers that cannot control the handle lifetime (ext4_initxattrs,
   __ext4_set_acl, ext4_set_context, inline data ops) pass NULL.
   ext4_expand_inode_array() falls back to synchronous iput() when
   ea_inode_array is NULL -- safe because these callers only run with
   SB_ACTIVE where iput cannot trigger write_inode_now().

4. Use __GFP_NOFAIL in ext4_expand_inode_array() to guarantee the
   deferred-iput array never fails to grow, eliminating any fallback
   to synchronous iput() under locks.

5. Pass ea_inode_array directly to ext4_xattr_release_block() in
   ext4_xattr_block_set() instead of using a local array freed
   synchronously under xattr_sem.
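
The collect-then-release pattern in the steps above can be sketched in
plain userspace C.  This is an illustration under assumptions, not the
kernel implementation: struct deferred_array, defer_release() and
release_item() are hypothetical names, release_item() stands in for
iput(), and abort() on allocation failure is only a crude model of
__GFP_NOFAIL.  The 15 + n * 16 growth follows the comments in the diff
below.

```c
#include <assert.h>
#include <stdlib.h>

#define SLOT_INCR 16			/* must be a power of two */
#define SLOT_MASK (SLOT_INCR - 1)	/* first allocation holds 15 items */

struct deferred_array {
	unsigned int count;
	void *items[];			/* flexible array member */
};

/* Hypothetical release hook standing in for iput(); counts its calls. */
static unsigned int released;

static void release_item(void *item)
{
	(void)item;
	released++;
}

/*
 * Queue 'item' for later release.  A NULL item is a no-op.  A NULL
 * 'arr' means the caller opted out of deferral, so release right away.
 */
static void defer_release(struct deferred_array **arr, void *item)
{
	if (!item)
		return;
	if (!arr) {
		release_item(item);
		return;
	}
	if (*arr == NULL) {
		*arr = malloc(sizeof(**arr) + SLOT_MASK * sizeof(void *));
		if (!*arr)
			abort();	/* models "allocation cannot fail" */
		(*arr)->count = 0;
	} else if (((*arr)->count & SLOT_MASK) == SLOT_MASK) {
		/* All 15 + n * 16 slots are full: grow by 16 more. */
		size_t slots = (size_t)(*arr)->count + SLOT_INCR;
		struct deferred_array *bigger;

		bigger = realloc(*arr, sizeof(**arr) + slots * sizeof(void *));
		if (!bigger)
			abort();
		*arr = bigger;
	}
	(*arr)->items[(*arr)->count++] = item;
}

/* Release everything that was queued; call only after locks are dropped. */
static void deferred_array_free(struct deferred_array *arr)
{
	if (!arr)
		return;
	for (unsigned int i = 0; i < arr->count; i++)
		release_item(arr->items[i]);
	free(arr);
}
```

A caller would pass &array while holding its locks, drop the locks, and
only then call deferred_array_free(array), so that no release runs
inside the locked region.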

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 +-
 fs/ext4/inline.c         |  8 ++--
 fs/ext4/inode.c          | 16 +++++--
 fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
 fs/ext4/xattr.h          | 10 +++--
 fs/ext4/xattr_security.c |  3 +-
 7 files changed, 85 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 	}
 
 	error = ext4_xattr_set_handle(handle, inode, name_index, "",
-				      value, size, xattr_flags);
+				      value, size, xattr_flags, NULL);
 
 	kfree(value);
 	if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		res = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_ENCRYPTION,
 					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-					    ctx, len, XATTR_CREATE);
+					    ctx, len, XATTR_CREATE, NULL);
 		if (!res) {
 			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 			ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 
 	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
 				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-				    ctx, len, 0);
+				    ctx, len, 0, NULL);
 	if (!res) {
 		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 		/*
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..2bf2b771929d 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -309,7 +309,7 @@ static int ext4_create_inline_data(handle_t *handle,
 		goto out;
 	}
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error) {
 		if (error == -ENOSPC)
 			ext4_clear_inode_state(inode,
@@ -386,7 +386,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
 	i.value = value;
 	i.value_len = len;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -469,7 +469,7 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
 	if (error)
 		goto out;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -1917,7 +1917,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 			i.value = value;
 			i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
 					i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
-			err = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			err = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 			if (err)
 				goto out_error;
 		}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 60c91c098fa0..a1b0f47f8f4f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6409,7 +6409,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6454,7 +6455,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6476,6 +6477,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6507,8 +6509,11 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	/* Safe with caller's handle active: !SB_ACTIVE is blocked above */
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6520,6 +6525,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6545,7 +6551,8 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6554,6 +6561,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..4eb83917e6b4 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1630,7 +1630,8 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 				struct ext4_xattr_search *s,
 				handle_t *handle, struct inode *inode,
 				struct inode *new_ea_inode,
-				bool is_block)
+				bool is_block,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_entry *last, *next;
 	struct ext4_xattr_entry *here = s->here;
@@ -1848,7 +1849,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 
 	ret = 0;
 out:
-	iput(old_ea_inode);
+	ext4_expand_inode_array(ea_inode_array, old_ea_inode);
 	return ret;
 }
 
@@ -1898,7 +1899,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -1961,7 +1963,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			}
 			ea_bdebug(bs->bh, "modifying in-place");
 			error = ext4_xattr_set_entry(i, s, handle, inode,
-					     ea_inode, true /* is_block */);
+					     ea_inode, true /* is_block */,
+					     ea_inode_array);
 			ext4_xattr_block_csum_set(inode, bs->bh);
 			unlock_buffer(bs->bh);
 			if (error == -EFSCORRUPTED)
@@ -2030,7 +2033,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	}
 
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     true /* is_block */);
+				     true /* is_block */, ea_inode_array);
 	if (error == -EFSCORRUPTED)
 		goto bad_block;
 	if (error)
@@ -2150,7 +2153,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_expand_inode_array(ea_inode_array,
+							ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2182,12 +2186,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
 	}
 	error = 0;
 
@@ -2203,7 +2204,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_expand_inode_array(ea_inode_array, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2253,7 +2254,8 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
 
 int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is)
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_xattr_search *s = &is->s;
@@ -2273,7 +2275,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 			return PTR_ERR(ea_inode);
 	}
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     false /* is_block */);
+				     false /* is_block */, ea_inode_array);
 	if (error) {
 		if (ea_inode) {
 			int error2;
@@ -2285,7 +2287,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
-			iput(ea_inode);
+			ext4_expand_inode_array(ea_inode_array, ea_inode);
 		}
 		return error;
 	}
@@ -2297,7 +2299,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 		header->h_magic = cpu_to_le32(0);
 		ext4_clear_inode_state(inode, EXT4_STATE_XATTR);
 	}
-	iput(ea_inode);
+	ext4_expand_inode_array(ea_inode_array, ea_inode);
 	return 0;
 }
 
@@ -2348,7 +2350,8 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
 int
 ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		      const char *name, const void *value, size_t value_len,
-		      int flags)
+		      int flags,
+		      struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_info i = {
 		.name_index = name_index,
@@ -2428,9 +2431,11 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 
 	if (!value) {
 		if (!is.s.not_found)
-			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+						     ea_inode_array);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2444,10 +2449,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 			EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
 			i.in_inode = 1;
 retry_inode:
-		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+		error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+					     ea_inode_array);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2456,11 +2463,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
-							     &is);
+							     &is, ea_inode_array);
 			} else if (error == -ENOSPC) {
 				/*
 				 * Xattr does not fit in the block, store at
@@ -2539,6 +2547,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 {
 	handle_t *handle;
 	struct super_block *sb = inode->i_sb;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int error, retries = 0;
 	int credits;
 
@@ -2559,10 +2568,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		int error2;
 
 		error = ext4_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, flags,
+					      &ea_inode_array);
 		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
 					handle);
 		error2 = ext4_journal_stop(handle);
+		ext4_xattr_inode_array_free(ea_inode_array);
+		ea_inode_array = NULL;
 		if (error == -ENOSPC &&
 		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
@@ -2604,7 +2616,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2668,14 +2681,14 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
 	/* Remove the chosen entry from the inode */
 	i.value = NULL;
 	i.value_len = 0;
-	error = ext4_xattr_ibody_set(handle, inode, &i, is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, is, ea_inode_array);
 
 out:
 	kfree(b_entry_name);
@@ -2694,7 +2707,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2744,7 +2758,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2761,7 +2775,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2833,7 +2848,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
@@ -2869,19 +2884,27 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 /* Add the large xattr @inode into @ea_inode_array for deferred iput().
  * If @ea_inode_array is new or full it will be grown and the old
  * contents copied over.
+ *
+ * If @inode is NULL this is a no-op.  If @ea_inode_array is NULL the
+ * caller guarantees SB_ACTIVE so synchronous iput is safe.
  */
 static int
 ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 			struct inode *inode)
 {
+	if (!inode)
+		return 0;
+	if (!ea_inode_array) {
+		iput(inode);
+		return 0;
+	}
 	if (*ea_inode_array == NULL) {
 		/*
 		 * Start with 15 inodes, so it fits into a power-of-two size.
 		 */
 		(*ea_inode_array) = kmalloc_flex(**ea_inode_array, inodes,
-						 EIA_MASK, GFP_NOFS);
-		if (*ea_inode_array == NULL)
-			return -ENOMEM;
+						 EIA_MASK,
+						 GFP_NOFS | __GFP_NOFAIL);
 		(*ea_inode_array)->count = 0;
 	} else if (((*ea_inode_array)->count & EIA_MASK) == EIA_MASK) {
 		/* expand the array once all 15 + n * 16 slots are full */
@@ -2889,9 +2912,7 @@ ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 
 		new_array = kmalloc_flex(**ea_inode_array, inodes,
 					 (*ea_inode_array)->count + EIA_INCR,
-					 GFP_NOFS);
-		if (new_array == NULL)
-			return -ENOMEM;
+					 GFP_NOFS | __GFP_NOFAIL);
 		memcpy(new_array, *ea_inode_array,
 		       struct_size(*ea_inode_array, inodes,
 				   (*ea_inode_array)->count));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..6771d00d3fa4 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,9 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
 
 extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+		const void *, size_t, int,
+		struct ext4_xattr_inode_array **ea_inode_array);
 extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
 				  bool is_create, int *credits);
 extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +194,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
@@ -204,7 +207,8 @@ extern int ext4_xattr_ibody_get(struct inode *inode, int name_index,
 				void *buffer, size_t buffer_size);
 extern int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is);
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array);
 
 extern struct mb_cache *ext4_xattr_create_cache(void);
 extern void ext4_xattr_destroy_cache(struct mb_cache *);
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
 		err = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_SECURITY,
 					    xattr->name, xattr->value,
-					    xattr->value_len, XATTR_CREATE);
+					    xattr->value_len, XATTR_CREATE,
+					    NULL);
 		if (err < 0)
 			break;
 	}
-- 
2.43.0



* [PATCH v5 2/3] ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
From: Yun Zhou @ 2026-06-15 11:53 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-1-yun.zhou@windriver.com>

An inode being evicted will never need its extra isize expanded.  Set
EXT4_STATE_NO_EXPAND before ext4_mark_inode_dirty() in ext4_evict_inode()
to make this explicit and prevent any unnecessary work in
ext4_try_to_expand_extra_isize().

This also provides defense-in-depth for the s_writepages_rwsem deadlock
during mount-time orphan cleanup, ensuring the expand path is blocked
for inodes under eviction regardless of how they are reached.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be1e3eaa4f23..60c91c098fa0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,7 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
-- 
2.43.0


* [PATCH v3 2/2] ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary
From: Bohdan Trach @ 2026-06-15 10:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel
In-Reply-To: <20260615100331.163997-1-bohdan.trach@huaweicloud.com>

Fetching the ext4_group_desc structure adds needless cost to
ext4_mb_prefetch(), as most groups fail the
!EXT4_MB_GRP_TEST_AND_SET_READ check.

Optimize ext4_mb_prefetch() by fetching the group descriptor only when
necessary.

The result is a further increase in performance of the fallocate() system
call path that triggers ext4_mb_prefetch() via a linear group scan.
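
As a rough userspace illustration of the reordering described above (not
the kernel code itself), the sketch below runs the cheap per-group checks
first and performs the costly descriptor lookup only for groups that pass
them. All names here (group_info, group_desc, get_group_desc,
prefetch_scan, desc_lookups) are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

struct group_info { bool read_done; bool need_init; };  /* hypothetical */
struct group_desc { unsigned int free_clusters; };      /* hypothetical */

static unsigned long desc_lookups;  /* counts the "expensive" lookups */

/* Stand-in for the costly descriptor lookup. */
static struct group_desc *get_group_desc(struct group_desc *table, size_t g)
{
	desc_lookups++;
	return &table[g];
}

/* Returns how many groups would have their bitmap prefetched. */
static size_t prefetch_scan(struct group_info *info,
			    struct group_desc *table, size_t ngroups)
{
	size_t hits = 0;

	for (size_t g = 0; g < ngroups; g++) {
		struct group_info *grp = &info[g];

		/* Cheap checks first: most groups stop here. */
		if (grp->read_done)
			continue;
		grp->read_done = true;
		if (!grp->need_init)
			continue;

		/* Only now pay for the descriptor lookup. */
		struct group_desc *gdp = get_group_desc(table, g);

		if (gdp->free_clusters > 0)
			hits++;
	}
	return hits;
}
```

In this sketch the lookup counter grows only for groups that survive both
cheap checks, which is the effect the patch aims for.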

Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ed1bd00e11cd..06171a11db12 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2861,8 +2861,6 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 
 	blk_start_plug(&plug);
 	while (nr-- > 0) {
-		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
-								  NULL);
 		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 
 		/*
@@ -2872,14 +2870,17 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 		 * prefetch once, so we avoid getblk() call, which can
 		 * be expensive.
 		 */
-		if (gdp && grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
-		    EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 ) {
-			bh = ext4_read_block_bitmap_nowait(sb, group, true);
-			if (!IS_ERR_OR_NULL(bh)) {
-				if (!buffer_uptodate(bh) && cnt)
-					(*cnt)++;
-				brelse(bh);
+		if (grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
+		    EXT4_MB_GRP_NEED_INIT(grp)) {
+			struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
+
+			if (gdp && ext4_free_group_clusters(sb, gdp) > 0) {
+				bh = ext4_read_block_bitmap_nowait(sb, group, true);
+				if (!IS_ERR_OR_NULL(bh)) {
+					if (!buffer_uptodate(bh) && cnt)
+						(*cnt)++;
+					brelse(bh);
+				}
 			}
 		}
 		if (++group >= ngroups)
-- 
2.43.0



* [PATCH v3 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
From: Bohdan Trach @ 2026-06-15 10:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel
In-Reply-To: <20260615100331.163997-1-bohdan.trach@huaweicloud.com>

EXT4_MB_GRP_TEST_AND_SET_READ uses the test_and_set_bit() function, which
issues an atomic write. This can cause high overhead due to cache
contention when multiple threads iterate over groups in a tight loop,
as is the case for ext4_mb_prefetch(). We have seen this to be a
problem on Kunpeng 920b CPUs, which use a single ARM LSE instruction
for this purpose.

Avoid this unconditional atomic write by testing the bit first without
changing its value. This is OK for this use case as this bit is never
unset.

This change significantly reduces the cost of fallocate() operations that
trigger linear group scans on large multicore machines where
test_and_set_bit() issues an atomic write operation unconditionally.
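
For readers outside the kernel, the same test-before-test-and-set idea can
be sketched in userspace with C11 atomics. This is only an analogue under
assumptions: READ_BIT and flag_test_and_set_read() are hypothetical names,
and the relaxed load stands in for the kernel's unordered test_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define READ_BIT 0x1UL  /* hypothetical flag, stands in for the READ bit */

/*
 * Returns true if the bit was already set, false if this call set it.
 * The plain load leaves the cache line shared when the bit is already
 * set; the atomic read-modify-write runs only when the bit looks clear.
 * This is valid only because the bit is never cleared once set.
 */
static bool flag_test_and_set_read(atomic_ulong *state)
{
	if (atomic_load_explicit(state, memory_order_relaxed) & READ_BIT)
		return true;
	return atomic_fetch_or(state, READ_BIT) & READ_BIT;
}
```

Two threads racing on a clear bit both fall through to the atomic
operation, and exactly one of them observes the old value as clear, so the
"only one caller wins" property of a plain test-and-set is preserved.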

Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ddc903738c6b..7cb2f86296c8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3639,7 +3639,13 @@ struct ext4_group_info {
 #define EXT4_MB_GRP_CLEAR_TRIMMED(grp)	\
 	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
 #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
-	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
+	(ext4_mb_grp_test_and_set_read((grp)))
+
+static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
+{
+	return (test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) ||
+		test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state));
+}

 #define EXT4_MAX_CONTENTION		8
 #define EXT4_CONTENTION_THRESHOLD	2
-- 
2.43.0


* [PATCH v3 0/2] ext4: optimize ext4_mb_prefetch
From: Bohdan Trach @ 2026-06-15 10:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel

v3:
  Rebase to latest ext4/dev.
  Added R-b by Jan Kara for patch 1/2.
v2:
  Fix issues found by Jan Kara, added R-b for patch 2/2.
  Extend commit message of patch 1/2 a bit.
v1:
https://lore.kernel.org/linux-ext4/20260521125931.16474-1-bohdan.trach@huaweicloud.com/

Original cover letter below:

Dear Ted,

We have been profiling scalability of some rocksdb-related workloads on
ext4 file system and have found a case where significant time ends up
being spent in ext4_mb_prefetch() function. This happens because
ext4_mb_scan_groups_linear() path is triggered in ext4_mb_scan_groups().
We have noticed that on larger, filled disks, this function can take
lots of time.

We have added a test for this issue to our fork of will-it-scale [1],
which you can use to reproduce the issue (the actual workload does a few
writes after fallocate; they have been dropped to better illustrate the
issue).
[1] https://github.com/open-s4c/will-it-scale/blob/master/tests/fallocate3.c

In this series, we optimize this code path:
Patch 1: change EXT4_MB_GRP_TEST_AND_SET_READ() to reduce the rate of
         atomic RMW operation via test_and_set_bit, which has quite
         high cost on large multicore CPUs, especially under
         contention for the group's flag cache lines.
         As this bit is only ever set, but never unset, it should be
         possible to reduce the cost of this check by calling
         test_bit[_acquire]() first.
Patch 2: restructure the ext4_mb_prefetch loop operations such that
         ext4_group_desc is fetched only after the checks based on
         ext4_group_info succeed.

This series has been tested with
        kvm-xfstests -c ext4/all -g auto
and did not introduce any new issues.

Performance test: we used the will-it-scale drop-in test referenced
above and ran it on three machines:
- Kunpeng 920 (arm64, 96 CPUs * 1 socket, 128G RAM, SAS HDD: Seagate
  Exos 10E2400 1.2TB)
- Kunpeng 920b (arm64, 80 CPUs * 2 sockets, 502G RAM, SATA SSD: Huawei
  ES3000 V6 0.96TB)
- AMD 9654 (x86_64, 96 CPUs * 2 sockets, 1.5T RAM, NVME SSD: Samsung SSD
  970 EVO Plus 1TB)
We have performed tests with existing file systems, as well as more limited
tests with fixed-size file systems.

Benchmark on an existing file system for Kunpeng 920 (842G FS, 31% space
used) with the patch based on kernel 7.0.6:
| thr. | base | patched |      improv. |
|      | perf |    perf |          (%) |
|------|------|---------|--------------|
|    1 | 1286 |    1608 |  +25.0388802 |
|    2 | 1673 |    1680 |   +0.4184100 |
|    4 | 1698 |    1712 |   +0.8244994 |
|    8 | 1721 |    1730 |   +0.5229518 |
|   16 | 1739 |    2313 |  +33.0074756 |
|   32 | 1742 |    3571 | +104.9942595 |
|   64 | 1735 |    3427 |  +97.5216138 |
|   96 | 1688 |    1814 |   +7.4644550 |

Benchmark on an existing file system for Kunpeng 920b (802G ext4 FS, 68%
space used) with the patch based on kernel 6.6:
| thr. | base | patched |  improv. |
|      | perf |    perf |          |
|------|------|---------|----------|
|    1 | 1613 |   1625  |   +0.74% |
|    2 | 1620 |   2603  |  +60.67% |
|    4 | 1624 |   4894  | +201.35% |
|    8 | 2505 |   8328  | +232.45% |
|   16 | 4736 |  11632  | +145.60% |
|   32 | 7784 |  13124  |  +68.60% |
|   64 | 8094 |   8636  |   +6.69% |
|  128 | 6914 |   7890  |  +14.11% |

Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
used), kernel 7.1-rc3. This shows the performance impact on a mostly
free file system.
| thr. |  base | patched |    improv. |
|      |  perf |    perf |        (%) |
|------|-------|---------|------------|
|    1 | 30901 |   31191 | +0.9384810 |
|    2 | 50874 |   50504 | -0.7272870 |
|    4 | 66068 |   64108 | -2.9666404 |
|    8 | 63963 |   61927 | -3.1830902 |
|   16 | 47809 |   47044 | -1.6001171 |
|   32 | 42441 |   42326 | -0.2709644 |
|   64 | 39773 |   39929 | +0.3922259 |
|  128 | 37065 |   36413 | -1.7590719 |

We have also performed the test with kernel 6.6 on both Kunpeng 920b and
AMD 9654 with a much smaller FS image (133G) to have a more controlled
benchmarking environment, although this also reduces the measured
benefits compared to a bigger FS with more groups to iterate over:

AMD 9654 performance:
| thr. |  base | patched |  improv. |
|      |  perf |    perf |          |
|------|----------------------------|
| 25% full file system:             |
|------|----------------------------|
|    1 |  5964 |    6778 |  +13.64% |
|    2 | 11811 |   13415 |  +13.58% |
|    4 | 20111 |   23570 |  +17.19% |
|    8 | 30083 |   36296 |  +20.65% |
|   16 | 27781 |   38302 |  +37.87% |
|   32 | 28325 |   36930 |  +30.37% |
|   64 | 26044 |   29952 |  +15.00% |
|  128 | 19969 |   20882 |   +4.57% |
|------|----------------------------|
| 50% full file system:             |
|------|----------------------------|
|    1 |  4093 |    7380 |  +80.30% |
|    2 | 13168 |   13906 |   +5.60% |
|    4 | 21440 |   22623 |   +5.51% |
|    8 | 30523 |   32360 |   +6.01% |
|   16 | 27502 |   34017 |  +23.68% |
|   32 | 27189 |   32480 |  +19.46% |
|   64 | 24146 |   26463 |   +9.59% |
|  128 | 18386 |   18631 |   +1.33% |
|------|----------------------------|
| 75% full file system:             |
|------|----------------------------|
|    1 |  5738 |    7208 |  +25.61% |
|    2 | 13869 |   15309 |  +10.38% |
|    4 | 21803 |   23447 |   +7.54% |
|    8 | 29004 |   30766 |   +6.07% |
|   16 | 25542 |   30584 |  +19.74% |
|   32 | 24242 |   28631 |  +18.10% |
|   64 | 20631 |   22833 |  +10.67% |
|  128 | 14603 |   15086 |   +3.30% |

Kunpeng 920b performance:
| thr. |  base | patched | improv. |
|      |  perf |    perf |         |
|------|---------------------------|
| 25% full file system:            |
|------|---------------------------|
|    1 |  5398 |    7025 | +30.14% |
|    2 |  7451 |   12299 | +65.06% |
|    4 | 12574 |   20899 | +66.20% |
|    8 | 18645 |   27694 | +48.53% |
|   16 | 25088 |   31739 | +26.51% |
|   32 | 26699 |   27632 |  +3.49% |
|   64 | 14943 |   19547 | +30.81% |
|  128 | 13047 |   14544 | +11.47% |
|------|---------------------------|
| 50% full file system:            |
|------|---------------------------|
|    1 |  4881 |    6618 | +35.58% |
|    2 |  6544 |   11660 | +78.17% |
|    4 | 11156 |   19506 | +74.84% |
|    8 | 16842 |   25835 | +53.39% |
|   16 | 23305 |   29260 | +25.55% |
|   32 | 24622 |   25303 |  +2.76% |
|   64 | 13814 |   17707 | +28.18% |
|  128 | 12061 |   13180 |  +9.27% |
|------|---------------------------|
| 75% full file system:            |
|------|---------------------------|
|    1 |  7037 |   10580 | +50.34% |
|    2 |  9216 |    9075 |  -1.52% |
|    4 | 14534 |   22076 | +51.89% |
|    8 | 19341 |   25936 | +34.09% |
|   16 | 23592 |   27409 | +16.17% |
|   32 | 23680 |   23078 |  -2.54% |
|   64 | 12836 |   15902 | +23.88% |
|  128 |  9614 |   10341 |  +7.56% |

Thanks,
Bohdan.

Bohdan Trach (2):
  ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
  ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary

 fs/ext4/ext4.h    |  8 +++++++-
 fs/ext4/mballoc.c | 21 +++++++++++----------
 2 files changed, 18 insertions(+), 11 deletions(-)

-- 
2.43.0



* Re: [PATCH ext4 v3] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Jan Kara @ 2026-06-15  9:27 UTC (permalink / raw)
  To: Xiang Mei
  Cc: linux-ext4, Jan Kara, Theodore Ts'o, Andreas Dilger,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <20260613221836.556019-1-xmei5@asu.edu>

On Sat 13-06-26 15:18:36, Xiang Mei wrote:
> ext4_read_inline_dir() can read a dirent header past the end of its inline
> buffer, triggering a slab-out-of-bounds read during getdents64():
> 
>   BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
>   Read of size 2 at addr ffff88800f3dd23c by task exploit/148
>    ...
>    __ext4_check_dir_entry
>    ext4_read_inline_dir
>    iterate_dir
> 
> The dirent payload lives in a buffer of exactly inline_size bytes:
> 
> 	dir_buf = kmalloc(inline_size, GFP_NOFS);
> 
> but iteration runs in a position space extra_offset bytes larger
> (extra_size = extra_offset + inline_size) so the synthetic "." and ".."
> land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
> extra_offset", yet the loop bound (ctx->pos < extra_size) and the
> ext4_check_dir_entry() length argument both use the larger extra_size. A
> ctx->pos that is valid in extra_size space but whose de lies past 
> inline_size is therefore accepted, and the rescan loop's rec_len probe 
> and ext4_check_dir_entry() dereference de->rec_len before the entry is
> rejected.
> 
> Bound the dirent header by inline_size in both loops: break out of the
> rescan loop once a minimum-size header no longer fits, reject such a
> position in the main loop before forming de, and pass inline_size rather
> than extra_size to ext4_check_dir_entry().
> 
> Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Xiang Mei <xmei5@asu.edu>

Looks mostly good, just one simplification suggestion below:

> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..1266c8827cca 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
>  			/* for other entry, the real offset in
>  			 * the buf has to be tuned accordingly.
>  			 */
> +			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
> +			    inline_size)
> +				break;

Since 'i' lives in "dir logical offset space", it might be simpler to check
this as:
			if (i + ext4_dir_rec_len(1, NULL) > extra_size)


>  			de = (struct ext4_dir_entry_2 *)
>  				(dir_buf + i - extra_offset);
>  			/* It's too expensive to do a full
> @@ -1488,10 +1491,18 @@ int ext4_read_inline_dir(struct file *file,
>  			continue;
>  		}
>  
> +		/*
> +		 * de lives at dir_buf + ctx->pos - extra_offset, within the
> +		 * kmalloc(inline_size) buffer.  Make sure its header fits before
> +		 * ext4_check_dir_entry() dereferences de->rec_len.
> +		 */
> +		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> +		    inline_size)

Similarly here this could be checked as:

		if (ctx->pos + ext4_dir_rec_len(1, NULL) > extra_size)

which is both more consistent with the loop termination condition and saves
the conversion from logical offset to physical offset.
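
To spell out why the two forms agree: since extra_size = extra_offset +
inline_size, adding extra_offset to both sides of the buffer-space check
gives the logical-space check. A small standalone sketch, assuming signed
arithmetic and assuming ext4_dir_rec_len(1, NULL) evaluates to 12 (the
MIN_REC_LEN constant and both helper names are made up for illustration):

```c
#include <stdbool.h>

#define MIN_REC_LEN 12L  /* assumed value of ext4_dir_rec_len(1, NULL) */

/* Check in buffer ("physical") space, as written in the posted patch. */
static bool header_overruns_buf(long pos, long extra_offset, long inline_size)
{
	return pos - extra_offset + MIN_REC_LEN > inline_size;
}

/* Check in directory ("logical") space, as suggested in this review. */
static bool header_overruns_dir(long pos, long extra_offset, long inline_size)
{
	long extra_size = extra_offset + inline_size;

	return pos + MIN_REC_LEN > extra_size;
}
```

Both helpers return the same result for every position; the second one
simply skips the subtraction.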

Otherwise the patch looks good.

								Honza

> +			goto out;
>  		de = (struct ext4_dir_entry_2 *)
>  			(dir_buf + ctx->pos - extra_offset);
>  		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> -					 extra_size, ctx->pos))
> +					 inline_size, ctx->pos))
>  			goto out;
>  		if (le32_to_cpu(de->inode)) {
>  			if (!dir_emit(ctx, de->name, de->name_len,
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH v4] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-15  8:58 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem.  If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
    ext4_xattr_block_set()
      iput(ea_inode)
        write_inode_now()
          ext4_writepages()
            s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

Fix by deferring all ea_inode iput() calls until after locks are released:

1. Convert all iput(ea_inode) calls within ext4_xattr_set_entry(),
   ext4_xattr_ibody_set(), and ext4_xattr_block_set() to deferred iput
   via ext4_expand_inode_array().  Thread the ea_inode_array through
   ext4_xattr_set_handle(), ext4_xattr_move_to_block(), and
   ext4_expand_extra_isize_ea() as an output parameter.

2. Callers that own the journal handle (ext4_xattr_set,
   ext4_expand_extra_isize) free the array AFTER ext4_journal_stop().
   For ext4_try_to_expand_extra_isize (called from __ext4_mark_inode_dirty
   with the caller's handle), free after xattr_sem release -- safe
   because EXT4_STATE_NO_EXPAND blocks the !SB_ACTIVE path.

3. Callers that cannot control the handle lifetime (ext4_initxattrs,
   __ext4_set_acl, ext4_set_context, inline data ops) pass NULL.
   ext4_expand_inode_array() falls back to synchronous iput() when
   ea_inode_array is NULL -- safe because these callers only run with
   SB_ACTIVE where iput cannot trigger write_inode_now().

4. Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() before
   ext4_mark_inode_dirty() to block the only code path where !SB_ACTIVE
   (mount-time orphan cleanup) could reach ext4_try_to_expand_extra_isize.

5. Use __GFP_NOFAIL in ext4_expand_inode_array() to guarantee the
   deferred-iput array never fails to grow, eliminating any fallback
   to synchronous iput() under locks.

6. Pass ea_inode_array directly to ext4_xattr_release_block() in
   ext4_xattr_block_set() instead of using a local array freed
   synchronously under xattr_sem.
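
The deferral scheme in the steps above can be sketched in userspace as
follows. This is a simplified model, not the kernel implementation:
struct obj stands in for an ea_inode, struct deferred for
ext4_xattr_inode_array, the lock_held flag for xattr_sem, and all helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

struct obj { int refs; };          /* stand-in for an ea_inode */

struct deferred {                  /* stand-in for the inode array */
	size_t count, cap;
	struct obj **items;
};

static bool lock_held;             /* models xattr_sem in this sketch */
static int released_under_lock;    /* counts lock-ordering violations */

static void put_obj(struct obj *o)
{
	if (lock_held)
		released_under_lock++;
	o->refs--;
}

/* Queue o for a later put; with no array, fall back to an immediate put. */
static void defer_put(struct deferred *d, struct obj *o)
{
	if (!o)
		return;
	if (!d) {
		put_obj(o);
		return;
	}
	if (d->count == d->cap) {
		size_t cap = d->cap ? d->cap * 2 : 4;
		struct obj **items = realloc(d->items, cap * sizeof(*items));

		if (!items)
			abort();  /* sketch only; the patch relies on __GFP_NOFAIL */
		d->items = items;
		d->cap = cap;
	}
	d->items[d->count++] = o;
}

/* Drop every queued reference; call only after the lock is released. */
static void drain(struct deferred *d)
{
	for (size_t i = 0; i < d->count; i++)
		put_obj(d->items[i]);
	free(d->items);
	d->items = NULL;
	d->count = d->cap = 0;
}
```

A caller would queue references while lock_held is true, clear the flag
when it unlocks, and only then call drain(), so released_under_lock stays
at zero.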

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 +-
 fs/ext4/inline.c         |  8 ++--
 fs/ext4/inode.c          | 16 +++++--
 fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
 fs/ext4/xattr.h          | 10 +++--
 fs/ext4/xattr_security.c |  3 +-
 7 files changed, 85 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 	}
 
 	error = ext4_xattr_set_handle(handle, inode, name_index, "",
-				      value, size, xattr_flags);
+				      value, size, xattr_flags, NULL);
 
 	kfree(value);
 	if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		res = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_ENCRYPTION,
 					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-					    ctx, len, XATTR_CREATE);
+					    ctx, len, XATTR_CREATE, NULL);
 		if (!res) {
 			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 			ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 
 	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
 				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-				    ctx, len, 0);
+				    ctx, len, 0, NULL);
 	if (!res) {
 		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 		/*
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..2bf2b771929d 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -309,7 +309,7 @@ static int ext4_create_inline_data(handle_t *handle,
 		goto out;
 	}
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error) {
 		if (error == -ENOSPC)
 			ext4_clear_inode_state(inode,
@@ -386,7 +386,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
 	i.value = value;
 	i.value_len = len;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -469,7 +469,7 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
 	if (error)
 		goto out;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -1917,7 +1917,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 			i.value = value;
 			i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
 					i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
-			err = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			err = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 			if (err)
 				goto out_error;
 		}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..ebd501109985 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,7 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
@@ -6408,7 +6409,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6455,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6475,6 +6477,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6496,8 +6499,10 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6509,6 +6514,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6534,7 +6540,8 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6543,6 +6550,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 77c11e4096bb..5ba4ae16e386 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1651,7 +1651,8 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 				struct ext4_xattr_search *s,
 				handle_t *handle, struct inode *inode,
 				struct inode *new_ea_inode,
-				bool is_block)
+				bool is_block,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_entry *last, *next;
 	struct ext4_xattr_entry *here = s->here;
@@ -1869,7 +1870,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 
 	ret = 0;
 out:
-	iput(old_ea_inode);
+	ext4_expand_inode_array(ea_inode_array, old_ea_inode);
 	return ret;
 }
 
@@ -1919,7 +1920,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -1982,7 +1984,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			}
 			ea_bdebug(bs->bh, "modifying in-place");
 			error = ext4_xattr_set_entry(i, s, handle, inode,
-					     ea_inode, true /* is_block */);
+					     ea_inode, true /* is_block */,
+					     ea_inode_array);
 			ext4_xattr_block_csum_set(inode, bs->bh);
 			unlock_buffer(bs->bh);
 			if (error == -EFSCORRUPTED)
@@ -2051,7 +2054,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	}
 
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     true /* is_block */);
+				     true /* is_block */, ea_inode_array);
 	if (error == -EFSCORRUPTED)
 		goto bad_block;
 	if (error)
@@ -2171,7 +2174,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_expand_inode_array(ea_inode_array,
+							ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2203,12 +2207,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
 	}
 	error = 0;
 
@@ -2224,7 +2225,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_expand_inode_array(ea_inode_array, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2274,7 +2275,8 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
 
 int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is)
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_xattr_search *s = &is->s;
@@ -2294,7 +2296,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 			return PTR_ERR(ea_inode);
 	}
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     false /* is_block */);
+				     false /* is_block */, ea_inode_array);
 	if (error) {
 		if (ea_inode) {
 			int error2;
@@ -2306,7 +2308,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
-			iput(ea_inode);
+			ext4_expand_inode_array(ea_inode_array, ea_inode);
 		}
 		return error;
 	}
@@ -2318,7 +2320,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 		header->h_magic = cpu_to_le32(0);
 		ext4_clear_inode_state(inode, EXT4_STATE_XATTR);
 	}
-	iput(ea_inode);
+	ext4_expand_inode_array(ea_inode_array, ea_inode);
 	return 0;
 }
 
@@ -2369,7 +2371,8 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
 int
 ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		      const char *name, const void *value, size_t value_len,
-		      int flags)
+		      int flags,
+		      struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_info i = {
 		.name_index = name_index,
@@ -2449,9 +2452,11 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 
 	if (!value) {
 		if (!is.s.not_found)
-			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+						     ea_inode_array);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2465,10 +2470,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 			EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
 			i.in_inode = 1;
 retry_inode:
-		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+		error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+						     ea_inode_array);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2477,11 +2484,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
-							     &is);
+							     &is, ea_inode_array);
 			} else if (error == -ENOSPC) {
 				/*
 				 * Xattr does not fit in the block, store at
@@ -2560,6 +2568,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 {
 	handle_t *handle;
 	struct super_block *sb = inode->i_sb;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int error, retries = 0;
 	int credits;
 
@@ -2580,10 +2589,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		int error2;
 
 		error = ext4_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, flags,
+					      &ea_inode_array);
 		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
 					handle);
 		error2 = ext4_journal_stop(handle);
+		ext4_xattr_inode_array_free(ea_inode_array);
+		ea_inode_array = NULL;
 		if (error == -ENOSPC &&
 		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
@@ -2625,7 +2637,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2689,14 +2702,14 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
 	/* Remove the chosen entry from the inode */
 	i.value = NULL;
 	i.value_len = 0;
-	error = ext4_xattr_ibody_set(handle, inode, &i, is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, is, ea_inode_array);
 
 out:
 	kfree(b_entry_name);
@@ -2715,7 +2728,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2765,7 +2779,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2782,7 +2796,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2854,7 +2869,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
@@ -2890,19 +2905,27 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 /* Add the large xattr @inode into @ea_inode_array for deferred iput().
  * If @ea_inode_array is new or full it will be grown and the old
  * contents copied over.
+ *
+ * If @inode is NULL this is a no-op.  If @ea_inode_array is NULL the
+ * caller guarantees SB_ACTIVE so synchronous iput is safe.
  */
 static int
 ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 			struct inode *inode)
 {
+	if (!inode)
+		return 0;
+	if (!ea_inode_array) {
+		iput(inode);
+		return 0;
+	}
 	if (*ea_inode_array == NULL) {
 		/*
 		 * Start with 15 inodes, so it fits into a power-of-two size.
 		 */
 		(*ea_inode_array) = kmalloc_flex(**ea_inode_array, inodes,
-						 EIA_MASK, GFP_NOFS);
-		if (*ea_inode_array == NULL)
-			return -ENOMEM;
+						 EIA_MASK,
+						 GFP_NOFS | __GFP_NOFAIL);
 		(*ea_inode_array)->count = 0;
 	} else if (((*ea_inode_array)->count & EIA_MASK) == EIA_MASK) {
 		/* expand the array once all 15 + n * 16 slots are full */
@@ -2910,9 +2933,7 @@ ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 
 		new_array = kmalloc_flex(**ea_inode_array, inodes,
 					 (*ea_inode_array)->count + EIA_INCR,
-					 GFP_NOFS);
-		if (new_array == NULL)
-			return -ENOMEM;
+					 GFP_NOFS | __GFP_NOFAIL);
 		memcpy(new_array, *ea_inode_array,
 		       struct_size(*ea_inode_array, inodes,
 				   (*ea_inode_array)->count));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..6771d00d3fa4 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,9 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
 
 extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+		const void *, size_t, int,
+		struct ext4_xattr_inode_array **ea_inode_array);
 extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
 				  bool is_create, int *credits);
 extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +194,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
@@ -204,7 +207,8 @@ extern int ext4_xattr_ibody_get(struct inode *inode, int name_index,
 				void *buffer, size_t buffer_size);
 extern int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is);
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array);
 
 extern struct mb_cache *ext4_xattr_create_cache(void);
 extern void ext4_xattr_destroy_cache(struct mb_cache *);
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
 		err = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_SECURITY,
 					    xattr->name, xattr->value,
-					    xattr->value_len, XATTR_CREATE);
+					    xattr->value_len, XATTR_CREATE,
+					    NULL);
 		if (err < 0)
 			break;
 	}
-- 
2.43.0
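
The deferred-release pattern that this patch threads through the xattr
helpers can be sketched in userspace as follows.  This is only an
illustrative model, not the kernel code: deferred_array, deferred_add(),
deferred_release_all() and count_release() are hypothetical names, malloc()
stands in for the kernel allocator (so allocation failure is reported
rather than ruled out by __GFP_NOFAIL), and the values 16 and 15 for
EIA_INCR and EIA_MASK are inferred from the "15 + n * 16" comments in the
diff.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed from the patch comments: start with 15 slots, grow by 16. */
#define EIA_INCR 16
#define EIA_MASK (EIA_INCR - 1)

/* Hypothetical userspace stand-in for struct ext4_xattr_inode_array. */
struct deferred_array {
	unsigned int count;
	void *items[];	/* objects whose release is deferred */
};

/* Demo helper: counts how many objects have been released. */
static unsigned int released;
static void count_release(void *item)
{
	(void)item;
	released++;
}

/*
 * Queue @item for later release.  A NULL @item is a no-op; a NULL @arr
 * means the caller considers immediate release safe, so release now.
 * Returns 0 on success, -1 if malloc() fails.
 */
static int deferred_add(struct deferred_array **arr, void *item,
			void (*release)(void *))
{
	if (!item)
		return 0;
	if (!arr) {
		release(item);
		return 0;
	}
	if (*arr == NULL) {
		*arr = malloc(sizeof(**arr) + EIA_MASK * sizeof(void *));
		if (*arr == NULL)
			return -1;
		(*arr)->count = 0;
	} else if (((*arr)->count & EIA_MASK) == EIA_MASK) {
		/* All 15 + n * 16 slots are full: grow by EIA_INCR slots. */
		size_t old_size = sizeof(**arr) +
				  (*arr)->count * sizeof(void *);
		struct deferred_array *bigger;

		bigger = malloc(old_size + EIA_INCR * sizeof(void *));
		if (bigger == NULL)
			return -1;
		memcpy(bigger, *arr, old_size);
		free(*arr);
		*arr = bigger;
	}
	(*arr)->items[(*arr)->count++] = item;
	return 0;
}

/* Release every queued object, free the array, return how many were released. */
static unsigned int deferred_release_all(struct deferred_array *arr,
					 void (*release)(void *))
{
	unsigned int i, n;

	if (!arr)
		return 0;
	n = arr->count;
	for (i = 0; i < n; i++)
		release(arr->items[i]);
	free(arr);
	return n;
}
```

A caller would queue objects while its locks and handle are held, then call
deferred_release_all() once they have been dropped, mirroring how
ext4_xattr_set() calls ext4_xattr_inode_array_free() after
ext4_journal_stop() in the diff above.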



* [PATCH v3] ext4: drop s_writepages_rwsem around inline data handling in writepages
From: Yun Zhou @ 2026-06-15  6:10 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

ext4_do_writepages() calls ext4_destroy_inline_data() which acquires
xattr_sem while s_writepages_rwsem is held (read).  This creates a
circular lock dependency:

  CPU0                               CPU1
  ----                               ----
  ext4_writepages()
    ext4_writepages_down_read()
      [holds s_writepages_rwsem]
                                     ext4_evict_inode()
                                       __ext4_mark_inode_dirty()
                                         ext4_expand_extra_isize_ea()
                                           ext4_xattr_block_set()
                                             [holds xattr_sem]
                                             iput(old_bh inode)
                                               write_inode_now()
                                                 ext4_writepages()
                                                   ext4_writepages_down_read()
                                                   [BLOCKED on s_writepages_rwsem]
    ext4_do_writepages()
      ext4_destroy_inline_data()
        down_write(xattr_sem)
        [BLOCKED on xattr_sem]

Fix by temporarily dropping s_writepages_rwsem for the entire inline
data handling block, including the journal handle start/stop.  The
rwsem must be dropped before ext4_journal_start() -- not between
journal_start and journal_stop -- to avoid a secondary deadlock with
ext4_change_inode_journal_flag(), which takes the rwsem (write) and then
calls jbd2_journal_lock_updates() to wait for active handles to stop.

This is safe because:

 - This code runs before any block mapping or IO submission, so no
   writepages state depends on the rwsem being held at this point.

 - Inline data destruction is a one-way format transition (once cleared,
   EXT4_INODE_INLINE_DATA is never set again).  The rwsem is
   re-acquired after journal_stop, ensuring format stability for the
   remainder of writepages.

 - The can_map flag distinguishes the ext4_writepages() path (which holds
   the rwsem) from ext4_normal_submit_inode_data_buffers() (which does
   not), so the drop/reacquire is skipped when the rwsem is not held.

Also check the return value of ext4_destroy_inline_data() to avoid
proceeding with an inconsistent inode format on failure.

Reported-by: syzbot+bb2455d02bda0b5701e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Drop s_writepages_rwsem before ext4_journal_start() and reacquire
    after ext4_journal_stop(), instead of dropping between journal_start
    and journal_stop as in v2.  This avoids two issues identified in v2
    review:
    - memalloc_nofs_restore() in ext4_writepages_up_read() would clear
      PF_MEMALLOC_NOFS while the jbd2 handle is active.
    - Reacquiring s_writepages_rwsem while holding a handle creates an
      ABBA deadlock with ext4_change_inode_journal_flag() which takes
      the rwsem (write) then calls jbd2_journal_lock_updates().

v2: Instead of moving inline data handling to ext4_writepages(),
    temporarily drop s_writepages_rwsem around ext4_destroy_inline_data()
    in ext4_do_writepages(). The move approach had a race where concurrent
    writes could create dirty pages with inline data after the early check,
    and unconditional destruction without dirty pages would lose data.

v1: Moved inline data cleanup from ext4_do_writepages() to
    ext4_writepages() before acquiring s_writepages_rwsem.

 fs/ext4/inode.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..cd7588a3fa45 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1694,6 +1694,9 @@ struct mpage_da_data {
 	struct writeback_control *wbc;
 	unsigned int can_map:1;	/* Can writepages call map blocks? */
 
+	/* Saved memalloc context from ext4_writepages_down_read() */
+	int alloc_ctx;
+
 	/* These are internal state of ext4_do_writepages() */
 	loff_t start_pos;	/* The start pos to write */
 	loff_t next_pos;	/* Current pos to examine */
@@ -2816,16 +2819,35 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	 * we'd better clear the inline data here.
 	 */
 	if (ext4_has_inline_data(inode)) {
-		/* Just inode will be modified... */
+		/*
+		 * Temporarily drop s_writepages_rwsem because
+		 * ext4_destroy_inline_data() acquires xattr_sem, which has
+		 * a higher lock ordering rank.  Holding both would create a
+		 * circular dependency with ext4_xattr_block_set() -> iput()
+		 * -> ext4_writepages() -> s_writepages_rwsem.
+		 *
+		 * Drop the rwsem before starting the journal handle to also
+		 * avoid a deadlock with ext4_change_inode_journal_flag(),
+		 * which takes rwsem (write) then jbd2_journal_lock_updates().
+		 */
+		if (mpd->can_map)
+			ext4_writepages_up_read(inode->i_sb, mpd->alloc_ctx);
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
 		if (IS_ERR(handle)) {
+			if (mpd->can_map)
+				mpd->alloc_ctx =
+					ext4_writepages_down_read(inode->i_sb);
 			ret = PTR_ERR(handle);
 			goto out_writepages;
 		}
 		BUG_ON(ext4_test_inode_state(inode,
 				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
+		ret = ext4_destroy_inline_data(handle, inode);
 		ext4_journal_stop(handle);
+		if (mpd->can_map)
+			mpd->alloc_ctx = ext4_writepages_down_read(inode->i_sb);
+		if (ret)
+			goto out_writepages;
 	}
 
 	/*
@@ -3032,13 +3054,12 @@ static int ext4_writepages(struct address_space *mapping,
 		.can_map = 1,
 	};
 	int ret;
-	int alloc_ctx;
 
 	ret = ext4_emergency_state(sb);
 	if (unlikely(ret))
 		return ret;
 
-	alloc_ctx = ext4_writepages_down_read(sb);
+	mpd.alloc_ctx = ext4_writepages_down_read(sb);
 	ret = ext4_do_writepages(&mpd);
 	/*
 	 * For data=journal writeback we could have come across pages marked
@@ -3047,7 +3068,7 @@ static int ext4_writepages(struct address_space *mapping,
 	 */
 	if (!ret && mpd.journalled_more_data)
 		ret = ext4_do_writepages(&mpd);
-	ext4_writepages_up_read(sb, alloc_ctx);
+	ext4_writepages_up_read(sb, mpd.alloc_ctx);
 
 	return ret;
 }
-- 
2.43.0
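
The ordering rule behind the v3 change can be checked with a small
single-threaded model.  This is a toy sketch under stated assumptions, not
kernel code: every sketch_* name is hypothetical, the "locks" are plain
flags, and the model only counts two kinds of ordering violation that the
commit message describes (taking xattr_sem with the rwsem held, and
reacquiring the rwsem while a handle is active).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for the toy model; flags stand in for real locks. */
struct sketch_ctx {
	bool rwsem_held;	/* models s_writepages_rwsem (read) */
	bool handle_active;	/* models an active jbd2 handle */
	int rwsem_ops;		/* number of up/down operations performed */
	int violations;		/* ordering rules broken so far */
};

static void sketch_up_read(struct sketch_ctx *c)
{
	c->rwsem_held = false;
	c->rwsem_ops++;
}

static void sketch_down_read(struct sketch_ctx *c)
{
	/* Taking the rwsem under an active handle is the ABBA case. */
	if (c->handle_active)
		c->violations++;
	c->rwsem_held = true;
	c->rwsem_ops++;
}

static void sketch_journal_start(struct sketch_ctx *c)
{
	c->handle_active = true;
}

static void sketch_journal_stop(struct sketch_ctx *c)
{
	c->handle_active = false;
}

/* Models the xattr_sem acquisition inside inline data destruction. */
static int sketch_destroy_inline_data(struct sketch_ctx *c, int err)
{
	if (c->rwsem_held)
		c->violations++;
	return err;
}

/* Mirrors the patched sequence: drop, start, destroy, stop, reacquire. */
static int sketch_inline_data_step(struct sketch_ctx *c, bool can_map,
				   int destroy_err)
{
	int ret;

	if (can_map)
		sketch_up_read(c);
	sketch_journal_start(c);
	ret = sketch_destroy_inline_data(c, destroy_err);
	sketch_journal_stop(c);
	if (can_map)
		sketch_down_read(c);
	return ret;
}
```

Running sketch_inline_data_step() with can_map set on a context that starts
with the rwsem held should report zero violations and leave the rwsem held
again on return, including when the destroy step fails.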



* [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-15  5:35 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode state in ext4_xattr_inode_iget() to prevent
WARN_ONCE triggers in ext4_xattr_inode_update_ref() and reject
corrupted EA inodes before they can cause further damage.

When a corrupted ext4 image has an EA inode with inconsistent i_nlink
and ref_count values, the code currently allows it through and later
hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
While this is not a security or stability issue -- it only fires on
crafted filesystem images and merely prints a call trace -- it is
better handled as an early sanity check that returns -EFSCORRUPTED,
consistent with how ext4 treats other on-disk corruption.

Since ext4_xattr_inode_iget() resolves references from active xattr
entries, the target EA inode must be in active state (i_nlink=1,
ref_count>0).  Reject any inode that does not satisfy this.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v4:
 - Take I_MUTEX_XATTR before checking orphan state to safely decide
   whether to call make_bad_inode(), avoiding orphan list corruption
   if another thread is concurrently freeing the EA inode

v3:
 - Move check after Lustre branch to avoid false positives on Lustre EA inodes
 - Merge into single condition: i_nlink != 1 || !ref_count
 - Add make_bad_inode() before iput() to avoid truncation in active txn

v2:
 - Add ref_count validation to also catch i_nlink=1, ref_count=0 case

 fs/ext4/xattr.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..8efd6368f956 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 		inode_unlock(inode);
 	}
 
+	/*
+	 * Since this function resolves references from active xattr entries,
+	 * the EA inode must be in active state (i_nlink=1, ref_count>0).
+	 * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
+	 * (inconsistent with an active entry) all indicate corruption or
+	 * a concurrent last-reference drop.
+	 */
+	if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
+			   ea_ino, inode->i_nlink,
+			   ext4_xattr_inode_get_ref(inode));
+		/*
+		 * Mark rejected inode to prevent ext4_evict_inode() from
+		 * attempting truncation on a corrupted inode within an active
+		 * transaction, which could exhaust journal credits. The lock
+		 * serializes against ext4_xattr_inode_update_ref() which
+		 * does clear_nlink() + ext4_orphan_add() under the same lock.
+		 */
+		inode_lock_nested(inode, I_MUTEX_XATTR);
+		if (!ext4_inode_orphan_tracked(inode))
+			make_bad_inode(inode);
+		inode_unlock(inode);
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	*ea_inode = inode;
 	return 0;
 }
-- 
2.43.0
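
The merged condition from the hunk above (i_nlink != 1 || !ref_count) can
be exercised in isolation.  The sketch below is a hypothetical userspace
reduction: struct sketch_ea_inode and sketch_ea_inode_check() are invented
for illustration, and SKETCH_EFSCORRUPTED assumes the usual kernel mapping
of EFSCORRUPTED to EUCLEAN, which is 117 on common Linux architectures.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: EFSCORRUPTED == EUCLEAN == 117 on common architectures. */
#define SKETCH_EFSCORRUPTED 117

/* Hypothetical reduction of the two fields the check looks at. */
struct sketch_ea_inode {
	unsigned int i_nlink;	/* link count of the EA inode */
	uint64_t ref_count;	/* number of xattr entries referencing it */
};

/*
 * An EA inode reached from an active xattr entry must have exactly one
 * link and a non-zero reference count; anything else is rejected.
 */
static int sketch_ea_inode_check(const struct sketch_ea_inode *ea)
{
	if (ea->i_nlink != 1 || ea->ref_count == 0)
		return -SKETCH_EFSCORRUPTED;
	return 0;
}
```

Only the (i_nlink = 1, ref_count > 0) combination passes; a zero or
multiple link count, or a zero reference count, is reported as corruption.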



* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Zhang Yi @ 2026-06-15  3:28 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-3-libaokun@linux.alibaba.com>

On 6/12/2026 12:34 AM, Baokun Li wrote:
> For unaligned DIO writes, the previous ext4_overwrite_io() required the
> entire range to fall within a single written extent.  This was overly
> conservative: the DIO layer only performs partial block zeroing for the
> head and tail blocks when they are partially covered by the write.
> Middle blocks that are fully covered are written as whole blocks
> without any zeroing, so they are safe regardless of extent state.
> 
> Therefore exclusive lock is only required when partial block zeroing
> will actually happen:
>  - The head partial block (if any) lands on a hole or unwritten extent.
>  - The tail partial block (if any) lands on a hole or unwritten extent.
> 
> Middle full-cover blocks can be in any state (hole, unwritten, or
> written) - block allocation under shared lock is safe per the previous
> patch's analysis (inode_dio_begin + i_data_sem protection).
> 
> Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
> directly answers the question driving the lock decision.  It uses at
> most two ext4_map_blocks() calls: one for the head partial block (also
> catching the case where it spans through the tail), and one for the
> tail partial block if not already covered.
> 
> This enables shared lock for previously-rejected scenarios such as:
>  - Unaligned write spanning written extent + mid-range hole + written
>    extent at the tail.
>  - Unaligned write where the partial blocks land on written extents but
>    the middle has unwritten extents.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
> Each stripe is laid out as [written][unwritten][unwritten][written],
> so the head and tail partial blocks land on written extents but the
> middle is unwritten.  Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      15,547       17,381      1.12x
>      2      15,910       34,172      2.15x
>      4      15,014       57,567      3.83x
>      8      15,022       81,947      5.46x
>     16      14,586       99,126      6.80x
>     32      14,047       92,519      6.59x
> 
> Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

A nice conclusion and a solid speed-up.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 73 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6f3886465ce3..aa926e641739 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>  	return false;
>  }
>  
> -/* Is IO overwriting allocated or initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode,
> -			      loff_t pos, loff_t len, bool *unwritten)
> +/*
> + * Does an unaligned DIO write require partial block zeroing?
> + *
> + * Partial block zeroing is performed only for the head and tail blocks
> + * when they are partially covered by the write and the underlying extent
> + * is a hole or unwritten. Middle blocks (fully covered by the write)
> + * are written as whole blocks without zeroing.
> + *
> + * When zeroing is required, two concurrent unaligned DIO writes to the
> + * same partial block can race and corrupt each other's data, so the
> + * caller must take the exclusive i_rwsem and drain in-flight DIO. When
> + * zeroing is not required, shared lock is safe -- block allocation and
> + * unwritten conversion for middle blocks are protected by i_data_sem
> + * and inode_dio_begin().
> + */
> +static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
>  {
>  	struct ext4_map_blocks map;
>  	unsigned int blkbits = inode->i_blkbits;
> -	int err, blklen;
> +	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
> +	bool head_partial, tail_partial;
> +	ext4_lblk_t head_lblk, tail_lblk;
> +	int err;
>  
>  	if (pos + len > i_size_read(inode))
> -		return false;
> +		return true;
>  
> -	map.m_lblk = pos >> blkbits;
> -	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
> -	blklen = map.m_len;
> +	head_partial = (pos & blockmask) != 0;
> +	tail_partial = ((pos + len) & blockmask) != 0;
> +	head_lblk = pos >> blkbits;
> +	tail_lblk = (pos + len - 1) >> blkbits;
> +
> +	/* Check the head partial block. */
> +	if (head_partial) {
> +		map.m_lblk = head_lblk;
> +		map.m_len = tail_lblk - head_lblk + 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +		/* If this mapping already covers the tail block, we're done. */
> +		if (!tail_partial || map.m_lblk + err > tail_lblk)
> +			return false;
> +	}
>  
> -	err = ext4_map_blocks(NULL, inode, &map, 0);
> -	if (err != blklen)
> -		return false;
> -	/*
> -	 * 'err==len' means that all of the blocks have been preallocated,
> -	 * regardless of whether they have been initialized or not. We need to
> -	 * check m_flags to distinguish the unwritten extents.
> -	 */
> -	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
> -	return true;
> +	/* Check the tail partial block. */
> +	if (tail_partial) {
> +		map.m_lblk = tail_lblk;
> +		map.m_len = 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +	}
> +
> +	return false;
>  }
>  
>  static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
> @@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   *    i_data_sem serializes concurrent extent tree modifications.
>   *
>   * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> - *    only safe for pure written-extent overwrites. Unwritten extents or
> - *    holes require exclusive lock because concurrent partial block zeroing
> - *    in the DIO layer could corrupt data.
> + *    safe unless the DIO layer needs to perform partial block zeroing --
> + *    i.e. the head or tail partial block sits on a hole or unwritten
> + *    extent. In that case upgrade to the exclusive lock and drain
> + *    in-flight DIO to avoid races with concurrent partial block zeroing.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite = true, unaligned_io, unwritten = false;
> +	bool needs_zeroing = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	offset = iocb->ki_pos;
>  	count = ret;
>  
> -	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
>  
>  	/*
> -	 * For unaligned writes we need to know the extent state to determine
> -	 * whether shared lock is safe. For aligned writes we skip this check
> -	 * entirely since allocation under shared lock is safe.
> +	 * For unaligned writes, check whether partial block zeroing will be
> +	 * needed. If so, exclusive lock is required to serialize against
> +	 * concurrent DIO that could race with the zeroing.
> +	 *
> +	 * For aligned writes we skip this check entirely since allocation
> +	 * under shared lock is safe.
>  	 */
> -	if (unaligned_io)
> -		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +	if (ext4_unaligned_io(inode, from, offset))
> +		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
>  
>  	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend ||
> -	     (unaligned_io && (!overwrite || unwritten))))) {
> +	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	/*
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
> -	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is unaligned non-overwrite or extending, so drain all
> -	 * outstanding dio and set the force wait dio flag.
> +	 * behavior already. When holding the exclusive lock for a write that
> +	 * needs partial block zeroing or is extending the file, we must wait
> +	 * for the I/O to complete synchronously:
> +	 *
> +	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
> +	 *    our partial block zeroing, and force synchronous completion so we
> +	 *    don't leave in-flight zeroing bios for the next writer to drain.
> +	 *
> +	 *  - extend: the caller must update i_disksize after I/O completion,
> +	 *    which requires the data to be on disk first.
>  	 */
> -	if (!*ilock_shared && (unaligned_io || *extend)) {
> +	if (!*ilock_shared && (needs_zeroing || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
>  		}
> -		if (unaligned_io && (!overwrite || unwritten))
> +		if (needs_zeroing)
>  			inode_dio_wait(inode);
>  		*dio_flags = IOMAP_DIO_FORCE_WAIT;
>  	}
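
For readers following the head/tail arithmetic, the decision logic of
ext4_dio_needs_zeroing() can be modelled in userspace.  This is a sketch
under assumptions, not the kernel implementation: the blk_state enum and
sketch_map_blocks() are hypothetical stand-ins for the extent lookup (a hole
returns 0, an unwritten run returns its length without the mapped flag), and
len is assumed to be non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block extent states used only by this sketch. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/*
 * Toy lookup: returns the length of the run of blocks that share the
 * state of @lblk (capped at @len), or 0 for a hole.  *mapped is set only
 * when the run is written.
 */
static int sketch_map_blocks(const enum blk_state *blocks, uint32_t nblocks,
			     uint32_t lblk, uint32_t len, bool *mapped)
{
	uint32_t n = 0;

	*mapped = false;
	if (lblk >= nblocks || blocks[lblk] == BLK_HOLE)
		return 0;
	while (n < len && lblk + n < nblocks &&
	       blocks[lblk + n] == blocks[lblk])
		n++;
	*mapped = (blocks[lblk] == BLK_WRITTEN);
	return (int)n;
}

/* Returns true when the head or tail partial block would need zeroing. */
static bool sketch_dio_needs_zeroing(const enum blk_state *blocks,
				     uint32_t nblocks, unsigned int blkbits,
				     uint64_t i_size, uint64_t pos,
				     uint64_t len)
{
	uint64_t blockmask = ((uint64_t)1 << blkbits) - 1;
	bool head_partial = (pos & blockmask) != 0;
	bool tail_partial = ((pos + len) & blockmask) != 0;
	uint32_t head_lblk = (uint32_t)(pos >> blkbits);
	uint32_t tail_lblk = (uint32_t)((pos + len - 1) >> blkbits);
	bool mapped;
	int ret;

	/* Writes past i_size are treated conservatively. */
	if (pos + len > i_size)
		return true;

	if (head_partial) {
		ret = sketch_map_blocks(blocks, nblocks, head_lblk,
					tail_lblk - head_lblk + 1, &mapped);
		if (ret <= 0 || !mapped)
			return true;
		/* The head lookup may already cover the tail block. */
		if (!tail_partial || head_lblk + (uint32_t)ret > tail_lblk)
			return false;
	}

	if (tail_partial) {
		ret = sketch_map_blocks(blocks, nblocks, tail_lblk, 1,
					&mapped);
		if (ret <= 0 || !mapped)
			return true;
	}
	return false;
}
```

With 4K blocks and the benchmark layout [written][unwritten][unwritten]
[written], a 14336-byte write at offset 512 has its partial head in block 0
and its partial tail in block 3, both written, so the model reports that no
zeroing is needed and the shared lock would be kept.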



* Re: [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Zhang Yi @ 2026-06-15  3:24 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-2-libaokun@linux.alibaba.com>

On 6/12/2026 12:34 AM, Baokun Li wrote:
> Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
> determine if a write is a pure overwrite, and upgrades to exclusive
> i_rwsem if not. However, ext4_overwrite_io() uses a single
> ext4_map_blocks() call which only returns the first contiguous extent of
> the same type. A write spanning multiple pre-allocated extents (e.g.
> written + unwritten, or two physically discontiguous written extents)
> produces a false negative, forcing an unnecessary exclusive lock upgrade.
> 
> After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
> writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
> useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
> accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
> journal transaction. The iomap iteration naturally handles multi-extent
> ranges: each call returns the mapping for the current segment, and
> unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
> This means the common case of mixed written/unwritten extents never
> reaches ext4_iomap_alloc() at all.
> 
> Even for the less common case where the range contains a hole and
> ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
> aligned non-extending writes:
> 
>  - truncate/punch_hole are kept out: they require exclusive i_rwsem
>    (blocked by our shared lock during allocation), and inode_dio_begin()
>    keeps their inode_dio_wait() blocked until in-flight bios complete.
>  - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
>    extent tree modifications (parallel writers to the same hole).
>  - The journal handle is per-thread and does not require i_rwsem
>    exclusion.
>  - i_disksize and orphan list are not involved in non-extending writes.
> 
> Skip the ext4_overwrite_io() check entirely for aligned writes by
> initializing overwrite to true and only calling ext4_overwrite_io() for
> unaligned writes. Unaligned writes still need the extent state check
> because concurrent partial block zeroing in the DIO layer requires
> exclusive serialization unless the range is a pure written-extent
> overwrite.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Aligned 8K DIO writes spanning written+unwritten extent boundaries.
> Each thread writes its own 1G region sequentially; the file is rebuilt
> between runs so every block is written exactly once. Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      42,322       43,329      1.02x
>      2      68,516       70,677      1.03x
>      4      62,489       97,072      1.55x
>      8      58,701      110,819      1.89x
>     16      58,569      116,392      1.99x
>     32      60,860      117,244      1.93x
> 
> Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Ha, this is a good idea and a nice speed-up!

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..6f3886465ce3 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   * condition requires an exclusive inode lock. If yes, then we restart the
>   * whole operation by releasing the shared lock and acquiring exclusive lock.
>   *
> - * - For unaligned_io we never take shared lock as it may cause data corruption
> - *   when two unaligned IO tries to modify the same block e.g. while zeroing.
> + * The decision is layered, evaluated in this order:
>   *
> - * - For extending writes case we don't take the shared lock, since it requires
> - *   updating inode i_disksize and/or orphan handling with exclusive lock.
> + * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
> + *    to the exclusive lock -- the security update itself requires it,
> + *    regardless of whether the write extends the file or is aligned.
>   *
> - * - shared locking will only be true mostly with overwrites, including
> - *   initialized blocks and unwritten blocks.
> + * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
> + *    lock to safely update i_disksize and the orphan list, regardless of
> + *    alignment.
>   *
> - * - Otherwise we will switch to exclusive i_rwsem lock.
> + * 3. Otherwise, for aligned non-extending writes, shared lock is always
> + *    sufficient regardless of extent state (written, unwritten, or hole).
> + *    truncate/punch_hole cannot run while we hold the shared i_rwsem
> + *    (they need it exclusively); after we release it, inode_dio_begin()
> + *    keeps their inode_dio_wait() blocked until in-flight bios complete.
> + *    i_data_sem serializes concurrent extent tree modifications.
> + *
> + * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> + *    only safe for pure written-extent overwrites. Unwritten extents or
> + *    holes require exclusive lock because concurrent partial block zeroing
> + *    in the DIO layer could corrupt data.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite, unaligned_io, unwritten;
> +	bool overwrite = true, unaligned_io, unwritten = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  
>  	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
> -	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
>  
>  	/*
> -	 * Determine whether we need to upgrade to an exclusive lock. This is
> -	 * required to change security info in file_modified(), for extending
> -	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
> -	 * extents (as partial block zeroing may be required).
> -	 *
> -	 * Note that unaligned writes are allowed under shared lock so long as
> -	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
> -	 * data corruption due to partial block zeroing in the dio layer, and so
> -	 * the I/O must occur exclusively.
> +	 * For unaligned writes we need to know the extent state to determine
> +	 * whether shared lock is safe. For aligned writes we skip this check
> +	 * entirely since allocation under shared lock is safe.
>  	 */
> +	if (unaligned_io)
> +		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +
> +	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
> -	     (unaligned_io && unwritten)))) {
> +	    ((!IS_NOSEC(inode) || *extend ||
> +	     (unaligned_io && (!overwrite || unwritten))))) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
>  	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is non-overwrite or extending, so drain all outstanding dio and
> -	 * set the force wait dio flag.
> +	 * write is unaligned non-overwrite or extending, so drain all
> +	 * outstanding dio and set the force wait dio flag.
>  	 */
>  	if (!*ilock_shared && (unaligned_io || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {


^ permalink raw reply
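
The layered lock decision described in the patch comment above can be sketched as a standalone predicate. This is only an illustration under stated assumptions: the struct and function names are hypothetical, and the boolean fields stand in for the results of IS_NOSEC(), ext4_extending_io(), ext4_unaligned_io() and ext4_overwrite_io() in the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical snapshot of the inputs to the decision. In the kernel
 * these come from IS_NOSEC(), ext4_extending_io(), ext4_unaligned_io()
 * and ext4_overwrite_io(); here they are plain flags.
 */
struct dio_write_state {
	bool needs_sec_update;	/* !IS_NOSEC(inode) */
	bool extending;		/* write extends i_size or i_disksize */
	bool unaligned;		/* write is not block aligned */
	bool overwrite;		/* only consulted when unaligned */
	bool unwritten;		/* only consulted when unaligned */
};

/* Returns true when the shared i_rwsem must be upgraded to exclusive. */
static bool dio_needs_exclusive_lock(const struct dio_write_state *s)
{
	/* 1. A security info update always needs the exclusive lock. */
	if (s->needs_sec_update)
		return true;

	/* 2. Extending writes update i_disksize and the orphan list. */
	if (s->extending)
		return true;

	/* 3. Aligned non-extending writes: shared lock is sufficient. */
	if (!s->unaligned)
		return false;

	/* 4. Unaligned: only pure written-extent overwrites stay shared. */
	return !s->overwrite || s->unwritten;
}
```

The ordering mirrors the numbered list in the comment: the extent state is consulted only in the last step, which is why the patch can skip the ext4_overwrite_io() lookup for aligned writes.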

* [PATCH v3] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-14  3:54 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode state in ext4_xattr_inode_iget() to prevent
WARN_ONCE triggers in ext4_xattr_inode_update_ref() and reject
corrupted EA inodes before they can cause further damage.

When a corrupted ext4 image has an EA inode with inconsistent i_nlink
and ref_count values, the code currently allows it through and later
hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
While this is not a security or stability issue -- it only fires on
crafted filesystem images and merely prints a call trace -- it is
better handled as an early sanity check that returns -EFSCORRUPTED,
consistent with how ext4 treats other on-disk corruption.

Since ext4_xattr_inode_iget() resolves references from active xattr
entries, the target EA inode must be in active state (i_nlink=1,
ref_count>0).  Reject any inode that does not satisfy this.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3:
 - Move check after Lustre branch to avoid false positives on Lustre EA inodes
 - Merge into single condition: i_nlink != 1 || !ref_count
 - Add make_bad_inode() before iput() to avoid truncation in active txn

v2:
 - Add ref_count validation to also catch i_nlink=1, ref_count=0 case

 fs/ext4/xattr.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..77c11e4096bb 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -464,6 +464,27 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 		inode_unlock(inode);
 	}
 
+	/*
+	 * Since this function resolves references from active xattr entries,
+	 * the EA inode must be in active state (i_nlink=1, ref_count>0).
+	 * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
+	 * (inconsistent with an active entry) all indicate corruption.
+	 */
+	if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
+			   ea_ino, inode->i_nlink,
+			   ext4_xattr_inode_get_ref(inode));
+		/*
+		 * Mark rejected inode to prevent ext4_evict_inode() from
+		 * attempting truncation on a corrupted inode within an active
+		 * transaction, which could exhaust journal credits.
+		 */
+		make_bad_inode(inode);
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	*ea_inode = inode;
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related
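
The single condition used by the v3 patch above (i_nlink != 1 || !ref_count) can be sketched in isolation as follows. The struct and function names are hypothetical stand-ins for illustration; the kernel reads these values from the inode and from ext4_xattr_inode_get_ref().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two values the check inspects. */
struct ea_inode_state {
	unsigned int nlink;	/* inode->i_nlink */
	uint64_t ref_count;	/* ext4_xattr_inode_get_ref(inode) */
};

/*
 * An EA inode reached through an active xattr entry must be in the
 * active state: exactly one link and a non-zero reference count.
 * Anything else is treated as corruption by the patch.
 */
static bool ea_inode_is_active(const struct ea_inode_state *s)
{
	return s->nlink == 1 && s->ref_count > 0;
}
```

Note that the orphaned state (i_nlink=0, ref_count=0) is also rejected here, because an inode in that state should not be reachable from an active xattr entry.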

* [PATCH ext4 v3] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:18 UTC (permalink / raw)
  To: linux-ext4, Jan Kara
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() can read a dirent header past the end of its inline
buffer, triggering a slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
  Read of size 2 at addr ffff88800f3dd23c by task exploit/148
   ...
   __ext4_check_dir_entry
   ext4_read_inline_dir
   iterate_dir

The dirent payload lives in a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in a position space extra_offset bytes larger
(extra_size = extra_offset + inline_size) so the synthetic "." and ".."
land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
extra_offset", yet the loop bound (ctx->pos < extra_size) and the
ext4_check_dir_entry() length argument both use the larger extra_size. A
ctx->pos that is valid in extra_size space but whose de lies past
inline_size is therefore accepted, and the rescan loop's rec_len probe
and ext4_check_dir_entry() dereference de->rec_len before the entry is
rejected.

Bound the dirent header by inline_size in both loops: break out of the
rescan loop once a minimum-size header no longer fits, reject such a
position in the main loop before forming de, and pass inline_size rather
than extra_size to ext4_check_dir_entry().

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 fs/ext4/inline.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..1266c8827cca 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,18 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}

+		/*
+		 * de lives at dir_buf + ctx->pos - extra_offset, within the
+		 * kmalloc(inline_size) buffer.  Make sure its header fits before
+		 * ext4_check_dir_entry() dereferences de->rec_len.
+		 */
+		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0

^ permalink raw reply related
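
The mismatch between the position space (bounded by extra_size) and the buffer (bounded by inline_size) that the patch above addresses can be sketched as a standalone check. This is a minimal illustration, not the kernel code: the function name is hypothetical, and MIN_DIRENT_LEN assumes that ext4_dir_rec_len(1, NULL) evaluates to 12 bytes (an 8-byte header plus a 1-byte name, rounded up to 4).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value of ext4_dir_rec_len(1, NULL); see the note above. */
#define MIN_DIRENT_LEN 12

/*
 * pos is a readdir position in the logical space that is extra_offset
 * bytes larger than the buffer. The dirent itself would be read from
 * dir_buf + pos - extra_offset, so its minimal header has to fit
 * within inline_size, not within extra_size.
 *
 * The patch omits the lower-bound test because the rescan path never
 * leaves ctx->pos below extra_offset at this point; this standalone
 * sketch keeps it only so that the subtraction cannot wrap.
 */
static bool dirent_header_fits(long long pos, size_t extra_offset,
			       size_t inline_size)
{
	if (pos < (long long)extra_offset)
		return false;
	return (size_t)(pos - (long long)extra_offset) + MIN_DIRENT_LEN
		<= inline_size;
}
```

With example values extra_offset = 20 and inline_size = 60 (so extra_size = 80), a position of 72 is below extra_size yet maps to buffer offset 52, where a 12-byte header would end at byte 64, past the 60-byte allocation. That is the kind of position the old extra_size bound let through.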

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <CAPpSM+TS=mXfKXqN65ic5xFA+S3YvJfa5BuSQ5VwEA1O19VqzA@mail.gmail.com>

On Sat, Jun 13, 2026 at 3:06 PM Xiang Mei <xmei5@asu.edu> wrote:
>
> On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
> >
> > What does the 'net' in [PATCH net v2] mean?
> >
> > On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > > inline buffer for a crafted or corrupted inline directory, triggering a
> > > slab-out-of-bounds read during getdents64():
> > >
> > >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> > >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> > >    ...
> > >    kasan_report (mm/kasan/report.c:595)
> > >    filldir64 (fs/readdir.c:371)
> > >    iterate_dir (fs/readdir.c:110)
> > >    ...
> > >
> > > The payload is copied into a buffer of exactly inline_size bytes:
> > >
> > >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> > >
> > > but iteration runs in a logical position space extra_offset bytes larger
> > > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > > "." and ".." entries land at the offsets they would have in a block-based
> > > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > > the loop bounds and the ext4_check_dir_entry() length argument are all
> > > expressed in the larger extra_size. Two reachable sites dereference a
> > > dirent before confirming its physical offset is inside the allocation:
> > >
> > > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > > running up to extra_offset bytes past the allocation before its length
> > > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > > that only bounds the top of the buffer is bypassed by underflow and de is
> > > formed before dir_buf.
> > >
> > > In the cookie-rescan loop, entered when i_version changed since the last
> > > readdir(2), the walk restarts from the beginning with i bounded by
> > > extra_size, so as i approaches extra_size the unconditional read of
> > > de->rec_len runs past the allocation before any validation.
> > >
> > > Both are the same defect, logical extra_size space versus the physical
> > > inline_size buffer. In each loop, reject a dirent whose header would not
> > > fit within inline_size before forming de, and in the main loop also reject a
> > > position that underflows below extra_offset. Validate the main-loop entry
> > > against inline_size rather than extra_size. Entries that legitimately fill
> > > the inline data still pass.
> > >
> > > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > > Assisted-by: Claude:claude-opus-4-8
> > > Signed-off-by: Xiang Mei <xmei5@asu.edu>
> >
> > Thanks for the analysis and the patch. See some suggestions for improvement
> > below:
> >
> > > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> > >                       continue;
> > >               }
> > >
> > > +             /*
> > > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > > +              * header must fit within inline_size.  ctx->pos is a signed,
> > > +              * lseek()-controlled loff_t: check the lower bound first, or
> > > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > > +              */
> > > +             if (ctx->pos < extra_offset ||
> > > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > > +                 inline_size)
> > > +                     goto out;
> >
> > So I don't think this is really possible. ctx->pos isn't really fully user
> > controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> > set info->cookie to invalid value so the next time we enter
> > ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> > it to 0, dotdot_offset, or some value greater than extra_size.
>
> Please ignore v3. It doesn't actually fix the original bug, and the PoC
> still triggers KASAN with v3 applied.
>
> You're right that lseek() poisons info->cookie and the next
> ext4_read_inline_dir() always takes the !inode_eq_iversion() rescan path.
> But the rescan doesn't reset ctx->pos to a safe value. It walks forward:
>
> i += ext4_rec_len_from_disk(de->rec_len, extra_size);
>
> i.e. it advances i by the rec_len field of each inline dirent, which is
> attacker-controlled, and the only per-entry check is the "non-zero rec_len"
> probe (which itself reads de->rec_len). It then commits "offset = i;
> ctx->pos = offset;". So the rescan can legitimately land ctx->pos at a
> position whose dirent header already lies past inline_size - it is not
> clamped to a valid in-bounds entry.
>
> The main loop then forms de = dir_buf + ctx->pos - extra_offset and calls
> ext4_check_dir_entry(), which dereferences de->rec_len before the length
> check fires. With v3 (which only fixed the rescan probe and the size
> argument, but had no bounds check on the committed ctx->pos), this still
> reads past the buffer:
>
> BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
> Read of size 2 at addr ... (0 bytes to the right of a 60-byte
> kmalloc-64 region)
> __ext4_check_dir_entry
> ext4_read_inline_dir
> iterate_dir
>
> You were right about one thing: the lower-bound (ctx->pos < extra_offset,
> "de before dir_buf") case I once guarded against can't happen, since
> ctx->pos never comes in below the dotdot offsets. v4 drops that and keeps
> only the upper-bound check in the main loop, which I've confirmed stops the
> reproducer. Sorry for sending a patch I hadn't fully verified.
>
> v4 to follow.
My bad again: I never actually sent out the broken v3, so the corrected
patch will be posted as v3 rather than v4.
v3 to follow.

Xiang
>
> Xiang
>
> >
> > >               de = (struct ext4_dir_entry_2 *)
> > >                       (dir_buf + ctx->pos - extra_offset);
> > >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > > -                                      extra_size, ctx->pos))
> > > +                                      inline_size, ctx->pos))
> > >                       goto out;
> > >               if (le32_to_cpu(de->inode)) {
> > >                       if (!dir_emit(ctx, de->name, de->name_len,
> >
> > Otherwise the patch looks good.
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 22:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <p2kykhe5532mxqgnmgyrg24cmlpcz24qn2erw3lkzabjjubbyq@i2c3mfdm7mpc>

On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
>
> What does the 'net' in [PATCH net v2] mean?
>
> On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > inline buffer for a crafted or corrupted inline directory, triggering a
> > slab-out-of-bounds read during getdents64():
> >
> >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> >    ...
> >    kasan_report (mm/kasan/report.c:595)
> >    filldir64 (fs/readdir.c:371)
> >    iterate_dir (fs/readdir.c:110)
> >    ...
> >
> > The payload is copied into a buffer of exactly inline_size bytes:
> >
> >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> >
> > but iteration runs in a logical position space extra_offset bytes larger
> > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > "." and ".." entries land at the offsets they would have in a block-based
> > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > the loop bounds and the ext4_check_dir_entry() length argument are all
> > expressed in the larger extra_size. Two reachable sites dereference a
> > dirent before confirming its physical offset is inside the allocation:
> >
> > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > running up to extra_offset bytes past the allocation before its length
> > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > that only bounds the top of the buffer is bypassed by underflow and de is
> > formed before dir_buf.
> >
> > In the cookie-rescan loop, entered when i_version changed since the last
> > readdir(2), the walk restarts from the beginning with i bounded by
> > extra_size, so as i approaches extra_size the unconditional read of
> > de->rec_len runs past the allocation before any validation.
> >
> > Both are the same defect, logical extra_size space versus the physical
> > inline_size buffer. In each loop, reject a dirent whose header would not
> > fit within inline_size before forming de, and in the main loop also reject a
> > position that underflows below extra_offset. Validate the main-loop entry
> > against inline_size rather than extra_size. Entries that legitimately fill
> > the inline data still pass.
> >
> > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Assisted-by: Claude:claude-opus-4-8
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
>
> Thanks for the analysis and the patch. See some suggestions for improvement
> below:
>
> > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> >                       continue;
> >               }
> >
> > +             /*
> > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > +              * header must fit within inline_size.  ctx->pos is a signed,
> > +              * lseek()-controlled loff_t: check the lower bound first, or
> > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > +              */
> > +             if (ctx->pos < extra_offset ||
> > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                 inline_size)
> > +                     goto out;
>
> So I don't think this is really possible. ctx->pos isn't really fully user
> controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> set info->cookie to invalid value so the next time we enter
> ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> it to 0, dotdot_offset, or some value greater than extra_size.

Please ignore v3. It doesn't actually fix the original bug, and the PoC
still triggers KASAN with v3 applied.

You're right that lseek() poisons info->cookie and the next
ext4_read_inline_dir() always takes the !inode_eq_iversion() rescan path.
But the rescan doesn't reset ctx->pos to a safe value. It walks forward:

i += ext4_rec_len_from_disk(de->rec_len, extra_size);

i.e. it advances i by the rec_len field of each inline dirent, which is
attacker-controlled, and the only per-entry check is the "non-zero rec_len"
probe (which itself reads de->rec_len). It then commits "offset = i;
ctx->pos = offset;". So the rescan can legitimately land ctx->pos at a
position whose dirent header already lies past inline_size - it is not
clamped to a valid in-bounds entry.

The main loop then forms de = dir_buf + ctx->pos - extra_offset and calls
ext4_check_dir_entry(), which dereferences de->rec_len before the length
check fires. With v3 (which only fixed the rescan probe and the size
argument, but had no bounds check on the committed ctx->pos), this still
reads past the buffer:

BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
Read of size 2 at addr ... (0 bytes to the right of a 60-byte
kmalloc-64 region)
__ext4_check_dir_entry
ext4_read_inline_dir
iterate_dir

You were right about one thing: the lower-bound (ctx->pos < extra_offset,
"de before dir_buf") case I once guarded against can't happen, since
ctx->pos never comes in below the dotdot offsets. v4 drops that and keeps
only the upper-bound check in the main loop, which I've confirmed stops the
reproducer. Sorry for sending a patch I hadn't fully verified.

v4 to follow.

Xiang

>
> >               de = (struct ext4_dir_entry_2 *)
> >                       (dir_buf + ctx->pos - extra_offset);
> >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > -                                      extra_size, ctx->pos))
> > +                                      inline_size, ctx->pos))
> >                       goto out;
> >               if (le32_to_cpu(de->inode)) {
> >                       if (!dir_emit(ctx, de->name, de->name_len,
>
> Otherwise the patch looks good.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-13 21:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <p2kykhe5532mxqgnmgyrg24cmlpcz24qn2erw3lkzabjjubbyq@i2c3mfdm7mpc>

On Wed, Jun 10, 2026 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
>
> What does the 'net' in [PATCH net v2] mean?
>
Sorry, it was a mistake. It should be ext4.
> On Mon 08-06-26 18:07:39, Xiang Mei wrote:
> > ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> > inline buffer for a crafted or corrupted inline directory, triggering a
> > slab-out-of-bounds read during getdents64():
> >
> >   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
> >   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
> >    ...
> >    kasan_report (mm/kasan/report.c:595)
> >    filldir64 (fs/readdir.c:371)
> >    iterate_dir (fs/readdir.c:110)
> >    ...
> >
> > The payload is copied into a buffer of exactly inline_size bytes:
> >
> >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> >
> > but iteration runs in a logical position space extra_offset bytes larger
> > than the buffer (extra_size = extra_offset + inline_size), so the synthetic
> > "." and ".." entries land at the offsets they would have in a block-based
> > directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
> > the loop bounds and the ext4_check_dir_entry() length argument are all
> > expressed in the larger extra_size. Two reachable sites dereference a
> > dirent before confirming its physical offset is inside the allocation:
> >
> > In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
> > is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
> > running up to extra_offset bytes past the allocation before its length
> > check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
> > below extra_offset makes "ctx->pos - extra_offset" negative, so a check
> > that only bounds the top of the buffer is bypassed by underflow and de is
> > formed before dir_buf.
> >
> > In the cookie-rescan loop, entered when i_version changed since the last
> > readdir(2), the walk restarts from the beginning with i bounded by
> > extra_size, so as i approaches extra_size the unconditional read of
> > de->rec_len runs past the allocation before any validation.
> >
> > Both are the same defect, logical extra_size space versus the physical
> > inline_size buffer. In each loop, reject a dirent whose header would not
> > fit within inline_size before forming de, and in the main loop also reject a
> > position that underflows below extra_offset. Validate the main-loop entry
> > against inline_size rather than extra_size. Entries that legitimately fill
> > the inline data still pass.
> >
> > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Assisted-by: Claude:claude-opus-4-8
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
>
> Thanks for the analysis and the patch. See some suggestions for improvement
> below:
>
> > @@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
> >                       continue;
> >               }
> >
> > +             /*
> > +              * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
> > +              * header must fit within inline_size.  ctx->pos is a signed,
> > +              * lseek()-controlled loff_t: check the lower bound first, or
> > +              * ctx->pos < extra_offset underflows and points de before dir_buf.
> > +              */
> > +             if (ctx->pos < extra_offset ||
> > +                 ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                 inline_size)
> > +                     goto out;
>
> So I don't think this is really possible. ctx->pos isn't really fully user
> controlled. When you use seek to modify ctx->pos, ext4_dir_llseek() does
> set info->cookie to invalid value so the next time we enter
> ext4_read_inline_dir() we are guaranteed to revalidate the offset and reset
> it to 0, dotdot_offset, or some value greater than extra_size.

You're right, thanks. The rescan path rebuilds ctx->pos, so the
underflow can't happen. I'll drop this in v3.

Xiang
>
> >               de = (struct ext4_dir_entry_2 *)
> >                       (dir_buf + ctx->pos - extra_offset);
> >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > -                                      extra_size, ctx->pos))
> > +                                      inline_size, ctx->pos))
> >                       goto out;
> >               if (le32_to_cpu(de->inode)) {
> >                       if (!dir_emit(ctx, de->name, de->name_len,
>
> Otherwise the patch looks good.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] e2fsprogs/resize2fs: allow online resizing with CONFIG_BLK_DEV_WRITE_MOUNTED disabled
From: Theodore Tso @ 2026-06-13 15:41 UTC (permalink / raw)
  To: Hanno Böck; +Cc: linux-ext4, theodore.tso
In-Reply-To: <20260613110507.67289f68@hboeck.de>

On Sat, Jun 13, 2026 at 11:05:07AM +0200, Hanno Böck wrote:
> Kernels since version 6.8 have an option CONFIG_BLK_DEV_WRITE_MOUNTED
> that, if disabled, prevents accidental writes to and data corruption of
> mounted devices.
> 
> This breaks online resizing with resize2fs, as it opens the device with
> O_RDWR. However, it appears that this is not necessary and by changing
> it to O_RDONLY, online resizing works again.
> 
> See also: https://unix.stackexchange.com/a/796881
> 
> Signed-off-by: Hanno Böck <hanno@hboeck.de>

If you try running e2fsprogs's regression tests (using "make check")
with your patch applied, you'll find:

% make -j24 ; make -j24 check

...
t_mmp_2off: disable MMP using tune2fs: ok
377 tests succeeded	15 tests failed
Tests failed: m_minrootdir r_1024_small_bg r_32to64bit_expand_full r_bigalloc_big_expand r_ext4_small_bg r_fixup_lastbg_big r_fixup_lastbg r_inline_xattr r_min_itable r_move_inode_int_extent r_move_itable r_move_itable_nostride r_move_itable_realloc r_orphan_file r_resize_inode
make[1]: *** [Makefile:403: test_post] Error 1
make[1]: Leaving directory '/build/e2fsprogs/tests'
make: *** [Makefile:431: check-recursive] Error 1

Specifically, your change breaks off-line resizes, where the file
system is not mounted.  E2fsprogs builds on FreeBSD, GNU/Hurd, MacOS,
and other operating systems which don't support online resizing, which
is a Linux-only feature.

Furthermore, even on Linux, online resizing only supports growing the
file system.  If you want to shrink the file system, this must be done
using off-line resizing, with the file system unmounted.

And as I pointed out on another e-mail thread, fixing resize2fs is not
sufficient to disable CONFIG_BLK_DEV_WRITE_MOUNTED and retain full
functionality.  Specifically, certain file system parameters can be
modified using tune2fs while the file system is mounted.  Disabling
CONFIG_BLK_DEV_WRITE_MOUNTED will break this.

The CONFIG_BLK_DEV_WRITE_MOUNTED option was added specifically to
keep maintainers from drowning in syzbot-generated noise; syzbot
will gleefully report "security failures" caused by being able to
modify the block device while the file system is mounted.  The syzbot
scanner runs as root, which means it's great for inflating security
reports, at the cost of overloading maintainers who often deal with
syzbot reports late at night or on weekends, or have decided to just
ignore them since the signal is drowned out by the noise.  It was
**not** intended for users to disable that feature on production
systems.

At some point in the future, the following things need to happen
before blocking writes to the block device can be enabled for real:

*) E2fsprogs's tune2fs program needs to be taught to use the
 EXT4_IOC_GET_TUNE_SB_PARAM and EXT4_IOC_SET_TUNE_SB_PARAM ioctls so
 that the changes that are allowed to be made while the file system is
 mounted are done via these ioctls.

*) The kernel is enhanced so that write access to a block device with a
 mounted file system can be blocked on a per "struct super" basis, so
 the file system can block access either on a per file system basis, or
 on a per mount basis based on a mount option.

*) We can only enable blocking block device writes to mounted file
 systems when we are sure that users have both a kernel and e2fsprogs
 with support for EXT4_IOC_[SG]ET_TUNE_SB_PARAM.

Also note that, at the moment, disabling block device writes to
mounted file systems is security theatre rivaled only by the airport
metal detectors operated by the United States Transportation Security
Administration.  That's because !CONFIG_BLK_DEV_WRITE_MOUNTED can be
trivially bypassed by using the loop device.

Cheers,

					- Ted

> --- a/resize/main.c	2026-03-06 18:17:36.000000000 +0100
> +++ b/resize/main.c	2026-06-13 10:53:46.165030199 +0200
> @@ -256,7 +256,7 @@ int main (int argc, char ** argv)
>  	int		force_min_size = 0;
>  	int		print_min_size = 0;
>  	int		fd, ret;
> -	int		open_flags = O_RDWR;
> +	int		open_flags = O_RDONLY;
>  	blk64_t		new_size = 0;
>  	blk64_t		max_size = 0;
>  	blk64_t		min_size = 0;
> @@ -364,9 +364,6 @@ int main (int argc, char ** argv)
>  		len = 2 * len;
>  	}
>  
> -	if (print_min_size)
> -		open_flags = O_RDONLY;
> -
>  	fd = ext2fs_open_file(device_name, open_flags, 0);
>  	if (fd < 0) {
>  		com_err("open", errno, _("while opening %s"),
> 

^ permalink raw reply

* [PATCH v2] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-13 14:14 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode i_nlink early in ext4_xattr_inode_iget() to convert
a WARN_ONCE in ext4_xattr_inode_update_ref() into a graceful error
return.

When a corrupted ext4 image has an EA inode with inconsistent i_nlink
and ref_count values, the code currently allows it through and later
hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
While this is not a security or stability issue -- it only fires on
crafted filesystem images and merely prints a call trace -- it is
better handled as an early sanity check that returns -EFSCORRUPTED,
consistent with how ext4 treats other on-disk corruption.

The valid states for an EA inode are:
  - i_nlink=0, ref_count=0: orphaned, pending deletion
  - i_nlink=1, ref_count>0: active, referenced

Reject any EA inode that does not match one of these states at iget
time.
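
For illustration only, the two valid states above can be written as a
standalone predicate.  This is a minimal userspace sketch, not the
kernel code: ea_inode_state_valid() is a hypothetical helper name, and
its arguments stand in for inode->i_nlink and the value returned by
ext4_xattr_inode_get_ref().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of fs/ext4/xattr.c).
 *
 * Valid EA inode states:
 *   nlink == 0 && ref_count == 0   orphaned, pending deletion
 *   nlink == 1 && ref_count  > 0   active, referenced
 * Everything else is treated as on-disk corruption.
 */
bool ea_inode_state_valid(unsigned int nlink, uint64_t ref_count)
{
	/* An EA inode never has more than one link. */
	if (nlink > 1)
		return false;

	/* nlink and ref_count must be both zero or both non-zero. */
	return (nlink != 0) == (ref_count != 0);
}
```

With this predicate, (0, 0) and (1, 5) are accepted, while (1, 0),
(0, 3) and (65535, 0) are rejected, which matches the three conditions
tested in the patch.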

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..4deb17b7bcbe 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -448,6 +448,23 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 	}
 	ext4_xattr_inode_set_class(inode);
 
+	/*
+	 * EA inode has two valid states:
+	 *   i_nlink=0, ref_count=0: orphaned, pending deletion
+	 *   i_nlink=1, ref_count>0: active, referenced by one or more inodes
+	 * Anything else indicates on-disk corruption.
+	 */
+	if (inode->i_nlink > 1 ||
+	    (inode->i_nlink && !ext4_xattr_inode_get_ref(inode)) ||
+	    (!inode->i_nlink && ext4_xattr_inode_get_ref(inode))) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
+			   ea_ino, inode->i_nlink,
+			   ext4_xattr_inode_get_ref(inode));
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Check whether this is an old Lustre-style xattr inode. Lustre
 	 * implementation does not have hash validation, rather it has a
-- 
2.43.0


^ permalink raw reply related

* [PATCH] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Yun Zhou @ 2026-06-13 13:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Validate EA inode i_nlink early in ext4_xattr_inode_iget() to convert
a WARN_ONCE in ext4_xattr_inode_update_ref() into a graceful error
return.

When a corrupted ext4 image has an EA inode with i_nlink set to an
invalid value (e.g. 65535), the code currently allows it through and
later hits a WARN_ONCE when ref_count reaches 0.  While this is not a
security or stability issue -- it only fires on crafted filesystem
images and merely prints a call trace -- it is better handled as an
early sanity check that returns -EFSCORRUPTED, consistent with how ext4
treats other on-disk corruption.

An EA inode should only ever have i_nlink of 0 (orphaned, pending
deletion) or 1 (active).  Reject anything above 1 at iget time.

Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..2fc41a06a446 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -448,6 +448,14 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 	}
 	ext4_xattr_inode_set_class(inode);

+	if (inode->i_nlink > 1) {
+		ext4_error(parent->i_sb,
+			   "EA inode %lu has unexpected i_nlink=%u",
+			   ea_ino, inode->i_nlink);
+		iput(inode);
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Check whether this is an old Lustre-style xattr inode. Lustre
 	 * implementation does not have hash validation, rather it has a
-- 
2.43.0

^ permalink raw reply related

* [PATCH] e2fsprogs/resize2fs: allow online resizing with CONFIG_BLK_DEV_WRITE_MOUNTED disabled
From: Hanno Böck @ 2026-06-13  9:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: theodore.tso

Kernels since version 6.8 have an option CONFIG_BLK_DEV_WRITE_MOUNTED
that, if disabled, prevents accidental writes to and data corruption of
mounted devices.

This breaks online resizing with resize2fs, as it opens the device with
O_RDWR. However, it appears that this is not necessary and by changing
it to O_RDONLY, online resizing works again.
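
As a rough illustration of what a read-only open still permits, the
sketch below opens a path with O_RDONLY and checks for the ext2/3/4
superblock magic (0xEF53, stored little-endian at byte 56 of the
superblock, which starts at byte 1024 of the device).  It is not taken
from resize2fs; sb_has_ext_magic() and probe_readonly() are
hypothetical names, and the sketch says nothing about whether the
resize itself works without write access.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SB_OFFSET	1024	/* superblock starts 1 KiB into the device */
#define SB_MAGIC_OFFSET	56	/* offset of s_magic within the superblock */
#define EXT_MAGIC	0xEF53u

/* Return 1 if the buffer holds a superblock with the ext magic, else 0. */
int sb_has_ext_magic(const unsigned char *sb, size_t len)
{
	unsigned int magic;

	if (len < SB_MAGIC_OFFSET + 2)
		return 0;
	/* s_magic is a little-endian 16-bit field. */
	magic = sb[SB_MAGIC_OFFSET] |
		((unsigned int)sb[SB_MAGIC_OFFSET + 1] << 8);
	return magic == EXT_MAGIC;
}

/*
 * Open path read-only and probe for the magic.
 * Returns 1 if found, 0 if not, -1 on open or read failure.
 */
int probe_readonly(const char *path)
{
	unsigned char sb[1024];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, sb, sizeof(sb), SB_OFFSET);
	close(fd);
	if (n < 0)
		return -1;
	return sb_has_ext_magic(sb, (size_t)n);
}
```

Calling probe_readonly() on a device or image file needs only read
permission on that path, so the open itself is not a write to the
device.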

See also: https://unix.stackexchange.com/a/796881

Signed-off-by: Hanno Böck <hanno@hboeck.de>
--- a/resize/main.c	2026-03-06 18:17:36.000000000 +0100
+++ b/resize/main.c	2026-06-13 10:53:46.165030199 +0200
@@ -256,7 +256,7 @@ int main (int argc, char ** argv)
 	int		force_min_size = 0;
 	int		print_min_size = 0;
 	int		fd, ret;
-	int		open_flags = O_RDWR;
+	int		open_flags = O_RDONLY;
 	blk64_t		new_size = 0;
 	blk64_t		max_size = 0;
 	blk64_t		min_size = 0;
@@ -364,9 +364,6 @@ int main (int argc, char ** argv)
 		len = 2 * len;
 	}
 
-	if (print_min_size)
-		open_flags = O_RDONLY;
-
 	fd = ext2fs_open_file(device_name, open_flags, 0);
 	if (fd < 0) {
 		com_err("open", errno, _("while opening %s"),

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-12 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260612052831.GA9010@lst.de>

On Fri, Jun 12, 2026 at 07:28:31AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> > On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > > It's entirely possible a device supports byte aligned addresses. The
> > > > block layer just doesn't let a driver report that. So either it really
> > > > was successful because you found a bug that skips the alignment checks,
> > > > or your device silently corrupted your payload.
> > 
> > I tried this on different hardware, I find it hard to say all those
> > devices were corrupting the payload.
> 
> I think in the other thread we agreed that we are currently missing
> the alignment check for fast-path bios not hitting the splitting code,
> so maybe that is something you see.  Additionally we're missing the
> checks for purely bio based drivers not calling the splitting helper
> at all, but I don't think that applies here.
> 
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > > > though, in not taking the optimization when it was possible. So here's
> > > > an alternative suggestion that should get things working as expected:
> > > 
> > > The fix below looks like it is addressing a real bug.  I'm not sure if
> > > Carlos is hitting it, but we were missing the alignment checks for
> > > single-bvec fast path bios so far indeed.
> > 
> > You left context out so I'm assuming by the fix you meant Keith's patch.
> 
> Yes.

The fix does indeed seem to resolve the behavior I'm seeing. Keith,
could you Cc me if you end up sending an official version?

^ permalink raw reply
