Linux EXT4 FS development
* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-12 13:23 UTC
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260612052831.GA9010@lst.de>

On Fri, Jun 12, 2026 at 07:28:31AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> > On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > > It's entirely possible a device supports byte aligned addresses. The
> > > > block layer just doesn't let a driver report that. So either it really
> > > > was successful because you found a bug that skips the alignment checks,
> > > > or your device silently corrupted your payload.
> > 
> > I tried this on different hardware, I find it hard to say all those
> > devices were corrupting the payload.
> 
> I think in the other thread we agreed that we are currently missing
> the alignment check for fast-path bios not hitting the splitting code,
> so maybe that is something you see.  Additionally we're missing the
> checks for purely bio based drivers not calling the splitting helper
> at all, but I don't think that applies here.
> 
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > > > though, in not taking the optimization when it was possible. So here's
> > > > an alternative suggestion that should get things working as expected:
> > > 
> > > The fix below looks like it is addressing a real bug.  I'm not sure if
> > > Carlos is hitting it, but we were missing the alignment checks for
> > > single-bvec fast path bios so far indeed.
> > 
> > You left context out so I'm assuming by the fix you meant Keith's patch.
> 
> Yes.

The fix does seem to correct the behavior I'm seeing. Keith, could you Cc
me if you end up sending an official version?
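
For readers following the thread: Keith's patch is not quoted in this message, so the following is only a minimal userspace sketch of the kind of check being discussed, not the actual fix. It assumes the alignment requirement is a power-of-two mask and that the check must run for every request, including a single-segment one that would otherwise take a fast path. The names dio_segment and dio_segments_aligned are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous memory segment of a direct I/O request. */
struct dio_segment {
	uintptr_t addr;	/* start address of the buffer segment */
	size_t len;	/* segment length in bytes */
};

/*
 * Return true if every segment starts on an (align_mask + 1)-byte boundary
 * and has a length that is a multiple of (align_mask + 1).
 * align_mask must be a power of two minus one, e.g. 511 for 512 bytes.
 */
static bool dio_segments_aligned(const struct dio_segment *segs, size_t nr,
				 size_t align_mask)
{
	assert((align_mask & (align_mask + 1)) == 0);

	for (size_t i = 0; i < nr; i++) {
		/* One OR tests address and length against the mask at once. */
		if ((segs[i].addr | segs[i].len) & align_mask)
			return false;
	}
	return true;
}
```

The loop is deliberately run even when nr is 1: the gap described above is that single-segment requests skipped the check, so a sketch that special-cased them would reproduce the bug.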


* [PATCH v3] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12 13:19 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260612095846.1024470-1-yun.zhou@windriver.com>

ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem.  If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
    ext4_xattr_block_set()
      iput(ea_inode)
        write_inode_now()
          ext4_writepages()
            s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism.  The array is threaded
through the call chain and freed by callers after releasing xattr_sem.

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Address AI review feedback on v2:
  - Check ext4_expand_inode_array() return value; fallback to
    direct iput() on ENOMEM to prevent inode leak.
  - Make ext4_xattr_set_handle() take an optional ea_inode_array
    output parameter so callers can free after ext4_journal_stop(),
    avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
  - Pass ea_inode_array directly to ext4_xattr_release_block()
    instead of using a local array freed under xattr_sem.
  - Move ext4_xattr_inode_array_free() after ext4_journal_stop()

v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
	freed after xattr_sem is released. Fixes the root cause.

v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
	on inodes being deleted. Only fixes the syzbot reproducer, not
	the underlying lock ordering violation.
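
The deferral idea in the changelog above can be illustrated outside the kernel. This is a minimal userspace sketch, assuming a plain refcounted object in place of an inode; struct obj, struct deferred_array, deferred_add(), deferred_free() and release_under_lock() are hypothetical names, not the ext4_xattr_inode_array API. It shows the two properties the v3 notes call out: the final put happens only when the array is freed (after the lock is dropped), and an allocation failure falls back to a direct put so the reference is not leaked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode reference; not a kernel type. */
struct obj {
	int refcount;
	int released;	/* set when the last reference is dropped */
};

/* Hypothetical growable array of references whose release is deferred. */
struct deferred_array {
	size_t count;
	size_t capacity;
	struct obj **items;
};

/* Drop one reference; the final put is the step that may take other locks. */
static void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->released = 1;
}

/*
 * Queue o for a later put.  Returns 0 on success, or -1 if memory could not
 * be allocated, in which case the caller still owns the reference.
 */
static int deferred_add(struct deferred_array **arrp, struct obj *o)
{
	struct deferred_array *arr = *arrp;

	if (!arr) {
		arr = calloc(1, sizeof(*arr));
		if (!arr)
			return -1;
		*arrp = arr;
	}
	if (arr->count == arr->capacity) {
		size_t cap = arr->capacity ? arr->capacity * 2 : 4;
		struct obj **items = realloc(arr->items, cap * sizeof(*items));

		if (!items)
			return -1;
		arr->items = items;
		arr->capacity = cap;
	}
	arr->items[arr->count++] = o;
	return 0;
}

/* Put everything queued; call only after the lock has been dropped. */
static void deferred_free(struct deferred_array *arr)
{
	if (!arr)
		return;
	for (size_t i = 0; i < arr->count; i++)
		obj_put(arr->items[i]);
	free(arr->items);
	free(arr);
}

/* Called with the lock held: defer the put, or put directly on failure. */
static void release_under_lock(struct deferred_array **arrp, struct obj *o)
{
	if (deferred_add(arrp, o))
		obj_put(o);	/* allocation failed: do not leak the reference */
}
```

A caller would invoke release_under_lock() inside the critical section and deferred_free() once the lock is released. The direct-put fallback reintroduces the original ordering risk on that rare path, which is the trade-off the v3 notes accept in order to avoid a leak.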

 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 ++--
 fs/ext4/inode.c          | 13 ++++++----
 fs/ext4/xattr.c          | 51 ++++++++++++++++++++++++++--------------
 fs/ext4/xattr.h          |  6 +++--
 fs/ext4/xattr_security.c |  3 ++-
 6 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 	}
 
 	error = ext4_xattr_set_handle(handle, inode, name_index, "",
-				      value, size, xattr_flags);
+				      value, size, xattr_flags, NULL);
 
 	kfree(value);
 	if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		res = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_ENCRYPTION,
 					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-					    ctx, len, XATTR_CREATE);
+					    ctx, len, XATTR_CREATE, NULL);
 		if (!res) {
 			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 			ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 
 	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
 				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-				    ctx, len, 0);
+				    ctx, len, 0, NULL);
 	if (!res) {
 		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 		/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..2cf68d27e896 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6496,8 +6498,9 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand, &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6509,6 +6512,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6534,7 +6538,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand, &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6543,6 +6547,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..fa9a16c86fd8 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+					iput(ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2190,12 +2192,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
 	}
 	error = 0;
 
@@ -2211,7 +2210,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+			iput(ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2356,7 +2356,7 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
 int
 ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		      const char *name, const void *value, size_t value_len,
-		      int flags)
+		      int flags, struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_info i = {
 		.name_index = name_index,
@@ -2371,6 +2371,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	struct ext4_xattr_block_find bs = {
 		.s = { .not_found = -ENODATA, },
 	};
+	struct ext4_xattr_inode_array *local_array = NULL;
 	int no_expand;
 	int error;
 
@@ -2379,6 +2380,9 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	if (strlen(name) > 255)
 		return -ERANGE;
 
+	if (!ea_inode_array)
+		ea_inode_array = &local_array;
+
 	ext4_write_lock_xattr(inode, &no_expand);
 
 	/* Check journal credits under write lock. */
@@ -2438,7 +2442,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (!is.s.not_found)
 			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2460,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2464,7 +2470,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2510,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	brelse(is.iloc.bh);
 	brelse(bs.bh);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(local_array);
 	return error;
 }
 
@@ -2547,6 +2555,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 {
 	handle_t *handle;
 	struct super_block *sb = inode->i_sb;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int error, retries = 0;
 	int credits;
 
@@ -2567,10 +2576,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		int error2;
 
 		error = ext4_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, flags,
+					      &ea_inode_array);
 		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
 					handle);
 		error2 = ext4_journal_stop(handle);
+		ext4_xattr_inode_array_free(ea_inode_array);
+		ea_inode_array = NULL;
 		if (error == -ENOSPC &&
 		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
@@ -2612,7 +2624,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2689,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
@@ -2702,7 +2715,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2766,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2769,7 +2783,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2856,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..9c3f1a96895d 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,8 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
 
 extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+		const void *, size_t, int, struct ext4_xattr_inode_array **);
 extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
 				  bool is_create, int *credits);
 extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +193,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
 		err = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_SECURITY,
 					    xattr->name, xattr->value,
-					    xattr->value_len, XATTR_CREATE);
+					    xattr->value_len, XATTR_CREATE,
+					    NULL);
 		if (err < 0)
 			break;
 	}
-- 
2.43.0



* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Jan Kara @ 2026-06-12 12:55 UTC
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-3-libaokun@linux.alibaba.com>

On Fri 12-06-26 00:34:41, Baokun Li wrote:
> For unaligned DIO writes, the previous ext4_overwrite_io() required the
> entire range to fall within a single written extent.  This was overly
> conservative: the DIO layer only performs partial block zeroing for the
> head and tail blocks when they are partially covered by the write.
> Middle blocks that are fully covered are written as whole blocks
> without any zeroing, so they are safe regardless of extent state.
> 
> Therefore exclusive lock is only required when partial block zeroing
> will actually happen:
>  - The head partial block (if any) lands on a hole or unwritten extent.
>  - The tail partial block (if any) lands on a hole or unwritten extent.
> 
> Middle full-cover blocks can be in any state (hole, unwritten, or
> written) - block allocation under shared lock is safe per the previous
> patch's analysis (inode_dio_begin + i_data_sem protection).
> 
> Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
> directly answers the question driving the lock decision.  It uses at
> most two ext4_map_blocks() calls: one for the head partial block (also
> catching the case where it spans through the tail), and one for the
> tail partial block if not already covered.
> 
> This enables shared lock for previously-rejected scenarios such as:
>  - Unaligned write spanning written extent + mid-range hole + written
>    extent at the tail.
>  - Unaligned write where the partial blocks land on written extents but
>    the middle has unwritten extents.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
> Each stripe is laid out as [written][unwritten][unwritten][written],
> so the head and tail partial blocks land on written extents but the
> middle is unwritten.  Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      15,547       17,381      1.12x
>      2      15,910       34,172      2.15x
>      4      15,014       57,567      3.83x
>      8      15,022       81,947      5.46x
>     16      14,586       99,126      6.80x
>     32      14,047       92,519      6.59x
> 
> Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 73 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6f3886465ce3..aa926e641739 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>  	return false;
>  }
>  
> -/* Is IO overwriting allocated or initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode,
> -			      loff_t pos, loff_t len, bool *unwritten)
> +/*
> + * Does an unaligned DIO write require partial block zeroing?
> + *
> + * Partial block zeroing is performed only for the head and tail blocks
> + * when they are partially covered by the write and the underlying extent
> + * is a hole or unwritten. Middle blocks (fully covered by the write)
> + * are written as whole blocks without zeroing.
> + *
> + * When zeroing is required, two concurrent unaligned DIO writes to the
> + * same partial block can race and corrupt each other's data, so the
> + * caller must take the exclusive i_rwsem and drain in-flight DIO. When
> + * zeroing is not required, shared lock is safe -- block allocation and
> + * unwritten conversion for middle blocks are protected by i_data_sem
> + * and inode_dio_begin().
> + */
> +static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
>  {
>  	struct ext4_map_blocks map;
>  	unsigned int blkbits = inode->i_blkbits;
> -	int err, blklen;
> +	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
> +	bool head_partial, tail_partial;
> +	ext4_lblk_t head_lblk, tail_lblk;
> +	int err;
>  
>  	if (pos + len > i_size_read(inode))
> -		return false;
> +		return true;
>  
> -	map.m_lblk = pos >> blkbits;
> -	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
> -	blklen = map.m_len;
> +	head_partial = (pos & blockmask) != 0;
> +	tail_partial = ((pos + len) & blockmask) != 0;
> +	head_lblk = pos >> blkbits;
> +	tail_lblk = (pos + len - 1) >> blkbits;
> +
> +	/* Check the head partial block. */
> +	if (head_partial) {
> +		map.m_lblk = head_lblk;
> +		map.m_len = tail_lblk - head_lblk + 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +		/* If this mapping already covers the tail block, we're done. */
> +		if (!tail_partial || map.m_lblk + err > tail_lblk)
> +			return false;
> +	}
>  
> -	err = ext4_map_blocks(NULL, inode, &map, 0);
> -	if (err != blklen)
> -		return false;
> -	/*
> -	 * 'err==len' means that all of the blocks have been preallocated,
> -	 * regardless of whether they have been initialized or not. We need to
> -	 * check m_flags to distinguish the unwritten extents.
> -	 */
> -	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
> -	return true;
> +	/* Check the tail partial block. */
> +	if (tail_partial) {
> +		map.m_lblk = tail_lblk;
> +		map.m_len = 1;
> +		err = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> +			return true;
> +	}
> +
> +	return false;
>  }
>  
>  static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
> @@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   *    i_data_sem serializes concurrent extent tree modifications.
>   *
>   * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> - *    only safe for pure written-extent overwrites. Unwritten extents or
> - *    holes require exclusive lock because concurrent partial block zeroing
> - *    in the DIO layer could corrupt data.
> + *    safe unless the DIO layer needs to perform partial block zeroing --
> + *    i.e. the head or tail partial block sits on a hole or unwritten
> + *    extent. In that case upgrade to the exclusive lock and drain
> + *    in-flight DIO to avoid races with concurrent partial block zeroing.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite = true, unaligned_io, unwritten = false;
> +	bool needs_zeroing = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	offset = iocb->ki_pos;
>  	count = ret;
>  
> -	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
>  
>  	/*
> -	 * For unaligned writes we need to know the extent state to determine
> -	 * whether shared lock is safe. For aligned writes we skip this check
> -	 * entirely since allocation under shared lock is safe.
> +	 * For unaligned writes, check whether partial block zeroing will be
> +	 * needed. If so, exclusive lock is required to serialize against
> +	 * concurrent DIO that could race with the zeroing.
> +	 *
> +	 * For aligned writes we skip this check entirely since allocation
> +	 * under shared lock is safe.
>  	 */
> -	if (unaligned_io)
> -		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +	if (ext4_unaligned_io(inode, from, offset))
> +		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
>  
>  	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend ||
> -	     (unaligned_io && (!overwrite || unwritten))))) {
> +	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	/*
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
> -	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is unaligned non-overwrite or extending, so drain all
> -	 * outstanding dio and set the force wait dio flag.
> +	 * behavior already. When holding the exclusive lock for a write that
> +	 * needs partial block zeroing or is extending the file, we must wait
> +	 * for the I/O to complete synchronously:
> +	 *
> +	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
> +	 *    our partial block zeroing, and force synchronous completion so we
> +	 *    don't leave in-flight zeroing bios for the next writer to drain.
> +	 *
> +	 *  - extend: the caller must update i_disksize after I/O completion,
> +	 *    which requires the data to be on disk first.
>  	 */
> -	if (!*ilock_shared && (unaligned_io || *extend)) {
> +	if (!*ilock_shared && (needs_zeroing || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
>  		}
> -		if (unaligned_io && (!overwrite || unwritten))
> +		if (needs_zeroing)
>  			inode_dio_wait(inode);
>  		*dio_flags = IOMAP_DIO_FORCE_WAIT;
>  	}
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
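
The head/tail arithmetic in the quoted patch can be checked with a small userspace model. This is a sketch under stated assumptions, not the kernel function: it replaces ext4_map_blocks() with a per-block state array, treats the array length as the file size, and uses the hypothetical names blk_state and toy_dio_needs_zeroing(). Only the head and tail blocks are inspected, and only when the write covers them partially.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block extent state for a toy file. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/*
 * Toy model of the head/tail check.  states[i] describes logical block i,
 * nblocks is the file size in blocks, blkbits is log2(block size), and len
 * must be non-zero.  Returns true when the write would need partial block
 * zeroing, or reaches past the modelled file size.
 */
static bool toy_dio_needs_zeroing(const enum blk_state *states, size_t nblocks,
				  uint64_t pos, uint64_t len,
				  unsigned int blkbits)
{
	uint64_t blockmask = ((uint64_t)1 << blkbits) - 1;
	uint64_t file_size = (uint64_t)nblocks << blkbits;
	bool head_partial, tail_partial;
	uint64_t head_lblk, tail_lblk;

	assert(len > 0);

	/* Mirrors the patch: a range past the file size is treated as unsafe. */
	if (pos + len > file_size)
		return true;

	head_partial = (pos & blockmask) != 0;
	tail_partial = ((pos + len) & blockmask) != 0;
	head_lblk = pos >> blkbits;
	tail_lblk = (pos + len - 1) >> blkbits;

	/* A partially covered block is safe only if it is already written. */
	if (head_partial && states[head_lblk] != BLK_WRITTEN)
		return true;
	if (tail_partial && states[tail_lblk] != BLK_WRITTEN)
		return true;

	/* Fully covered middle blocks are never inspected. */
	return false;
}
```

With 4 KiB blocks (an assumption; the commit message does not state the block size), the benchmark layout [written][unwritten][unwritten][written] and a 14336-byte write at offset 512 give head block 0 and tail block 3, both written, so the model reports that no zeroing is needed even though the middle blocks are unwritten.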


* Re: [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Jan Kara @ 2026-06-12 12:46 UTC
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-2-libaokun@linux.alibaba.com>

On Fri 12-06-26 00:34:40, Baokun Li wrote:
> Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
> determine if a write is a pure overwrite, and upgrades to exclusive
> i_rwsem if not. However, ext4_overwrite_io() uses a single
> ext4_map_blocks() call which only returns the first contiguous extent of
> the same type. A write spanning multiple pre-allocated extents (e.g.
> written + unwritten, or two physically discontiguous written extents)
> produces a false negative, forcing an unnecessary exclusive lock upgrade.
> 
> After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
> writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
> useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
> accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
> journal transaction. The iomap iteration naturally handles multi-extent
> ranges: each call returns the mapping for the current segment, and
> unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
> This means the common case of mixed written/unwritten extents never
> reaches ext4_iomap_alloc() at all.
> 
> Even for the less common case where the range contains a hole and
> ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
> aligned non-extending writes:
> 
>  - truncate/punch_hole are kept out: they require exclusive i_rwsem
>    (blocked by our shared lock during allocation), and inode_dio_begin()
>    keeps their inode_dio_wait() blocked until in-flight bios complete.
>  - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
>    extent tree modifications (parallel writers to the same hole).
>  - The journal handle is per-thread and does not require i_rwsem
>    exclusion.
>  - i_disksize and orphan list are not involved in non-extending writes.
> 
> Skip the ext4_overwrite_io() check entirely for aligned writes by
> initializing overwrite to true and only calling ext4_overwrite_io() for
> unaligned writes. Unaligned writes still need the extent state check
> because concurrent partial block zeroing in the DIO layer requires
> exclusive serialization unless the range is a pure written-extent
> overwrite.
> 
> Performance:
> 
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
> 
> Aligned 8K DIO writes spanning written+unwritten extent boundaries.
> Each thread writes its own 1G region sequentially; the file is rebuilt
> between runs so every block is written exactly once. Metric: IOPS.
> 
>   JOBS      Before        After    speedup
>   ----    --------    ---------    -------
>      1      42,322       43,329      1.02x
>      2      68,516       70,677      1.03x
>      4      62,489       97,072      1.55x
>      8      58,701      110,819      1.89x
>     16      58,569      116,392      1.99x
>     32      60,860      117,244      1.93x
> 
> Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
> 
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Uh, that's a significant change. I have to say I feel slightly uneasy :). But
I don't see a hole in your justification and the patch looks good. Nice
find and feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
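
The layered decision that this patch documents in the comment above ext4_dio_write_checks() can be written as a small pure function. This is a minimal userspace sketch of the four steps as described for patch 1/2 only (before patch 2/2 replaces the overwrite test); the struct, enum and function names are hypothetical, and the booleans stand in for checks the kernel performs itself.

```c
#include <stdbool.h>

enum toy_lock { TOY_LOCK_SHARED, TOY_LOCK_EXCLUSIVE };

/* Hypothetical summary of the facts the decision depends on. */
struct toy_dio_write {
	bool needs_sec_update;	/* security info must be updated (!IS_NOSEC) */
	bool extending;		/* write extends i_size or i_disksize */
	bool unaligned;		/* write is not block aligned */
	bool written_overwrite;	/* whole range lies in one written extent */
};

/* Evaluate the four steps in the order the comment lists them. */
static enum toy_lock toy_dio_lock_mode(const struct toy_dio_write *w)
{
	/* 1. A security update requires the exclusive lock. */
	if (w->needs_sec_update)
		return TOY_LOCK_EXCLUSIVE;

	/* 2. Extending writes update i_disksize and the orphan list. */
	if (w->extending)
		return TOY_LOCK_EXCLUSIVE;

	/* 3. Aligned, non-extending: shared, whatever the extent state. */
	if (!w->unaligned)
		return TOY_LOCK_SHARED;

	/* 4. Unaligned, non-extending: shared only for written overwrites. */
	return w->written_overwrite ? TOY_LOCK_SHARED : TOY_LOCK_EXCLUSIVE;
}
```

The ordering matters: the extent state is only consulted in step 4, which is why aligned writes over holes or unwritten extents no longer force an upgrade.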

> ---
>  fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..6f3886465ce3 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>   * condition requires an exclusive inode lock. If yes, then we restart the
>   * whole operation by releasing the shared lock and acquiring exclusive lock.
>   *
> - * - For unaligned_io we never take shared lock as it may cause data corruption
> - *   when two unaligned IO tries to modify the same block e.g. while zeroing.
> + * The decision is layered, evaluated in this order:
>   *
> - * - For extending writes case we don't take the shared lock, since it requires
> - *   updating inode i_disksize and/or orphan handling with exclusive lock.
> + * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
> + *    to the exclusive lock -- the security update itself requires it,
> + *    regardless of whether the write extends the file or is aligned.
>   *
> - * - shared locking will only be true mostly with overwrites, including
> - *   initialized blocks and unwritten blocks.
> + * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
> + *    lock to safely update i_disksize and the orphan list, regardless of
> + *    alignment.
>   *
> - * - Otherwise we will switch to exclusive i_rwsem lock.
> + * 3. Otherwise, for aligned non-extending writes, shared lock is always
> + *    sufficient regardless of extent state (written, unwritten, or hole).
> + *    truncate/punch_hole cannot run while we hold the shared i_rwsem
> + *    (they need it exclusively); after we release it, inode_dio_begin()
> + *    keeps their inode_dio_wait() blocked until in-flight bios complete.
> + *    i_data_sem serializes concurrent extent tree modifications.
> + *
> + * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> + *    only safe for pure written-extent overwrites. Unwritten extents or
> + *    holes require exclusive lock because concurrent partial block zeroing
> + *    in the DIO layer could corrupt data.
>   */
>  static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  				     bool *ilock_shared, bool *extend,
> @@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	loff_t offset;
>  	size_t count;
>  	ssize_t ret;
> -	bool overwrite, unaligned_io, unwritten;
> +	bool overwrite = true, unaligned_io, unwritten = false;
>  
>  restart:
>  	ret = ext4_generic_write_checks(iocb, from);
> @@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  
>  	unaligned_io = ext4_unaligned_io(inode, from, offset);
>  	*extend = ext4_extending_io(inode, offset, count);
> -	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
>  
>  	/*
> -	 * Determine whether we need to upgrade to an exclusive lock. This is
> -	 * required to change security info in file_modified(), for extending
> -	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
> -	 * extents (as partial block zeroing may be required).
> -	 *
> -	 * Note that unaligned writes are allowed under shared lock so long as
> -	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
> -	 * data corruption due to partial block zeroing in the dio layer, and so
> -	 * the I/O must occur exclusively.
> +	 * For unaligned writes we need to know the extent state to determine
> +	 * whether shared lock is safe. For aligned writes we skip this check
> +	 * entirely since allocation under shared lock is safe.
>  	 */
> +	if (unaligned_io)
> +		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +
> +	/* Determine whether we need to upgrade to an exclusive lock. */
>  	if (*ilock_shared &&
> -	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
> -	     (unaligned_io && unwritten)))) {
> +	    ((!IS_NOSEC(inode) || *extend ||
> +	     (unaligned_io && (!overwrite || unwritten))))) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			ret = -EAGAIN;
>  			goto out;
> @@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>  	 * Now that locking is settled, determine dio flags and exclusivity
>  	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
>  	 * behavior already. The inode lock is already held exclusive if the
> -	 * write is non-overwrite or extending, so drain all outstanding dio and
> -	 * set the force wait dio flag.
> +	 * write is unaligned non-overwrite or extending, so drain all
> +	 * outstanding dio and set the force wait dio flag.
>  	 */
>  	if (!*ilock_shared && (unaligned_io || *extend)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT) {
> -- 
> 2.43.7
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH v2] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12  9:58 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>

ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem.  If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
    ext4_xattr_block_set()
      iput(ea_inode)
        write_inode_now()
          ext4_writepages()
            s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism.  The array is threaded
through the call chain and freed by callers after releasing xattr_sem.
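
The deferral idea described above can be illustrated with a small userspace
sketch. This is not the kernel's ext4_xattr_inode_array code: the names
struct obj, struct deferred_array, obj_put(), deferred_add() and
deferred_free() are hypothetical stand-ins, and the refcount is a plain int
with no locking. It only shows the shape of the pattern: while the lock is
held, objects are queued instead of being put, and the final puts run after
the lock has been dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode with a reference count. */
struct obj {
	int refcount;
	int released;	/* set when the last reference is dropped */
};

/* Hypothetical growable array of objects whose put is deferred. */
struct deferred_array {
	size_t count;
	size_t cap;
	struct obj **items;
};

/* Drop one reference; the last put does the (possibly blocking) teardown. */
void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		o->released = 1;
}

/* Queue an object instead of calling obj_put() while a lock is held. */
int deferred_add(struct deferred_array *a, struct obj *o)
{
	if (a->count == a->cap) {
		size_t ncap = a->cap ? a->cap * 2 : 4;
		struct obj **n = realloc(a->items, ncap * sizeof(*n));

		if (!n)
			return -1;
		a->items = n;
		a->cap = ncap;
	}
	a->items[a->count++] = o;
	return 0;
}

/* Called after the lock is released: the final puts are now safe. */
void deferred_free(struct deferred_array *a)
{
	for (size_t i = 0; i < a->count; i++)
		obj_put(a->items[i]);
	free(a->items);
	a->items = NULL;
	a->count = a->cap = 0;
}
```

A caller following this pattern would take its lock, call deferred_add()
wherever it would otherwise have dropped a reference, release the lock, and
only then call deferred_free().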

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
	freed after xattr_sem is released. Fixes the root cause.

v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
	on inodes being deleted. Only fixes the syzbot reproducer, not
	the underlying lock ordering violation.

 fs/ext4/inode.c | 15 +++++++++++----
 fs/ext4/xattr.c | 40 +++++++++++++++++++++++++---------------
 fs/ext4/xattr.h |  3 ++-
 3 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..c6448a9eb1e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6496,8 +6498,10 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6509,6 +6513,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6534,7 +6539,8 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6542,6 +6548,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	ext4_journal_stop(handle);
 	return error;
 }
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..bf8424927383 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_expand_inode_array(ea_inode_array,
+							ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2190,12 +2192,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
+		struct ext4_xattr_inode_array *old_ea_inode_array = NULL;
 
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 &old_ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
+		ext4_xattr_inode_array_free(old_ea_inode_array);
 	}
 	error = 0;
 
@@ -2211,7 +2213,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_expand_inode_array(ea_inode_array, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2371,6 +2373,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	struct ext4_xattr_block_find bs = {
 		.s = { .not_found = -ENODATA, },
 	};
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int no_expand;
 	int error;
 
@@ -2438,7 +2441,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (!is.s.not_found)
 			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2459,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2464,7 +2469,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     &ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2509,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 	brelse(is.iloc.bh);
 	brelse(bs.bh);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
@@ -2612,7 +2619,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2684,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
@@ -2702,7 +2710,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2761,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2769,7 +2778,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2851,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..02a172515193 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -192,7 +192,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
-- 
2.43.0



* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-12  5:28 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, Keith Busch, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <airX6BmMQ14Rvjcb@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > It's entirely possible a device supports byte aligned addresses. The
> > > block layer just doesn't let a driver report that. So either it really
> > > was successful because you found a bug that skips the alignment checks,
> > > or your device silently corrupted your payload.
> 
> I tried this on different hardware, I find it hard to say all those
> devices were corrupting the payload.

I think in the other thread we agreed that we are currently missing
the alignment check for fast-path bios not hitting the splitting code,
so maybe that is something you see.  Additionally we're missing the
checks for purely bio based drivers not calling the splitting helper
at all, but I don't think that applies here.

> > > > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > > though, in not taking the optimization when it was possible. So here's
> > > an alternative suggestion that should get things working as expected:
> > 
> > The fix below looks like it is addressing a real bug.  I'm not sure if
> > Carlos is hitting it, but we were missing the alignment checks for
> > single-bvec fast path bios so far indeed.
> 
> You left context out so I'm assuming by the fix you meant Keith's patch.

Yes.



* [PATCH v3] ext4: fix circular lock dependency in ext4_ext_migrate
From: Yun Zhou @ 2026-06-12  0:53 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260609084007.3432061-1-yun.zhou@windriver.com>

Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
circular lock dependency between s_writepages_rwsem and sb_internal
(freeze protection).

The deadlock scenario:

  CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
  ----                           ----
  ext4_ext_migrate()
    ext4_writepages_down_write()
      s_writepages_rwsem (write)
                                 ext4_evict_inode()
                                   sb_start_intwrite()   [sb_internal]
                                   ...
                                     ext4_writepages()
                                       s_writepages_rwsem (read) [BLOCKED]
    iput(tmp_inode)
      ext4_evict_inode()
        sb_start_intwrite()         [BLOCKED]

The tmp_inode is a temporary inode with nlink=0 created solely for
building the extent tree.  Its eviction does not require
s_writepages_rwsem protection, so deferring iput() until after
releasing the rwsem is safe.

Reported-by: syzbot+212e8f62790f8e0bc63b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
v3: fix the Reported-by and Closes tags.

v2: remove redundant null pointer check for iput(tmp_inode).

 fs/ext4/migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 477d43d7e294..5d60ef10fe11 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -464,6 +464,7 @@ int ext4_ext_migrate(struct inode *inode)
 	if (IS_ERR(tmp_inode)) {
 		retval = PTR_ERR(tmp_inode);
 		ext4_journal_stop(handle);
+		tmp_inode = NULL;
 		goto out_unlock;
 	}
 	/*
@@ -591,9 +592,9 @@ int ext4_ext_migrate(struct inode *inode)
 	ext4_journal_stop(handle);
 out_tmp_inode:
 	unlock_new_inode(tmp_inode);
-	iput(tmp_inode);
 out_unlock:
 	ext4_writepages_up_write(inode->i_sb, alloc_ctx);
+	iput(tmp_inode);
 	return retval;
 }
 
-- 
2.43.0



* [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>

Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.

After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.

Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:

 - truncate/punch_hole are kept out: they require exclusive i_rwsem
   (blocked by our shared lock during allocation), and inode_dio_begin()
   keeps their inode_dio_wait() blocked until in-flight bios complete.
 - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
   extent tree modifications (parallel writers to the same hole).
 - The journal handle is per-thread and does not require i_rwsem
   exclusion.
 - i_disksize and orphan list are not involved in non-extending writes.

Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
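
As a rough illustration of the change in the upgrade condition, the two
boolean expressions from the diff can be modelled in userspace. This is a
sketch, not kernel code: struct dio_write and the function names are
hypothetical, and the fields simply stand in for IS_NOSEC(inode), *extend,
unaligned_io, and the overwrite/unwritten results of ext4_overwrite_io().

```c
#include <assert.h>
#include <stdbool.h>

/* Inputs to the lock decision; a simplified stand-in for kernel state. */
struct dio_write {
	bool nosec;	/* IS_NOSEC(inode): no security update needed */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block aligned */
	bool overwrite;	/* what ext4_overwrite_io() would return */
	bool unwritten;	/* what ext4_overwrite_io() reports via *unwritten */
};

/* Upgrade condition before the patch. */
bool needs_exclusive_old(const struct dio_write *w)
{
	return !w->nosec || w->extend || !w->overwrite ||
	       (w->unaligned && w->unwritten);
}

/* Upgrade condition after the patch: extent state matters only if unaligned. */
bool needs_exclusive_new(const struct dio_write *w)
{
	return !w->nosec || w->extend ||
	       (w->unaligned && (!w->overwrite || w->unwritten));
}

/* Example: aligned, non-extending write across a written+unwritten boundary. */
void example_aligned_mixed_extents(void)
{
	struct dio_write w = {
		.nosec = true, .extend = false, .unaligned = false,
		/* the single-lookup check reports "not a pure overwrite" here */
		.overwrite = false, .unwritten = false,
	};

	assert(needs_exclusive_old(&w));	/* upgraded before the patch */
	assert(!needs_exclusive_new(&w));	/* stays shared after the patch */
}
```

The example function encodes the false-negative case described above: the
old expression forces the upgrade because overwrite is false, while the new
one ignores the extent state for an aligned write.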

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      42,322       43,329      1.02x
     2      68,516       70,677      1.03x
     4      62,489       97,072      1.55x
     8      58,701      110,819      1.89x
    16      58,569      116,392      1.99x
    32      60,860      117,244      1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..6f3886465ce3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * condition requires an exclusive inode lock. If yes, then we restart the
  * whole operation by releasing the shared lock and acquiring exclusive lock.
  *
- * - For unaligned_io we never take shared lock as it may cause data corruption
- *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
  *
- * - For extending writes case we don't take the shared lock, since it requires
- *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ *    to the exclusive lock -- the security update itself requires it,
+ *    regardless of whether the write extends the file or is aligned.
  *
- * - shared locking will only be true mostly with overwrites, including
- *   initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ *    lock to safely update i_disksize and the orphan list, regardless of
+ *    alignment.
  *
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ *    sufficient regardless of extent state (written, unwritten, or hole).
+ *    truncate/punch_hole cannot run while we hold the shared i_rwsem
+ *    (they need it exclusively); after we release it, inode_dio_begin()
+ *    keeps their inode_dio_wait() blocked until in-flight bios complete.
+ *    i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ *    only safe for pure written-extent overwrites. Unwritten extents or
+ *    holes require exclusive lock because concurrent partial block zeroing
+ *    in the DIO layer could corrupt data.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite, unaligned_io, unwritten;
+	bool overwrite = true, unaligned_io, unwritten = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 
 	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
-	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
 
 	/*
-	 * Determine whether we need to upgrade to an exclusive lock. This is
-	 * required to change security info in file_modified(), for extending
-	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
-	 * extents (as partial block zeroing may be required).
-	 *
-	 * Note that unaligned writes are allowed under shared lock so long as
-	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
-	 * data corruption due to partial block zeroing in the dio layer, and so
-	 * the I/O must occur exclusively.
+	 * For unaligned writes we need to know the extent state to determine
+	 * whether shared lock is safe. For aligned writes we skip this check
+	 * entirely since allocation under shared lock is safe.
 	 */
+	if (unaligned_io)
+		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
-	     (unaligned_io && unwritten)))) {
+	    ((!IS_NOSEC(inode) || *extend ||
+	     (unaligned_io && (!overwrite || unwritten))))) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
 	 * behavior already. The inode lock is already held exclusive if the
-	 * write is non-overwrite or extending, so drain all outstanding dio and
-	 * set the force wait dio flag.
+	 * write is unaligned non-overwrite or extending, so drain all
+	 * outstanding dio and set the force wait dio flag.
 	 */
 	if (!*ilock_shared && (unaligned_io || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.43.7



* [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>

For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent.  This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.

Therefore exclusive lock is only required when partial block zeroing
will actually happen:
 - The head partial block (if any) lands on a hole or unwritten extent.
 - The tail partial block (if any) lands on a hole or unwritten extent.

Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).

Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision.  It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.

This enables shared lock for previously-rejected scenarios such as:
 - Unaligned write spanning written extent + mid-range hole + written
   extent at the tail.
 - Unaligned write where the partial blocks land on written extents but
   the middle has unwritten extents.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten.  Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      15,547       17,381      1.12x
     2      15,910       34,172      2.15x
     4      15,014       57,567      3.83x
     8      15,022       81,947      5.46x
    16      14,586       99,126      6.80x
    32      14,047       92,519      6.59x

Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 73 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f3886465ce3..aa926e641739 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
 	return false;
 }
 
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
-			      loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
-	int err, blklen;
+	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+	bool head_partial, tail_partial;
+	ext4_lblk_t head_lblk, tail_lblk;
+	int err;
 
 	if (pos + len > i_size_read(inode))
-		return false;
+		return true;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	head_partial = (pos & blockmask) != 0;
+	tail_partial = ((pos + len) & blockmask) != 0;
+	head_lblk = pos >> blkbits;
+	tail_lblk = (pos + len - 1) >> blkbits;
+
+	/* Check the head partial block. */
+	if (head_partial) {
+		map.m_lblk = head_lblk;
+		map.m_len = tail_lblk - head_lblk + 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+		/* If this mapping already covers the tail block, we're done. */
+		if (!tail_partial || map.m_lblk + err > tail_lblk)
+			return false;
+	}
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
-	return true;
+	/* Check the tail partial block. */
+	if (tail_partial) {
+		map.m_lblk = tail_lblk;
+		map.m_len = 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+	}
+
+	return false;
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  *    i_data_sem serializes concurrent extent tree modifications.
  *
  * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- *    only safe for pure written-extent overwrites. Unwritten extents or
- *    holes require exclusive lock because concurrent partial block zeroing
- *    in the DIO layer could corrupt data.
+ *    safe unless the DIO layer needs to perform partial block zeroing --
+ *    i.e. the head or tail partial block sits on a hole or unwritten
+ *    extent. In that case upgrade to the exclusive lock and drain
+ *    in-flight DIO to avoid races with concurrent partial block zeroing.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite = true, unaligned_io, unwritten = false;
+	bool needs_zeroing = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	offset = iocb->ki_pos;
 	count = ret;
 
-	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
 
 	/*
-	 * For unaligned writes we need to know the extent state to determine
-	 * whether shared lock is safe. For aligned writes we skip this check
-	 * entirely since allocation under shared lock is safe.
+	 * For unaligned writes, check whether partial block zeroing will be
+	 * needed. If so, exclusive lock is required to serialize against
+	 * concurrent DIO that could race with the zeroing.
+	 *
+	 * For aligned writes we skip this check entirely since allocation
+	 * under shared lock is safe.
 	 */
-	if (unaligned_io)
-		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+	if (ext4_unaligned_io(inode, from, offset))
+		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
 
 	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend ||
-	     (unaligned_io && (!overwrite || unwritten))))) {
+	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	/*
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
-	 * behavior already. The inode lock is already held exclusive if the
-	 * write is unaligned non-overwrite or extending, so drain all
-	 * outstanding dio and set the force wait dio flag.
+	 * behavior already. When holding the exclusive lock for a write that
+	 * needs partial block zeroing or is extending the file, we must wait
+	 * for the I/O to complete synchronously:
+	 *
+	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
+	 *    our partial block zeroing, and force synchronous completion so we
+	 *    don't leave in-flight zeroing bios for the next writer to drain.
+	 *
+	 *  - extend: the caller must update i_disksize after I/O completion,
+	 *    which requires the data to be on disk first.
 	 */
-	if (!*ilock_shared && (unaligned_io || *extend)) {
+	if (!*ilock_shared && (needs_zeroing || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
 		}
-		if (unaligned_io && (!overwrite || unwritten))
+		if (needs_zeroing)
 			inode_dio_wait(inode);
 		*dio_flags = IOMAP_DIO_FORCE_WAIT;
 	}
-- 
2.43.7



* [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

Hi all,

This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.

It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.

ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):

  1. Aligned writes spanning multiple already-allocated extents (e.g.
     written + unwritten, or two discontiguous written extents).

  2. Unaligned writes whose head/tail partial blocks land on written
     extents but the fully-covered middle blocks include hole or
     unwritten extents.

Patch 1 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.

Patch 2 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.
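
As a rough illustration of the question patch 2 asks, here is a minimal
userspace sketch. It is not the kernel code: enum blk_state and
model_needs_zeroing() are hypothetical names, the per-block state array
stands in for the ext4_map_blocks() lookups, and the i_size check done by
the real helper is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block extent state; stands in for an ext4_map_blocks() lookup. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/*
 * Userspace model of the lock decision: does a partially covered head or
 * tail block sit on a hole or unwritten extent?  state[i] describes
 * logical block i and nblocks bounds the array.
 */
static bool model_needs_zeroing(enum blk_state *state, size_t nblocks,
				uint64_t pos, uint64_t len,
				unsigned int blkbits)
{
	uint64_t blockmask = (1ULL << blkbits) - 1;
	uint64_t head_lblk, tail_lblk;

	assert(len > 0);
	head_lblk = pos >> blkbits;
	tail_lblk = (pos + len - 1) >> blkbits;
	assert(tail_lblk < nblocks);

	/* Head block is only partially covered and is not written. */
	if ((pos & blockmask) && state[head_lblk] != BLK_WRITTEN)
		return true;
	/* Tail block is only partially covered and is not written. */
	if (((pos + len) & blockmask) && state[tail_lblk] != BLK_WRITTEN)
		return true;
	/* Middle blocks are fully overwritten, so their state is ignored. */
	return false;
}
```

With 4K blocks (blkbits = 12) and a stripe laid out as
[written][unwritten][unwritten][written], a 14336-byte write at offset 512
touches blocks 0 and 3 partially, and the model returns false, i.e. the
shared lock would be enough. If block 0 were unwritten instead, it returns
true.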


Testing
=======

"kvm-xfstests -c ext4/all -g auto" passes with no new failures.


Performance
===========

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       42,322      43,329        43,087      1.02x
     2       68,516      70,677        66,958      1.03x
     4       62,489      97,072       101,468      1.62x
     8       58,701     110,819       113,679      1.94x
    16       58,569     116,392       115,272      1.97x
    32       60,860     117,244       119,621      1.97x

Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.

Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       15,547      15,975        17,381      1.12x
     2       15,910      14,808        34,172      2.15x
     4       15,014      14,828        57,567      3.83x
     8       15,022      14,648        81,947      5.46x
    16       14,586      14,262        99,126      6.80x
    32       14,047      13,809        92,519      6.59x

Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.

In test 2, patch 1 alone has no effect (slight noise) because patch 1
only touches the aligned write path. Patch 2 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.

Comments and questions are, as always, welcome.

Thanks,
Baokun

[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com


Baokun Li (2):
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing

 fs/ext4/file.c | 132 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 89 insertions(+), 43 deletions(-)

-- 
2.43.7



* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611133833.GA14645@lst.de>

On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > It's entirely possible a device supports byte aligned addresses. The
> > block layer just doesn't let a driver report that. So either it really
> > was successful because you found a bug that skips the alignment checks,
> > or your device silently corrupted your payload.

I tried this on different hardware, and I find it hard to believe that all
of those devices were corrupting the payload.

> > 
> > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > though, in not taking the optimization when it was possible. So here's
> > an alternative suggestion that should get things working as expected:
> 
> The fix below looks like it is addressing a real bug.  I'm not sure if
> Carlos is hitting it, but we were missing the alignment checks for
> single-bvec fast path bios so far indeed.

You left context out so I'm assuming by the fix you meant Keith's patch.
I can give it a spin and see if it fixes the behavior I'm talking
about. Give me some time as I have a bunch of stuff to do tonight so
likely I will only manage to try this tomorrow.


* [tytso-ext4:dev] BUILD SUCCESS c143957520c6c9b5cd72e0de8b52b814f0c576fe
From: kernel test robot @ 2026-06-11 14:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: c143957520c6c9b5cd72e0de8b52b814f0c576fe  ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT

elapsed time: 772m

configs tested: 195
configs skipped: 2

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-16.1.0
alpha                            allyesconfig    gcc-16.1.0
alpha                               defconfig    gcc-16.1.0
arc                              allmodconfig    clang-23
arc                              allmodconfig    gcc-16.1.0
arc                               allnoconfig    gcc-16.1.0
arc                              allyesconfig    clang-23
arc                              allyesconfig    gcc-16.1.0
arc                                 defconfig    gcc-16.1.0
arc                        nsim_700_defconfig    gcc-16.1.0
arc                            randconfig-001    gcc-14.3.0
arc                   randconfig-001-20260611    gcc-14.3.0
arc                            randconfig-002    gcc-14.3.0
arc                   randconfig-002-20260611    gcc-14.3.0
arm                               allnoconfig    clang-23
arm                               allnoconfig    gcc-16.1.0
arm                              allyesconfig    clang-23
arm                              allyesconfig    gcc-16.1.0
arm                                 defconfig    gcc-16.1.0
arm                          pxa910_defconfig    gcc-16.1.0
arm                            randconfig-001    gcc-14.3.0
arm                   randconfig-001-20260611    gcc-14.3.0
arm                            randconfig-002    gcc-14.3.0
arm                   randconfig-002-20260611    gcc-14.3.0
arm                            randconfig-003    gcc-14.3.0
arm                   randconfig-003-20260611    gcc-14.3.0
arm                            randconfig-004    gcc-14.3.0
arm                   randconfig-004-20260611    gcc-14.3.0
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-16.1.0
arm64                               defconfig    gcc-16.1.0
csky                             allmodconfig    gcc-16.1.0
csky                              allnoconfig    gcc-16.1.0
csky                                defconfig    gcc-16.1.0
hexagon                          allmodconfig    clang-23
hexagon                          allmodconfig    gcc-16.1.0
hexagon                           allnoconfig    clang-23
hexagon                           allnoconfig    gcc-16.1.0
hexagon                             defconfig    gcc-16.1.0
hexagon               randconfig-001-20260611    clang-16
hexagon               randconfig-002-20260611    clang-16
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                              allnoconfig    gcc-16.1.0
i386                             allyesconfig    gcc-14
i386                 buildonly-randconfig-001    clang-22
i386        buildonly-randconfig-001-20260611    clang-22
i386                 buildonly-randconfig-002    clang-22
i386        buildonly-randconfig-002-20260611    clang-22
i386                 buildonly-randconfig-003    clang-22
i386        buildonly-randconfig-003-20260611    clang-22
i386                 buildonly-randconfig-004    clang-22
i386        buildonly-randconfig-004-20260611    clang-22
i386                 buildonly-randconfig-005    clang-22
i386        buildonly-randconfig-005-20260611    clang-22
i386                 buildonly-randconfig-006    clang-22
i386        buildonly-randconfig-006-20260611    clang-22
i386                                defconfig    gcc-16.1.0
i386                  randconfig-001-20260611    gcc-14
i386                  randconfig-002-20260611    gcc-14
i386                  randconfig-003-20260611    gcc-14
i386                  randconfig-004-20260611    gcc-14
i386                  randconfig-005-20260611    gcc-14
i386                  randconfig-006-20260611    gcc-14
i386                  randconfig-007-20260611    gcc-14
i386                  randconfig-011-20260611    gcc-14
i386                  randconfig-012-20260611    gcc-14
i386                  randconfig-013-20260611    gcc-14
i386                  randconfig-014-20260611    gcc-14
i386                  randconfig-015-20260611    gcc-14
i386                  randconfig-016-20260611    gcc-14
i386                  randconfig-017-20260611    gcc-14
loongarch                        allmodconfig    clang-19
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    clang-20
loongarch                         allnoconfig    gcc-16.1.0
loongarch                           defconfig    clang-23
loongarch             randconfig-001-20260611    clang-16
loongarch             randconfig-002-20260611    clang-16
m68k                             allmodconfig    gcc-16.1.0
m68k                              allnoconfig    gcc-16.1.0
m68k                             allyesconfig    clang-23
m68k                             allyesconfig    gcc-16.1.0
m68k                                defconfig    clang-23
microblaze                        allnoconfig    gcc-16.1.0
microblaze                       allyesconfig    gcc-16.1.0
microblaze                          defconfig    clang-23
mips                             allmodconfig    gcc-16.1.0
mips                              allnoconfig    gcc-16.1.0
mips                             allyesconfig    gcc-16.1.0
nios2                            allmodconfig    clang-20
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    clang-23
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    clang-23
nios2                 randconfig-001-20260611    clang-16
nios2                 randconfig-002-20260611    clang-16
openrisc                         allmodconfig    clang-20
openrisc                         allmodconfig    gcc-16.1.0
openrisc                          allnoconfig    clang-23
openrisc                          allnoconfig    gcc-16.1.0
openrisc                            defconfig    gcc-16.1.0
parisc                           allmodconfig    gcc-16.1.0
parisc                            allnoconfig    clang-23
parisc                            allnoconfig    gcc-16.1.0
parisc                           allyesconfig    clang-23
parisc                           allyesconfig    gcc-16.1.0
parisc                              defconfig    gcc-16.1.0
parisc64                            defconfig    clang-23
powerpc                          allmodconfig    gcc-16.1.0
powerpc                           allnoconfig    clang-23
powerpc                           allnoconfig    gcc-16.1.0
powerpc                     tqm8540_defconfig    gcc-16.1.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                             allnoconfig    gcc-16.1.0
riscv                            allyesconfig    clang-23
riscv                               defconfig    gcc-16.1.0
riscv                 randconfig-001-20260611    gcc-12.5.0
riscv                 randconfig-002-20260611    gcc-12.5.0
s390                             allmodconfig    clang-23
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-16.1.0
s390                                defconfig    gcc-16.1.0
s390                  randconfig-001-20260611    gcc-12.5.0
s390                  randconfig-002-20260611    gcc-12.5.0
sh                               allmodconfig    gcc-16.1.0
sh                                allnoconfig    clang-23
sh                                allnoconfig    gcc-16.1.0
sh                               allyesconfig    clang-23
sh                               allyesconfig    gcc-16.1.0
sh                                  defconfig    gcc-14
sh                    randconfig-001-20260611    gcc-12.5.0
sh                    randconfig-002-20260611    gcc-12.5.0
sparc                             allnoconfig    clang-23
sparc                             allnoconfig    gcc-16.1.0
sparc                               defconfig    gcc-16.1.0
sparc                 randconfig-001-20260611    gcc-15.2.0
sparc                 randconfig-002-20260611    gcc-15.2.0
sparc64                          allmodconfig    clang-20
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260611    gcc-15.2.0
sparc64               randconfig-002-20260611    gcc-15.2.0
um                               allmodconfig    clang-23
um                                allnoconfig    clang-16
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                               allyesconfig    gcc-16.1.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260611    gcc-15.2.0
um                    randconfig-002-20260611    gcc-15.2.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-22
x86_64                            allnoconfig    clang-22
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-22
x86_64      buildonly-randconfig-001-20260611    gcc-14
x86_64      buildonly-randconfig-002-20260611    gcc-14
x86_64      buildonly-randconfig-003-20260611    gcc-14
x86_64      buildonly-randconfig-004-20260611    gcc-14
x86_64      buildonly-randconfig-005-20260611    gcc-14
x86_64      buildonly-randconfig-006-20260611    gcc-14
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-22
x86_64                randconfig-001-20260611    gcc-14
x86_64                randconfig-002-20260611    gcc-14
x86_64                randconfig-003-20260611    gcc-14
x86_64                randconfig-004-20260611    gcc-14
x86_64                randconfig-005-20260611    gcc-14
x86_64                randconfig-006-20260611    gcc-14
x86_64                randconfig-011-20260611    gcc-14
x86_64                randconfig-012-20260611    gcc-14
x86_64                randconfig-013-20260611    gcc-14
x86_64                randconfig-014-20260611    gcc-14
x86_64                randconfig-015-20260611    gcc-14
x86_64                randconfig-016-20260611    gcc-14
x86_64                randconfig-071-20260611    clang-22
x86_64                randconfig-072-20260611    clang-22
x86_64                randconfig-073-20260611    clang-22
x86_64                randconfig-074-20260611    clang-22
x86_64                randconfig-075-20260611    clang-22
x86_64                randconfig-076-20260611    clang-22
x86_64                               rhel-9.4    clang-22
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-22
x86_64                    rhel-9.4-kselftests    clang-22
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-22
xtensa                            allnoconfig    clang-23
xtensa                            allnoconfig    gcc-16.1.0
xtensa                           allyesconfig    clang-20
xtensa                randconfig-001-20260611    gcc-15.2.0
xtensa                randconfig-002-20260611    gcc-15.2.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Jan Kara @ 2026-06-11 14:00 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>

On Thu 11-06-26 20:45:55, Yun Zhou wrote:
> Expanding extra isize on an inode that is being evicted is pointless
> since the inode is about to be deleted.  Skip it by setting
> EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
> eviction path.
> 
> This also breaks a circular lock dependency reported by lockdep during
> orphan cleanup at mount time:
> 
>   CPU0 (writeback worker)            CPU1 (open)
>   ----                               ----
>   ext4_writepages()
>     s_writepages_rwsem (read)        ext4_create()
>     ext4_do_writepages()               __ext4_new_inode()
>       ext4_journal_start()               [holds jbd2 handle]
>         wait_transaction_locked()        ext4_xattr_set_handle()
>         [WAIT for jbd2_handle]             xattr_sem (write)
> 
>   CPU2 (mount / orphan cleanup)
>   ----
>   ext4_evict_inode()
>     __ext4_mark_inode_dirty()
>       ext4_try_to_expand_extra_isize()
>         xattr_sem (write)
>         ext4_expand_extra_isize_ea()
>           ext4_xattr_block_set()
>             iput(ea_inode)
>               write_inode_now()
>                 ext4_writepages()
>                   s_writepages_rwsem (read)
>                   [WAIT for s_writepages_rwsem -- if blocked by write lock holder]
> 
> This forms a circular dependency on lock classes:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> The iput() inside ext4_xattr_block_set() triggers write_inode_now()
> because SB_ACTIVE is not yet set during mount, so iput_final() cannot
> cache the inode in the LRU and must flush it synchronously.
> 
> Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
> from executing, which eliminates the xattr_sem --> s_writepages_rwsem
> edge and breaks the cycle.
> 
> Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Thanks for the patch! So I have no problem with setting EXT4_STATE_NO_EXPAND
in ext4_evict_inode() as you correctly point out expansion is pointless in
that case. But your patch actually doesn't fix the real problem, it only
deals with the particular syzbot reproducer. The real problem is that
ext4_xattr_block_set() which is run inside a transaction can end up
acquiring s_writepages_rwsem which violates the lock ordering rules. So
this is the problem that really needs to be fixed.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-11 13:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Carlos Maiolino, Christoph Hellwig, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <aiqwy5DfHI79KXuZ@kbusch-mbp>

On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> It's entirely possible a device supports byte aligned addresses. The
> block layer just doesn't let a driver report that. So either it really
> was successful because you found a bug that skips the alignment checks,
> or your device silently corrupted your payload.
> 
> Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> though, in not taking the optimization when it was possible. So here's
> an alternative suggestion that should get things working as expected:

The fix below looks like it is addressing a real bug.  I'm not sure if
Carlos is hitting it, but we were missing the alignment checks for
single-bvec fast path bios so far indeed.



* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Keith Busch @ 2026-06-11 12:57 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, brauner, linux-block, linux-fsdevel,
	linux-ext4, linux-xfs, Hannes Reinecke, Martin K. Petersen,
	Jens Axboe
In-Reply-To: <aiqBvF93P4NjfaDR@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 12:05:22PM +0200, Carlos Maiolino wrote:
> The passed-in address 0x1003af80001 is one byte misaligned and shouldn't
> (at least in theory) ever be accepted, no? Or am I missing something
> else?

It's entirely possible a device supports byte aligned addresses. The
block layer just doesn't let a driver report that. So either it really
was successful because you found a bug that skips the alignment checks,
or your device silently corrupted your payload.

Anyway, my earlier suggestion should work. Ming thinks it may go too far,
though, in not taking the optimization when it was possible. So here's
an alternative suggestion that should get things working as expected:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..4c31762d6fb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,6 +404,9 @@ static inline bool bio_may_need_split(struct bio *bio,
        bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
        if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
                return true;
+
+       if ((bv->bv_offset | bv->bv_len) & lim->dma_alignment)
+               return true;
        return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
 }
 
-- 
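
For readers less familiar with the mask idiom in the hunk above, here is a
small userspace sketch of it. struct seg and seg_misaligned() are
hypothetical stand-ins, not block layer API, and the 511 mask used in the
note below is only an assumed example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two bio_vec fields used by the check. */
struct seg {
	uint32_t offset;	/* byte offset of the data within its page */
	uint32_t len;		/* length in bytes */
};

/*
 * align_mask is "alignment - 1" for a power-of-two alignment, for
 * example 511 for 512-byte alignment.  ORing offset and length lets a
 * single AND test both values at once.
 */
static bool seg_misaligned(const struct seg *s, uint32_t align_mask)
{
	/* The mask must be of the form 2^n - 1. */
	assert((align_mask & (align_mask + 1)) == 0);
	return ((s->offset | s->len) & align_mask) != 0;
}
```

With an assumed mask of 511, a segment at page offset 1 with length 4096
(roughly what a buffer that is one byte off looks like) is flagged, while
offset 0 with length 4096 is not.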


* [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Yun Zhou @ 2026-06-11 12:45 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

Expanding extra isize on an inode that is being evicted is pointless
since the inode is about to be deleted.  Skip it by setting
EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
eviction path.

This also breaks a circular lock dependency reported by lockdep during
orphan cleanup at mount time:

  CPU0 (writeback worker)            CPU1 (open)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (mount / orphan cleanup)
  ----
  ext4_evict_inode()
    __ext4_mark_inode_dirty()
      ext4_try_to_expand_extra_isize()
        xattr_sem (write)
        ext4_expand_extra_isize_ea()
          ext4_xattr_block_set()
            iput(ea_inode)
              write_inode_now()
                ext4_writepages()
                  s_writepages_rwsem (read)
                  [WAIT for s_writepages_rwsem -- if blocked by write lock holder]

This forms a circular dependency on lock classes:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

The iput() inside ext4_xattr_block_set() triggers write_inode_now()
because SB_ACTIVE is not yet set during mount, so iput_final() cannot
cache the inode in the LRU and must flush it synchronously.

Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
from executing, which eliminates the xattr_sem --> s_writepages_rwsem
edge and breaks the cycle.
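
The cycle argument above can be sanity-checked with a tiny userspace model.
This is only a sketch: the three lock classes are the ones named in the
report, but the adjacency matrix and the has_cycle() and reaches() helpers
are hypothetical and have nothing to do with how lockdep is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock classes named in the report; the graph model itself is hypothetical. */
enum lock_class { S_WRITEPAGES_RWSEM, JBD2_HANDLE, XATTR_SEM, NR_CLASSES };

/* edge[a][b] means "b is acquired (or waited on) while a is held". */
static bool reaches(bool edge[NR_CLASSES][NR_CLASSES],
		    enum lock_class from, enum lock_class to,
		    bool visited[NR_CLASSES])
{
	assert(from < NR_CLASSES && to < NR_CLASSES);
	if (visited[from])
		return false;
	visited[from] = true;

	for (int next = 0; next < NR_CLASSES; next++) {
		if (!edge[from][next])
			continue;
		if (next == (int)to || reaches(edge, next, to, visited))
			return true;
	}
	return false;
}

/* A cycle exists if some class can reach itself through at least one edge. */
static bool has_cycle(bool edge[NR_CLASSES][NR_CLASSES])
{
	for (int c = 0; c < NR_CLASSES; c++) {
		bool visited[NR_CLASSES] = { false };

		if (reaches(edge, c, c, visited))
			return true;
	}
	return false;
}
```

Setting the three edges s_writepages_rwsem --> jbd2_handle,
jbd2_handle --> xattr_sem and xattr_sem --> s_writepages_rwsem makes
has_cycle() return true; clearing the last edge makes it return false.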

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..cbfd1d1282e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,12 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	/*
+	 * Skip extra isize expansion on inodes being deleted -- it is
+	 * pointless and can trigger a circular lock dependency:
+	 *   xattr_sem -> ext4_xattr_block_set -> iput -> s_writepages_rwsem
+	 */
+	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
-- 
2.43.0



* Re: [PATCH v2] jbd2: Remove special jbd2 slabs
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, Matthew Wilcox (Oracle)
  Cc: Theodore Ts'o, Jan Kara, linux-fsdevel,
	Mike Rapoport (Microsoft), Vlastimil Babka, Tal Zussman, Jan Kara
In-Reply-To: <20260528171413.1088143-1-willy@infradead.org>


On Thu, 28 May 2026 18:14:11 +0100, Matthew Wilcox (Oracle) wrote:
> When jbd2 was originally written, kmalloc() would not guarantee memory
> alignment for the requested objects.  Since commit 59bb47985c1d in 2019,
> kmalloc has guaranteed natural alignment for power-of-two allocations.
> We can now remove the jbd2 special slabs and just use kmalloc() directly.

Applied, thanks!

[1/1] jbd2: Remove special jbd2 slabs
      commit: bbe9015f23432bd4f5b8590eb178b3b5b7c29f02

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply
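
To make "natural alignment for power-of-two allocations" concrete, here is a
userspace sketch. It does not call kmalloc(); C11 aligned_alloc() is used as a
stand-in, and both helper names are hypothetical. The property being shown is
simply that a buffer of size N (a power of two) starts at an address that is a
multiple of N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static bool is_power_of_two(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Userspace stand-in for a naturally aligned allocation: the returned
 * address is a multiple of 'size'. Returns NULL for non-power-of-two
 * sizes, which this sketch does not handle.
 */
void *alloc_naturally_aligned(size_t size)
{
	if (!is_power_of_two(size))
		return NULL;
	return aligned_alloc(size, size);
}

/* True if 'p' is aligned to 'size', which must be a power of two. */
bool is_naturally_aligned(const void *p, size_t size)
{
	assert(is_power_of_two(size));
	return ((uintptr_t)p & (size - 1)) == 0;
}
```

The caller frees the buffer with free() as usual.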

* Re: [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, Andreas Dilger, Aditya Prakash Srivastava
  Cc: Theodore Ts'o, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-kernel,
	syzbot+0c89d865531d053abb2d
In-Reply-To: <20260608065227.3018-1-aditya.ansh182@gmail.com>


On Mon, 08 Jun 2026 06:52:27 +0000, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
> 
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix kernel BUG in ext4_write_inline_data_end
      commit: ad09aa45965d3fafaf9963bc78109b73c0f9ac8d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, yi.zhang, dmonakhov, Yun Zhou
  Cc: Theodore Ts'o, linux-kernel
In-Reply-To: <20260608152521.1292656-1-yun.zhou@windriver.com>


On Mon, 08 Jun 2026 23:25:21 +0800, Yun Zhou wrote:
> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> belong to the same superblock as the original file.  Currently, this
> validation is performed inside ext4_move_extents() by
> mext_check_validity(), but only after lock_two_nondirectories() has
> already acquired the inode locks.  When the donor fd refers to a file
> on a different filesystem (e.g., overlayfs), this late validation
> creates a circular lock dependency:
> 
> [...]

Applied, thanks!

[1/1] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
      commit: c143957520c6c9b5cd72e0de8b52b814f0c576fe

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH] ext4: Remove mention of PageWriteback
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, Matthew Wilcox (Oracle)
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-kernel
In-Reply-To: <20260526190805.341676-1-willy@infradead.org>


On Tue, 26 May 2026 20:08:02 +0100, Matthew Wilcox (Oracle) wrote:
> Update a comment to refer to the concept of writeback instead of the
> (now obsolete) detail of how it's implemented.

Applied, thanks!

[1/1] ext4: Remove mention of PageWriteback
      commit: 4e3a55f44b42c2aabd4c1cc3bdb6a01a7107121d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v2] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, yi.zhang, neil, brauner, jlayton, Hongling Zeng
  Cc: Theodore Ts'o, linux-kernel, zhongling0719
In-Reply-To: <20260604073647.211279-1-zenghongling@kylinos.cn>


On Thu, 04 Jun 2026 15:36:47 +0800, Hongling Zeng wrote:
> When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
> It should return NULL instead for success and ERR_PTR() only with
> negative error codes for failure.

Applied, thanks!

[1/1] ext4: Fix ERR_PTR(0) in ext4_mkdir()
      commit: 8e1c43af7cf5091d99db38b7c8129e394d7f45b5

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply
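
For readers less familiar with the error-pointer convention referred to above,
the following is a userspace sketch that mirrors the shape of the kernel's
ERR_PTR()/IS_ERR()/PTR_ERR() helpers. The lower-case names and
result_to_errno() are hypothetical and exist only for this example; this is
not the kernel header. It shows how a caller classifies "NULL on success,
ERR_PTR(-errno) on failure", and that a zero argument yields a pointer that is
not classified as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode an errno-style value in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Decode the error; only meaningful for pointers that satisfy is_err(). */
static inline long ptr_err(const void *ptr)
{
	assert(is_err(ptr));
	return (long)(intptr_t)ptr;
}

/* Hypothetical caller-side check: 0 on success, negative errno on failure. */
int result_to_errno(const void *ret)
{
	if (is_err(ret))
		return (int)ptr_err(ret);
	return 0;
}
```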

* Re: [PATCH v4] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
  To: Ext4 Developers List, Andreas Dilger, Aditya Prakash Srivastava
  Cc: Theodore Ts'o, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, sashiko-reviews, linux-kernel,
	syzbot+0c89d865531d053abb2d
In-Reply-To: <20260609062005.1702-1-aditya.ansh182@gmail.com>


On Tue, 09 Jun 2026 06:20:05 +0000, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
> 
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix kernel BUG in ext4_write_inline_data_end
      commit: ad09aa45965d3fafaf9963bc78109b73c0f9ac8d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan @ 2026-06-11 12:04 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
	linux-ext4, linux-kernel, lidiangang, p.raghav
In-Reply-To: <mmbe4kdeqg6zlblhysi27qno22dtkaahv7bzslaqopsg4k3qs7@nofv525nnl6c>

On 2026/6/11 17:36, Pankaj Raghav (Samsung) wrote:
>> +static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
>> +		struct bio *bio)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	ssize_t ret;
>> +
>> +	WRITE_ONCE(iocb->private, NULL);
>> +
>> +	ret = iomap_dio_simple_read_finish(iocb, bio,
>> +			blk_status_to_errno(bio->bi_status));
>> +
>> +	inode_dio_end(inode);
>> +	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> Shouldn't the second parameter here be
> blk_status_to_errno(bio->bi_status)?
>
> I think that will be more meaningful for tracing here.
> trace_iomap_dio_complete(iocb, blk_status_to_errno(bio->bi_status), ret);
Makes sense. I’ll update it in the next version.

>
> <snip>
>> +	return ret;
>> +}
>> +
>> +	sr->iocb = iocb;
>> +	sr->dio_flags = dio_flags;
>> +
>> +	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
>> +	bio->bi_ioprio = iocb->ki_ioprio;
>> +	bio->bi_private = sr;
>> +	bio->bi_end_io = iomap_dio_simple_read_end_io;
>> +
>> +	if (dio_flags & IOMAP_DIO_BOUNCE)
>> +		ret = bio_iov_iter_bounce(bio, iter, count);
>> +	else
>> +		ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
>> +	if (unlikely(ret))
>> +		goto out_bio_put;
>> +
>> +	if (bio->bi_iter.bi_size != count) {
>> +		iov_iter_revert(iter, bio->bi_iter.bi_size);
>> +		ret = -ENOTBLK;
>> +		goto out_bio_release_pages;
>> +	}
>> +
>> +	sr->size = bio->bi_iter.bi_size;
>> +
>> +	if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
>> +	    !(dio_flags & IOMAP_DIO_BOUNCE))
>> +		bio_set_pages_dirty(bio);
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>> +		bio->bi_opf |= REQ_NOWAIT;
>> +	if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
>> +		bio->bi_opf |= REQ_POLLED;
>> +		bio_set_polled(bio, iocb);
> This results in build failure as the following patch removed this call:
> https://lore.kernel.org/linux-block/20260518062917.506483-1-hch@lst.de/
>
> I think this call can just be removed as you are setting REQ_POLLED
> anyway.
You’re right. I’ll update that in the next version too.

Thanks.

>
>> +		WRITE_ONCE(iocb->private, bio);
>> +	}
>> +
>> +	if (wait_for_completion) {
>> +		sr->waiter = current;
>> +		blk_crypto_submit_bio(bio);
>> +	} else {
>> +		atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
>> +		sr->waiter = NULL;
>> +		blk_crypto_submit_bio(bio);
>> +		ret = -EIOCBQUEUED;
>> +	}
>> +
> --
> Pankaj

^ permalink raw reply

* Re: [PATCH v2 0/4] show orphan file inode detail info
From: yebin @ 2026-06-11 11:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4
In-Reply-To: <a5v57ie6feotxznmhrf3i22gzplw2ucotlnw3y7hmjhkalbb26@bx2lzoil75ks>



On 2026/6/9 19:13, Jan Kara wrote:
> On Mon 08-06-26 19:44:20, yebin wrote:
>> On 2026/4/16 1:59, Jan Kara wrote:
>>> On Wed 15-04-26 18:55:01, Ye Bin wrote:
>>>> From: Ye Bin <yebin10@huawei.com>
>>>>
>>>> Diffs v2 vs v1:
>>>> (1) Fix sashiko review issues:
>>>> https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
>>>> (2) Change "orphan_list" file mode from 0444 to 0400;
>>>> (3) The display format of the "orphan_list" file is modified according
>>>>       to Andreas' suggestions.
>>>> Fault injection tests have been conducted to address the issues raised
>>>> in the sashiko review. There is no UAF issue in the ext4_seq_orphan_release()
>>>> function. The reason for this has already been explained in the code comments.
>>>> In addition to the fault injection tests, we also performed a stress test by
>>>> observing the /proc/fs/ext4/XX/orphan_list and the concurrent processes of
>>>> adding and removing orphan nodes, and no issues were found so far.
>>>>
>>>>
>>>> In actual production environments, the issue of inconsistency between
>>>> df and du is frequently encountered. In many cases, the cause of the
>>>> problem can be identified through the use of lsof. However, when
>>>> overlayfs is combined with project quota configuration, the issue becomes
>>>> more complex and troublesome to diagnose. First, to determine the project
>>>> ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
>>>> then retrieve file information through `debugfs`. However, the file names
>>>> cannot always be obtained, and it is often unclear which files they are.
>>>> To identify which files these are, one would need to use crash for online
>>>> debugging or use kprobe to gather information incrementally. However, some
>>>> customers in production environments do not agree to upload any tools, and
>>>> online debugging might impact the business. There are also scenarios where
>>>> files are opened in kernel mode, which do not generate file descriptors (fds),
>>>> making it impossible to identify which files were deleted but still have
>>>> references through lsof. This patchset adds a procfs interface to query
>>>> information about orphaned nodes, which can assist in the analysis and
>>>> localization of such issues.
>>>
>>> Ye, did you read my comments to the v1 of the patchset [1]? I didn't see
>>> any reply from you. I don't think this is a good way to expose orphan
>>> information for a filesystem for reasons I've outlined in that email.
>>>
>>
>> Hi Jan
>>
>> I thought about how to prevent resource exhaustion caused by making too many
>> FDs in a single application. My idea is that IOCTL should only obtain one FD
>> at a time, and the next time it should start obtaining orphan nodes from the
>> inode after the previous one. Each time an fd is obtained, the previous fd
>> should be closed. I expect that after traversing all the fds from the beginning,
>> they will all be closed and there will be no need for user space to close them
>> manually. I wonder if this approach is feasible? Or do you have any good
>> suggestions?
>
> Hum, I think you've misunderstood my suggestion in [1]. What I suggested
> is:
>
> 1) Provide ioctl GET_ORPHAN_FILES that will return one "virtual" fd that
> tracks state of iteration over orphan entries of a superblock
>
> 2) Reading from this fd will return file *handles* (as struct
> file_handle) describing the orphan inodes. A struct file_handle occupies
> no kernel resources. It is essentially just a filesystem-agnostic
> container for the inode number and inode generation number. Userspace can
> then use the open_by_handle_at() syscall to convert a struct file_handle
> into a normal file descriptor, but that is up to userspace and what it
> wants the orphan information for.
>
> Is the design clearer now?
>
Thank you for your patient explanation. I have implemented it according to
your suggestion and am currently testing it locally. After the testing is
complete, I will release it. I hope I have not misunderstood your meaning
this time.
> 								Honza
>
> [1] https://lore.kernel.org/all/n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq/
>


^ permalink raw reply
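
As an illustration of the "read handles from one fd" idea discussed above,
here is a sketch of how userspace might walk a buffer of handle records. The
on-the-wire layout is an assumption made for this example (records packed back
to back, each a fixed header followed by handle_bytes of opaque payload); the
thread does not specify one. struct handle_hdr is a hypothetical local mirror
of the fixed part of struct file_handle, and count_handle_records() is a
hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the fixed header of struct file_handle. */
struct handle_hdr {
	unsigned int handle_bytes;	/* size of the opaque payload that follows */
	int handle_type;		/* filesystem-specific handle type */
};

/*
 * Count the complete records in buf[0..len). Returns -1 if the buffer
 * ends in the middle of a header or payload. No kernel state is
 * involved: the records are plain bytes until userspace chooses to
 * open one of them.
 */
long count_handle_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	long count = 0;

	assert(buf != NULL || len == 0);

	while (off < len) {
		struct handle_hdr hdr;

		if (len - off < sizeof(hdr))
			return -1;	/* truncated header */
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);

		if (hdr.handle_bytes > len - off)
			return -1;	/* truncated payload */
		off += hdr.handle_bytes;
		count++;
	}
	return count;
}
```

Turning a record into a file descriptor would then be a separate, optional
step for the caller, which is the point of the design: iterating costs nothing
beyond the buffer itself.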

* [syzbot ci] Re: Data in direntry (dirdata) feature
From: syzbot ci @ 2026-06-11 10:29 UTC (permalink / raw)
  To: adilger.kernel, adilger, adilger, artem.blagodarenko, linux-ext4,
	pravin.shelar
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260610152417.13576-1-ablagodarenko@thelustrecollective.com>

syzbot ci has tested the following series

[v2] Data in direntry (dirdata) feature
https://lore.kernel.org/all/20260610152417.13576-1-ablagodarenko@thelustrecollective.com
* [PATCH v2 01/10] ext4: replace ext4_dir_entry with ext4_dir_entry_2
* [PATCH v2 02/10] ext4: add ext4_dir_entry_is_tail()
* [PATCH v2 03/10] ext4: refactor dx_root to support variable dirent sizes
* [PATCH v2 04/10] ext4: add dirdata format definitions and access helpers
* [PATCH v2 05/10] ext4: preserve dirdata bits in get_dtype()
* [PATCH v2 06/10] ext4: add ext4_dir_entry_len() and harden dirdata parsing
* [PATCH v2 07/10] ext4: rename ext4_dir_rec_len() and clarify dirdata usage
* [PATCH v2 08/10] ext4: dirdata feature
* [PATCH v2 09/10] ext4: add dirdata set/get helpers
* [PATCH v2 10/10] ext4: Add EXT4_IOC_SET_LUFID ioctl for setting LUFID on directory entries

and found the following issues:
* KASAN: slab-out-of-bounds Read in __ext4_check_dir_entry
* KASAN: slab-out-of-bounds Read in ext4_inlinedir_to_tree
* KASAN: slab-use-after-free Read in __ext4_check_dir_entry
* KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree
* KASAN: use-after-free Read in __ext4_check_dir_entry

Full report is available here:
https://ci.syzbot.org/series/5bf0e2fa-2e68-4532-8396-4568879b2788

***

KASAN: slab-out-of-bounds Read in __ext4_check_dir_entry

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/b0854918-13f9-49dd-ab30-12154f0debe2/syz_repro

loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-out-of-bounds in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff8881022db7f5 by task syz.0.23/5815

CPU: 1 UID: 0 PID: 5815 Comm: syz.0.23 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
 __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
 ext4_check_all_de+0x66/0x150 fs/ext4/dir.c:657
 ext4_convert_inline_data_nolock+0x1b7/0x990 fs/ext4/inline.c:1121
 ext4_try_add_inline_entry+0x604/0x8e0 fs/ext4/inline.c:1247
 __ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
 ext4_add_entry fs/ext4/namei.c:2613 [inline]
 ext4_mkdir+0x5e5/0xce0 fs/ext4/namei.c:3175
 vfs_mkdir+0x413/0x630 fs/namei.c:5271
 filename_mkdirat+0x285/0x510 fs/namei.c:5304
 __do_sys_mkdirat fs/namei.c:5325 [inline]
 __se_sys_mkdirat+0x35/0x150 fs/namei.c:5322
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f669359bcc7
Code: 00 66 90 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 db f7 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 b8 02 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd42381d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000102
RAX: ffffffffffffffda RBX: 00007ffd42381dc0 RCX: 00007f669359bcc7
RDX: 00000000000001ff RSI: 0000200000001200 RDI: 00000000ffffff9c
RBP: 00002000000024c0 R08: 0000200000000240 R09: 0000000000000000
R10: 00002000000024c0 R11: 0000000000000246 R12: 0000200000001200
R13: 00007ffd42381d80 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Allocated by task 5066:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __kmalloc_cache_noprof+0x31c/0x660 mm/slub.c:5420
 kmalloc_noprof include/linux/slab.h:950 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 kernfs_get_open_node fs/kernfs/file.c:543 [inline]
 kernfs_fop_open+0x862/0xda0 fs/kernfs/file.c:718
 do_dentry_open+0x822/0x13a0 fs/open.c:947
 vfs_open+0x3b/0x340 fs/open.c:1079
 do_open fs/namei.c:4699 [inline]
 path_openat+0x2e08/0x3860 fs/namei.c:4858
 do_file_open+0x23e/0x4a0 fs/namei.c:4887
 do_sys_openat2+0x113/0x200 fs/open.c:1364
 do_sys_open fs/open.c:1370 [inline]
 __do_sys_openat fs/open.c:1386 [inline]
 __se_sys_openat fs/open.c:1381 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1381
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Last potentially related work creation:
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
 kvfree_call_rcu+0x100/0x430 mm/slab_common.c:1970
 kernfs_unlink_open_file+0x3fe/0x4b0 fs/kernfs/file.c:604
 kernfs_fop_release+0x2eb/0x440 fs/kernfs/file.c:783
 __fput+0x44f/0xa60 fs/file_table.c:510
 fput_close_sync+0x11f/0x240 fs/file_table.c:615
 __do_sys_close fs/open.c:1507 [inline]
 __se_sys_close fs/open.c:1492 [inline]
 __x64_sys_close+0x7e/0x110 fs/open.c:1492
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff8881022db700
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 117 bytes to the right of
 allocated 128-byte region [ffff8881022db700, ffff8881022db780)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1022db
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000000 ffff888100041a00 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2000(__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 0, tgid 0 (swapper/0), ts 2408938923, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 refill_objects+0x339/0x3d0 mm/slub.c:7272
 refill_sheaf mm/slub.c:2816 [inline]
 __pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 __do_kmalloc_node mm/slub.c:5295 [inline]
 __kmalloc_noprof+0x474/0x760 mm/slub.c:5308
 kmalloc_noprof include/linux/slab.h:954 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 __alloc_empty_sheaf mm/slub.c:2768 [inline]
 alloc_empty_sheaf mm/slub.c:2783 [inline]
 __pcs_replace_empty_main+0x2df/0x720 mm/slub.c:4647
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 kmem_cache_alloc_noprof+0x37d/0x650 mm/slub.c:4906
 dup_fd+0x55/0xb40 fs/file.c:390
 copy_files+0xc8/0x120 kernel/fork.c:1639
 copy_process+0x1d94/0x4440 kernel/fork.c:2252
 kernel_clone+0x2d7/0x940 kernel/fork.c:2722
 user_mode_thread+0x110/0x180 kernel/fork.c:2798
 rest_init+0x23/0x300 init/main.c:727
 start_kernel+0x38a/0x3e0 init/main.c:1220
page_owner free stack trace missing

Memory state around the buggy address:
 ffff8881022db680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8881022db700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881022db780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                             ^
 ffff8881022db800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8881022db880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================


***

KASAN: slab-out-of-bounds Read in ext4_inlinedir_to_tree

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/2dff870b-f382-4c93-8d8d-b2291d921224/syz_repro

loop1: lost filesystem error report for type 5 error -117
EXT4-fs (loop1): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_dir_entry_len fs/ext4/ext4.h:4095 [inline]
BUG: KASAN: slab-out-of-bounds in ext4_inlinedir_to_tree+0xda5/0x10d0 fs/ext4/inline.c:1335
Read of size 2 at addr ffff888115a3183c by task syz.1.18/5839

CPU: 1 UID: 0 PID: 5839 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dir_entry_len fs/ext4/ext4.h:4095 [inline]
 ext4_inlinedir_to_tree+0xda5/0x10d0 fs/ext4/inline.c:1335
 ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
 ext4_dx_readdir fs/ext4/dir.c:600 [inline]
 ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
 iterate_dir+0x399/0x570 fs/readdir.c:110
 __do_sys_getdents64 fs/readdir.c:399 [inline]
 __se_sys_getdents64+0xf1/0x280 fs/readdir.c:384
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3e02b9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3e03ad5028 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
RAX: ffffffffffffffda RBX: 00007f3e02e15fa0 RCX: 00007f3e02b9ce59
RDX: 0000000000001000 RSI: 0000200000000f80 RDI: 0000000000000004
RBP: 00007f3e02c32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f3e02e16038 R14: 00007f3e02e15fa0 R15: 00007ffcaa902298
 </TASK>

Allocated by task 5839:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __do_kmalloc_node mm/slub.c:5296 [inline]
 __kmalloc_noprof+0x35c/0x760 mm/slub.c:5308
 kmalloc_noprof include/linux/slab.h:954 [inline]
 ext4_inlinedir_to_tree+0x312/0x10d0 fs/ext4/inline.c:1292
 ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
 ext4_dx_readdir fs/ext4/dir.c:600 [inline]
 ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
 iterate_dir+0x399/0x570 fs/readdir.c:110
 __do_sys_getdents64 fs/readdir.c:399 [inline]
 __se_sys_getdents64+0xf1/0x280 fs/readdir.c:384
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff888115a31800
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes to the right of
 allocated 60-byte region [ffff888115a31800, ffff888115a3183c)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x115a31
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000000 ffff8881000418c0 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5051, tgid 5051 (acpid), ts 27203740677, free_ts 27201732767
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 refill_objects+0x339/0x3d0 mm/slub.c:7272
 refill_sheaf mm/slub.c:2816 [inline]
 __pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 __do_kmalloc_node mm/slub.c:5295 [inline]
 __kmalloc_noprof+0x474/0x760 mm/slub.c:5308
 kmalloc_noprof include/linux/slab.h:954 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 tomoyo_get_name+0x20c/0x590 security/tomoyo/memory.c:173
 tomoyo_parse_name_union+0xd9/0x130 security/tomoyo/util.c:260
 tomoyo_update_path_acl security/tomoyo/file.c:399 [inline]
 tomoyo_write_file+0x3a6/0xc50 security/tomoyo/file.c:1027
 tomoyo_write_domain2 security/tomoyo/common.c:1160 [inline]
 tomoyo_add_entry security/tomoyo/common.c:2177 [inline]
 tomoyo_supervisor+0x1208/0x1570 security/tomoyo/common.c:2238
 tomoyo_audit_path_log security/tomoyo/file.c:169 [inline]
 tomoyo_path_permission+0x25a/0x380 security/tomoyo/file.c:592
 tomoyo_check_open_permission+0x2b2/0x470 security/tomoyo/file.c:782
 security_file_open+0xa9/0x240 security/security.c:2739
 do_dentry_open+0x4a8/0x13a0 fs/open.c:924
 vfs_open+0x3b/0x340 fs/open.c:1079
page last free pid 15 tgid 15 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1397 [inline]
 __free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
 __tlb_remove_table_free mm/mmu_gather.c:228 [inline]
 tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
 rcu_do_batch kernel/rcu/tree.c:2617 [inline]
 rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
 handle_softirqs+0x22a/0x840 kernel/softirq.c:622
 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1076
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff888115a31700: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888115a31780: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
>ffff888115a31800: 00 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc
                                        ^
 ffff888115a31880: 00 00 00 00 00 00 02 fc fc fc fc fc fc fc fc fc
 ffff888115a31900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
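
The "Memory state around the buggy address" rows can be cross-checked against
the reported allocation size. The sketch below assumes generic KASAN's shadow
encoding, where each shadow byte covers 8 bytes of memory: 00 means all 8 are
accessible, 01..07 means only the first N are, and any other value marks the
granule as inaccessible. The helper name is hypothetical. Applied to the row
for ffff888115a31800 (seven 00 bytes followed by 04), it gives 7 * 8 + 4 = 60
bytes, matching the "allocated 60-byte region" line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_GRANULE 8	/* bytes of memory covered by one shadow byte */

/*
 * Number of contiguous accessible bytes at the start of the memory
 * described by shadow[0..n). Stops at the first partial or poisoned
 * granule.
 */
size_t accessible_prefix(const uint8_t *shadow, size_t n)
{
	size_t bytes = 0;

	assert(shadow != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		uint8_t s = shadow[i];

		if (s == 0) {
			bytes += SHADOW_GRANULE;	/* fully accessible */
			continue;
		}
		if (s < SHADOW_GRANULE)
			bytes += s;			/* first s bytes only */
		break;					/* partial or poisoned */
	}
	return bytes;
}
```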


***

KASAN: slab-use-after-free Read in __ext4_check_dir_entry

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/f1d48ea1-6e87-4d64-9c13-8bf8aed109fc/syz_repro

loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff888114d8c045 by task syz.0.20/5821

CPU: 1 UID: 0 PID: 5821 Comm: syz.0.20 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
 __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
 ext4_find_dest_de+0x136/0x770 fs/ext4/namei.c:2203
 ext4_add_dirent_to_inline+0xcf/0x430 fs/ext4/inline.c:984
 ext4_try_add_inline_entry+0x235/0x8e0 fs/ext4/inline.c:1213
 __ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
 ext4_add_entry fs/ext4/namei.c:2613 [inline]
 ext4_add_nondir+0x111/0x310 fs/ext4/namei.c:2936
 ext4_create+0x2e9/0x470 fs/ext4/namei.c:2982
 lookup_open fs/namei.c:4511 [inline]
 open_last_lookups fs/namei.c:4611 [inline]
 path_openat+0x1395/0x3860 fs/namei.c:4855
 do_file_open+0x23e/0x4a0 fs/namei.c:4887
 do_sys_openat2+0x113/0x200 fs/open.c:1364
 do_sys_open fs/open.c:1370 [inline]
 __do_sys_openat fs/open.c:1386 [inline]
 __se_sys_openat fs/open.c:1381 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1381
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f922219ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9223137028 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f9222415fa0 RCX: 00007f922219ce59
RDX: 0000000000042042 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007f9222232d6f R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000014a R11: 0000000000000246 R12: 0000000000000000
R13: 00007f9222416038 R14: 00007f9222415fa0 R15: 00007ffd01a2d448
 </TASK>

Allocated by task 5484:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 unpoison_slab_object mm/kasan/common.c:340 [inline]
 __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4570 [inline]
 slab_alloc_node mm/slub.c:4899 [inline]
 kmem_cache_alloc_node_noprof+0x384/0x690 mm/slub.c:4951
 kmalloc_reserve net/core/skbuff.c:613 [inline]
 __alloc_skb+0x27d/0x7d0 net/core/skbuff.c:713
 alloc_skb include/linux/skbuff.h:1385 [inline]
 nlmsg_new include/net/netlink.h:1055 [inline]
 mpls_netconf_notify_devconf+0x46/0x100 net/mpls/af_mpls.c:1217
 mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
 notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
 call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
 call_netdevice_notifiers net/core/dev.c:2301 [inline]
 unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
 ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
 cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Freed by task 5484:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2689 [inline]
 slab_free mm/slub.c:6251 [inline]
 kfree+0x1c5/0x640 mm/slub.c:6566
 skb_kfree_head net/core/skbuff.c:1075 [inline]
 skb_free_head net/core/skbuff.c:1087 [inline]
 skb_release_data+0x828/0xa60 net/core/skbuff.c:1114
 skb_release_all net/core/skbuff.c:1189 [inline]
 __kfree_skb+0x5d/0x210 net/core/skbuff.c:1203
 netlink_broadcast_filtered+0xe18/0xf20 net/netlink/af_netlink.c:1540
 nlmsg_multicast_filtered include/net/netlink.h:1165 [inline]
 nlmsg_multicast include/net/netlink.h:1184 [inline]
 nlmsg_notify+0xf0/0x1a0 net/netlink/af_netlink.c:2598
 mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
 notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
 call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
 call_netdevice_notifiers net/core/dev.c:2301 [inline]
 unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
 ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
 cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff888114d8c000
 which belongs to the cache skbuff_small_head of size 704
The buggy address is located 69 bytes inside of
 freed 704-byte region [ffff888114d8c000, ffff888114d8c2c0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x114d8c
head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000040 ffff888160416b40 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800120012 00000000f5000000 0000000000000000
head: 017ff00000000040 ffff888160416b40 dead000000000100 dead000000000122
head: 0000000000000000 0000000800120012 00000000f5000000 0000000000000000
head: 017ff00000000002 ffffffffffffff01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000004
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5484, tgid 5484 (kworker/u8:2), ts 72573003529, free_ts 72546506446
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 refill_objects+0x339/0x3d0 mm/slub.c:7272
 refill_sheaf mm/slub.c:2816 [inline]
 __pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 kmem_cache_alloc_node_noprof+0x441/0x690 mm/slub.c:4951
 kmalloc_reserve net/core/skbuff.c:613 [inline]
 __alloc_skb+0x27d/0x7d0 net/core/skbuff.c:713
 alloc_skb include/linux/skbuff.h:1385 [inline]
 nlmsg_new include/net/netlink.h:1055 [inline]
 mpls_netconf_notify_devconf+0x46/0x100 net/mpls/af_mpls.c:1217
 mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
 notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
 call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
 call_netdevice_notifiers net/core/dev.c:2301 [inline]
 unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
 ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
 cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
page last free pid 5484 tgid 5484 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1397 [inline]
 __free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
 stack_depot_save_flags+0x40e/0x810 lib/stackdepot.c:735
 kasan_save_stack mm/kasan/common.c:58 [inline]
 kasan_save_track+0x4f/0x80 mm/kasan/common.c:78
 unpoison_slab_object mm/kasan/common.c:340 [inline]
 __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4570 [inline]
 slab_alloc_node mm/slub.c:4899 [inline]
 kmem_cache_alloc_noprof+0x2bc/0x650 mm/slub.c:4906
 kmem_alloc_batch lib/debugobjects.c:371 [inline]
 fill_pool+0x156/0x580 lib/debugobjects.c:420
 debug_objects_fill_pool lib/debugobjects.c:752 [inline]
 debug_object_activate+0x4a3/0x580 lib/debugobjects.c:841
 debug_rcu_head_queue kernel/rcu/rcu.h:236 [inline]
 __call_rcu_common kernel/rcu/tree.c:3116 [inline]
 call_rcu+0x43/0x890 kernel/rcu/tree.c:3251
 kernfs_put+0x259/0x520 fs/kernfs/dir.c:618
 kernfs_remove_by_name_ns+0xc8/0x140 fs/kernfs/dir.c:1799
 device_remove_class_symlinks+0x178/0x190 drivers/base/core.c:3479
 device_del+0x400/0x8f0 drivers/base/core.c:3881
 unregister_netdevice_many_notify+0x1d5f/0x22c0 net/core/dev.c:12456
 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
 ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
 cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397

Memory state around the buggy address:
 ffff888114d8bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888114d8bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888114d8c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
 ffff888114d8c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888114d8c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
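
Reading aid (not part of the syzbot output): the "69 bytes inside" figure and
the caret position in the memory-state dump above can be re-derived from the
reported addresses alone. The sketch below assumes generic KASAN's granularity
of 8 bytes per shadow byte and the 16-shadow-bytes-per-row layout seen in this
log. The helper name `locate` is hypothetical and used only for illustration.

```python
# Assumption: generic KASAN, where one shadow byte describes 8 bytes of memory.
KASAN_GRANULE = 8
# The dump above prints 16 shadow bytes per row, so one row covers 128 bytes.
SHADOW_PER_ROW = 16


def locate(access_addr, object_start, object_size):
    """Return (offset in object, object end, row base address, shadow index in row)."""
    offset = access_addr - object_start
    object_end = object_start + object_size
    row_bytes = KASAN_GRANULE * SHADOW_PER_ROW
    row_base = access_addr - (access_addr % row_bytes)
    shadow_index = (access_addr - row_base) // KASAN_GRANULE
    return offset, object_end, row_base, shadow_index


# Values copied from the first report: the access address, the object start,
# and the skbuff_small_head object size.
offset, end, row_base, idx = locate(0xffff888114d8c045, 0xffff888114d8c000, 704)
print(offset, hex(end), hex(row_base), idx)
# prints: 69 0xffff888114d8c2c0 0xffff888114d8c000 8
```

Index 8 is the ninth shadow byte of the row marked with `>`, which is where the
caret points. That byte is `fb`, so the access landed in memory KASAN has marked
as freed.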


***

KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/f42da242-e16e-4f10-bf25-0bd7e192d989/syz_repro

loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-use-after-free in ext4_inlinedir_to_tree+0x94c/0x10d0 fs/ext4/inline.c:1335
Read of size 1 at addr ffff88816fee8825 by task syz.0.20/5867

CPU: 1 UID: 0 PID: 5867 Comm: syz.0.20 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
 ext4_inlinedir_to_tree+0x94c/0x10d0 fs/ext4/inline.c:1335
 ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
 ext4_dx_readdir fs/ext4/dir.c:600 [inline]
 ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
 iterate_dir+0x399/0x570 fs/readdir.c:110
 __do_sys_getdents fs/readdir.c:319 [inline]
 __se_sys_getdents+0xf1/0x270 fs/readdir.c:304
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f010ad9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f010bc0f028 EFLAGS: 00000246 ORIG_RAX: 000000000000004e
RAX: ffffffffffffffda RBX: 00007f010b015fa0 RCX: 00007f010ad9ce59
RDX: 0000000000000054 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f010ae32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f010b016038 R14: 00007f010b015fa0 R15: 00007ffd93577348
 </TASK>

Allocated by task 5064:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __do_kmalloc_node mm/slub.c:5296 [inline]
 __kmalloc_noprof+0x35c/0x760 mm/slub.c:5308
 kmalloc_noprof include/linux/slab.h:954 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
 tomoyo_encode+0x28b/0x550 security/tomoyo/realpath.c:80
 tomoyo_realpath_from_path+0x58d/0x5d0 security/tomoyo/realpath.c:283
 tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
 tomoyo_path_perm+0x283/0x560 security/tomoyo/file.c:827
 security_inode_getattr+0x12b/0x310 security/security.c:1895
 vfs_getattr fs/stat.c:259 [inline]
 vfs_fstat fs/stat.c:281 [inline]
 vfs_fstatat+0xb4/0x170 fs/stat.c:371
 __do_sys_newfstatat fs/stat.c:538 [inline]
 __se_sys_newfstatat fs/stat.c:532 [inline]
 __x64_sys_newfstatat+0x151/0x200 fs/stat.c:532
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5064:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2689 [inline]
 slab_free mm/slub.c:6251 [inline]
 kfree+0x1c5/0x640 mm/slub.c:6566
 tomoyo_path_perm+0x403/0x560 security/tomoyo/file.c:847
 security_inode_getattr+0x12b/0x310 security/security.c:1895
 vfs_getattr fs/stat.c:259 [inline]
 vfs_fstat fs/stat.c:281 [inline]
 vfs_fstatat+0xb4/0x170 fs/stat.c:371
 __do_sys_newfstatat fs/stat.c:538 [inline]
 __se_sys_newfstatat fs/stat.c:532 [inline]
 __x64_sys_newfstatat+0x151/0x200 fs/stat.c:532
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88816fee8800
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 37 bytes inside of
 freed 64-byte region [ffff88816fee8800, ffff88816fee8840)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x16fee8
flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 057ff00000000000 ffff8881000418c0 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 21294026082, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 refill_objects+0x339/0x3d0 mm/slub.c:7272
 refill_sheaf mm/slub.c:2816 [inline]
 __pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
 alloc_from_pcs mm/slub.c:4750 [inline]
 slab_alloc_node mm/slub.c:4884 [inline]
 __do_kmalloc_node mm/slub.c:5295 [inline]
 __kmalloc_noprof+0x474/0x760 mm/slub.c:5308
 kmalloc_noprof include/linux/slab.h:954 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 handler_new_ref+0x261/0x9c0 drivers/media/v4l2-core/v4l2-ctrls-core.c:1882
 v4l2_ctrl_add_handler+0x19f/0x290 drivers/media/v4l2-core/v4l2-ctrls-core.c:2443
 vivid_create_controls+0x332d/0x3bd0 drivers/media/test-drivers/vivid/vivid-ctrls.c:2072
 vivid_create_instance drivers/media/test-drivers/vivid/vivid-core.c:1933 [inline]
 vivid_probe+0x4261/0x72b0 drivers/media/test-drivers/vivid/vivid-core.c:2095
 platform_probe+0xf9/0x190 drivers/base/platform.c:1432
 call_driver_probe drivers/base/dd.c:-1 [inline]
 really_probe+0x267/0xaf0 drivers/base/dd.c:709
 __driver_probe_device+0x1ef/0x380 drivers/base/dd.c:871
 driver_probe_device+0x4f/0x240 drivers/base/dd.c:901
 __driver_attach+0x34c/0x640 drivers/base/dd.c:1295
page_owner free stack trace missing

Memory state around the buggy address:
 ffff88816fee8700: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
 ffff88816fee8780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
>ffff88816fee8800: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                               ^
 ffff88816fee8880: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88816fee8900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================


***

KASAN: use-after-free Read in __ext4_check_dir_entry

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/57c0b75a-8922-4dc1-9a20-ca947564792b/syz_repro

==================================================================
BUG: KASAN: use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: use-after-free in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff88816be85045 by task syz.2.21/5880

CPU: 1 UID: 0 PID: 5880 Comm: syz.2.21 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
 __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
 ext4_find_dest_de+0x136/0x770 fs/ext4/namei.c:2203
 ext4_add_dirent_to_inline+0xcf/0x430 fs/ext4/inline.c:984
 ext4_try_add_inline_entry+0x235/0x8e0 fs/ext4/inline.c:1213
 __ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
 ext4_add_entry fs/ext4/namei.c:2613 [inline]
 ext4_add_nondir+0x111/0x310 fs/ext4/namei.c:2936
 ext4_create+0x2e9/0x470 fs/ext4/namei.c:2982
 lookup_open fs/namei.c:4511 [inline]
 open_last_lookups fs/namei.c:4611 [inline]
 path_openat+0x1395/0x3860 fs/namei.c:4855
 do_file_open+0x23e/0x4a0 fs/namei.c:4887
 do_sys_openat2+0x113/0x200 fs/open.c:1364
 do_sys_open fs/open.c:1370 [inline]
 __do_sys_openat fs/open.c:1386 [inline]
 __se_sys_openat fs/open.c:1381 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1381
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5713b9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff672b25f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f5713e15fa0 RCX: 00007f5713b9ce59
RDX: 0000000000042042 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007f5713c32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000014a R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5713e15fac R14: 00007f5713e15fa0 R15: 00007f5713e15fa0
 </TASK>

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x16be85
flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f0(buddy)
raw: 057ff00000000000 ffffea0005afa0c8 ffffea0005afa1c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000f0000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 5630, tgid 5630 (syz-executor), ts 67290853657, free_ts 69321168948
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
 prep_new_page mm/page_alloc.c:1861 [inline]
 get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
 __alloc_pages_noprof+0x10/0x100 mm/page_alloc.c:5255
 alloc_pages_bulk_noprof+0x5ff/0x7c0 mm/page_alloc.c:5175
 ___alloc_pages_bulk mm/kasan/shadow.c:345 [inline]
 __kasan_populate_vmalloc_do mm/kasan/shadow.c:370 [inline]
 __kasan_populate_vmalloc+0xc1/0x1d0 mm/kasan/shadow.c:424
 kasan_populate_vmalloc include/linux/kasan.h:580 [inline]
 alloc_vmap_area+0xd47/0x1480 mm/vmalloc.c:2123
 __get_vm_area_node+0x1f8/0x300 mm/vmalloc.c:3226
 __vmalloc_node_range_noprof+0x36a/0x1750 mm/vmalloc.c:4024
 vmalloc_user_noprof+0xad/0xe0 mm/vmalloc.c:4218
 kcov_ioctl+0x55/0x620 kernel/kcov.c:726
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5693 tgid 5693 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1397 [inline]
 __free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
 kasan_depopulate_vmalloc_pte+0x6d/0x90 mm/kasan/shadow.c:484
 apply_to_pte_range mm/memory.c:3338 [inline]
 apply_to_pmd_range mm/memory.c:3382 [inline]
 apply_to_pud_range mm/memory.c:3418 [inline]
 apply_to_p4d_range mm/memory.c:3454 [inline]
 __apply_to_page_range+0xbdc/0x1420 mm/memory.c:3490
 __kasan_release_vmalloc+0xa2/0xd0 mm/kasan/shadow.c:602
 kasan_release_vmalloc include/linux/kasan.h:593 [inline]
 kasan_release_vmalloc_node mm/vmalloc.c:2284 [inline]
 purge_vmap_node+0x220/0x960 mm/vmalloc.c:2306
 __purge_vmap_area_lazy+0x779/0xb40 mm/vmalloc.c:2396
 drain_vmap_area_work+0x27/0x40 mm/vmalloc.c:2430
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff88816be84f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88816be84f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88816be85000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                           ^
 ffff88816be85080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88816be85100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
