* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-12 13:23 UTC
To: Christoph Hellwig
Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260612052831.GA9010@lst.de>
On Fri, Jun 12, 2026 at 07:28:31AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> > On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > > It's entirely possible a device supports byte aligned addresses. The
> > > > block layer just doesn't let a driver report that. So either it really
> > > > was successful because you found a bug that skips the alignment checks,
> > > > or your device silently corrupted your payload.
> >
> > I tried this on different hardware, and I find it hard to believe all
> > those devices were corrupting the payload.
>
> I think in the other thread we agreed that we are currently missing
> the alignment check for fast-path bios not hitting the splitting code,
> so maybe that is something you see. Additionally we're missing the
> checks for purely bio based drivers not calling the splitting helper
> at all, but I don't think that applies here.
>
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > > > though, in not taking the optimization when it was possible. So here's
> > > > an alternative suggestion that should get things working as expected:
> > >
> > > The fix below looks like it is addressing a real bug. I'm not sure if
> > > Carlos is hitting it, but we were missing the alignment checks for
> > > single-bvec fast path bios so far indeed.
> >
> > You left context out so I'm assuming by the fix you meant Keith's patch.
>
> Yes.
The fix does seem to resolve the behavior I'm seeing. Keith, could you Cc
me if you end up sending an official version?
* [PATCH v3] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12 13:19 UTC
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang, ebiggers, yun.zhou
Cc: linux-ext4, linux-kernel
In-Reply-To: <20260612095846.1024470-1-yun.zhou@windriver.com>
ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem. If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:
  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
      ext4_xattr_block_set()
        iput(ea_inode)
          write_inode_now()
            ext4_writepages()
              s_writepages_rwsem (read)   [DEADLOCK]
This forms a circular dependency on lock classes:
s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism. The array is threaded
through the call chain and freed by callers after releasing xattr_sem.
Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Address AI review feedback on v2:
- Check ext4_expand_inode_array() return value; fall back to
direct iput() on ENOMEM to prevent inode leak.
- Make ext4_xattr_set_handle() take an optional ea_inode_array
output parameter so callers can free after ext4_journal_stop(),
avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
- Pass ea_inode_array directly to ext4_xattr_release_block()
instead of using a local array freed under xattr_sem.
- Move ext4_xattr_inode_array_free() after ext4_journal_stop()
v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
freed after xattr_sem is released. Fixes the root cause.
v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
on inodes being deleted. Only fixes the syzbot reproducer, not
the underlying lock ordering violation.
fs/ext4/acl.c | 2 +-
fs/ext4/crypto.c | 4 ++--
fs/ext4/inode.c | 13 ++++++----
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++--------------
fs/ext4/xattr.h | 6 +++--
fs/ext4/xattr_security.c | 3 ++-
6 files changed, 51 insertions(+), 28 deletions(-)
diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
}
error = ext4_xattr_set_handle(handle, inode, name_index, "",
- value, size, xattr_flags);
+ value, size, xattr_flags, NULL);
kfree(value);
if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
res = ext4_xattr_set_handle(handle, inode,
EXT4_XATTR_INDEX_ENCRYPTION,
EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
- ctx, len, XATTR_CREATE);
+ ctx, len, XATTR_CREATE, NULL);
if (!res) {
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
- ctx, len, 0);
+ ctx, len, 0, NULL);
if (!res) {
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..2cf68d27e896 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
static int __ext4_expand_extra_isize(struct inode *inode,
unsigned int new_extra_isize,
struct ext4_iloc *iloc,
- handle_t *handle, int *no_expand)
+ handle_t *handle, int *no_expand,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_inode *raw_inode;
struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
/* try to expand with EAs present */
error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
- raw_inode, handle);
+ raw_inode, handle, ea_inode_array);
if (error) {
/*
* Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
{
int no_expand;
int error;
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
return -EOVERFLOW;
@@ -6496,8 +6498,9 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
return -EBUSY;
error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
- handle, &no_expand);
+ handle, &no_expand, &ea_inode_array);
ext4_write_unlock_xattr(inode, &no_expand);
+ ext4_xattr_inode_array_free(ea_inode_array);
return error;
}
@@ -6509,6 +6512,7 @@ int ext4_expand_extra_isize(struct inode *inode,
handle_t *handle;
int no_expand;
int error, rc;
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
brelse(iloc->bh);
@@ -6534,7 +6538,7 @@ int ext4_expand_extra_isize(struct inode *inode,
}
error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
- handle, &no_expand);
+ handle, &no_expand, &ea_inode_array);
rc = ext4_mark_iloc_dirty(handle, inode, iloc);
if (!error)
@@ -6543,6 +6547,7 @@ int ext4_expand_extra_isize(struct inode *inode,
out_unlock:
ext4_write_unlock_xattr(inode, &no_expand);
ext4_journal_stop(handle);
+ ext4_xattr_inode_array_free(ea_inode_array);
return error;
}
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..fa9a16c86fd8 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
static int
ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
- struct ext4_xattr_block_find *bs)
+ struct ext4_xattr_block_find *bs,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct super_block *sb = inode->i_sb;
struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
ext4_warning_inode(ea_inode,
"dec ref error=%d",
error);
- iput(ea_inode);
+ if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+ iput(ea_inode);
ea_inode = NULL;
}
@@ -2190,12 +2192,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
/* Drop the previous xattr block. */
if (bs->bh && bs->bh != new_bh) {
- struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
ext4_xattr_release_block(handle, inode, bs->bh,
- &ea_inode_array,
+ ea_inode_array,
0 /* extra_credits */);
- ext4_xattr_inode_array_free(ea_inode_array);
}
error = 0;
@@ -2211,7 +2210,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
ext4_xattr_inode_free_quota(inode, ea_inode,
i_size_read(ea_inode));
}
- iput(ea_inode);
+ if (ext4_expand_inode_array(ea_inode_array, ea_inode))
+ iput(ea_inode);
}
if (ce)
mb_cache_entry_put(ea_block_cache, ce);
@@ -2356,7 +2356,7 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
int
ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
const char *name, const void *value, size_t value_len,
- int flags)
+ int flags, struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_info i = {
.name_index = name_index,
@@ -2371,6 +2371,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
struct ext4_xattr_block_find bs = {
.s = { .not_found = -ENODATA, },
};
+ struct ext4_xattr_inode_array *local_array = NULL;
int no_expand;
int error;
@@ -2379,6 +2380,9 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
if (strlen(name) > 255)
return -ERANGE;
+ if (!ea_inode_array)
+ ea_inode_array = &local_array;
+
ext4_write_lock_xattr(inode, &no_expand);
/* Check journal credits under write lock. */
@@ -2438,7 +2442,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
if (!is.s.not_found)
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
else if (!bs.s.not_found)
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ ea_inode_array);
} else {
error = 0;
/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2460,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (!error && !bs.s.not_found) {
i.value = NULL;
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ ea_inode_array);
} else if (error == -ENOSPC) {
if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
brelse(bs.bh);
@@ -2464,7 +2470,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
if (error)
goto cleanup;
}
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ ea_inode_array);
if (!error && !is.s.not_found) {
i.value = NULL;
error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2510,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
brelse(is.iloc.bh);
brelse(bs.bh);
ext4_write_unlock_xattr(inode, &no_expand);
+ ext4_xattr_inode_array_free(local_array);
return error;
}
@@ -2547,6 +2555,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
{
handle_t *handle;
struct super_block *sb = inode->i_sb;
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
int error, retries = 0;
int credits;
@@ -2567,10 +2576,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
int error2;
error = ext4_xattr_set_handle(handle, inode, name_index, name,
- value, value_len, flags);
+ value, value_len, flags,
+ &ea_inode_array);
ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
handle);
error2 = ext4_journal_stop(handle);
+ ext4_xattr_inode_array_free(ea_inode_array);
+ ea_inode_array = NULL;
if (error == -ENOSPC &&
ext4_should_retry_alloc(sb, &retries))
goto retry;
@@ -2612,7 +2624,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
*/
static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
struct ext4_inode *raw_inode,
- struct ext4_xattr_entry *entry)
+ struct ext4_xattr_entry *entry,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_find *is = NULL;
struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2689,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
goto out;
/* Move ea entry from the inode into the block */
- error = ext4_xattr_block_set(handle, inode, &i, bs);
+ error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
if (error)
goto out;
@@ -2702,7 +2715,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
struct ext4_inode *raw_inode,
int isize_diff, size_t ifree,
- size_t bfree, int *total_ino)
+ size_t bfree, int *total_ino,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2766,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
total_size += EXT4_XATTR_SIZE(
le32_to_cpu(entry->e_value_size));
error = ext4_xattr_move_to_block(handle, inode, raw_inode,
- entry);
+ entry, ea_inode_array);
if (error)
return error;
@@ -2769,7 +2783,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
* Returns 0 on success or negative error number on failure.
*/
int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
- struct ext4_inode *raw_inode, handle_t *handle)
+ struct ext4_inode *raw_inode, handle_t *handle,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_header *header;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2856,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
isize_diff, ifree, bfree,
- &total_ino);
+ &total_ino, ea_inode_array);
if (error) {
if (error == -ENOSPC && !tried_min_extra_isize &&
s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..9c3f1a96895d 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,8 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+ const void *, size_t, int, struct ext4_xattr_inode_array **);
extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
bool is_create, int *credits);
extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +193,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
- struct ext4_inode *raw_inode, handle_t *handle);
+ struct ext4_inode *raw_inode, handle_t *handle,
+ struct ext4_xattr_inode_array **ea_inode_array);
extern void ext4_evict_ea_inode(struct inode *inode);
extern const struct xattr_handler * const ext4_xattr_handlers[];
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
err = ext4_xattr_set_handle(handle, inode,
EXT4_XATTR_INDEX_SECURITY,
xattr->name, xattr->value,
- xattr->value_len, XATTR_CREATE);
+ xattr->value_len, XATTR_CREATE,
+ NULL);
if (err < 0)
break;
}
--
2.43.0
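The deferral idea in the commit message above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the ext4 code: `struct object`, `struct deferred_array`, `deferred_array_add()`, `deferred_array_free()`, and the `lock_held` flag are all hypothetical stand-ins (for ea_inode, ext4_xattr_inode_array, ext4_expand_inode_array(), ext4_xattr_inode_array_free(), and xattr_sem respectively). It assumes only what the message states: references are queued while the lock is held, dropped after it is released, and dropped directly if queuing fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names are ext4 APIs. */
struct object {
	int refcount;
	bool released_under_lock;	/* true if the final put ran with the lock held */
};

struct deferred_array {
	size_t count;
	size_t capacity;
	struct object **items;
};

bool lock_held;	/* stands in for xattr_sem */

/* Drop one reference; the final put is the expensive step we want to defer. */
void object_put(struct object *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->released_under_lock = lock_held;
}

/* Queue obj for a later put. Returns 0 on success, -1 if allocation fails. */
int deferred_array_add(struct deferred_array **arrp, struct object *obj)
{
	struct deferred_array *arr = *arrp;

	if (!arr) {
		arr = calloc(1, sizeof(*arr));
		if (!arr)
			return -1;
		*arrp = arr;
	}
	if (arr->count == arr->capacity) {
		size_t new_cap = arr->capacity ? arr->capacity * 2 : 4;
		struct object **items = realloc(arr->items,
						new_cap * sizeof(*items));

		if (!items)
			return -1;
		arr->items = items;
		arr->capacity = new_cap;
	}
	arr->items[arr->count++] = obj;
	return 0;
}

/* Drop every queued reference. Call only after the lock is released. */
void deferred_array_free(struct deferred_array *arr)
{
	if (!arr)
		return;
	for (size_t i = 0; i < arr->count; i++)
		object_put(arr->items[i]);
	free(arr->items);
	free(arr);
}

/* Runs with the lock held; queues the put instead of performing it. */
void locked_section(struct object *obj, struct deferred_array **arrp)
{
	lock_held = true;
	if (deferred_array_add(arrp, obj))
		object_put(obj);	/* fallback: do not leak the reference */
	lock_held = false;
}

/* Returns true if the final put ran only after the lock was dropped. */
bool demo_deferred_put(void)
{
	struct object obj = { .refcount = 1, .released_under_lock = false };
	struct deferred_array *arr = NULL;
	bool still_referenced;

	locked_section(&obj, &arr);
	still_referenced = (obj.refcount == 1);	/* queued, not yet dropped */
	deferred_array_free(arr);
	return still_referenced && obj.refcount == 0 &&
	       !obj.released_under_lock;
}
```

Calling `demo_deferred_put()` from a `main()` exercises the whole path. The fallback branch in `locked_section()` mirrors the v3 changelog item about dropping the reference directly when the array cannot grow, at the cost of doing the put under the lock in that rare case.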
* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Jan Kara @ 2026-06-12 12:55 UTC
To: Baokun Li
Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-3-libaokun@linux.alibaba.com>
On Fri 12-06-26 00:34:41, Baokun Li wrote:
> For unaligned DIO writes, the previous ext4_overwrite_io() required the
> entire range to fall within a single written extent. This was overly
> conservative: the DIO layer only performs partial block zeroing for the
> head and tail blocks when they are partially covered by the write.
> Middle blocks that are fully covered are written as whole blocks
> without any zeroing, so they are safe regardless of extent state.
>
> Therefore exclusive lock is only required when partial block zeroing
> will actually happen:
> - The head partial block (if any) lands on a hole or unwritten extent.
> - The tail partial block (if any) lands on a hole or unwritten extent.
>
> Middle full-cover blocks can be in any state (hole, unwritten, or
> written) - block allocation under shared lock is safe per the previous
> patch's analysis (inode_dio_begin + i_data_sem protection).
>
> Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
> directly answers the question driving the lock decision. It uses at
> most two ext4_map_blocks() calls: one for the head partial block (also
> catching the case where it spans through the tail), and one for the
> tail partial block if not already covered.
>
> This enables shared lock for previously-rejected scenarios such as:
> - Unaligned write spanning written extent + mid-range hole + written
> extent at the tail.
> - Unaligned write where the partial blocks land on written extents but
> the middle has unwritten extents.
>
> Performance:
>
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
>
> Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
> Each stripe is laid out as [written][unwritten][unwritten][written],
> so the head and tail partial blocks land on written extents but the
> middle is unwritten. Metric: IOPS.
>
> JOBS    Before     After   Speedup
> ----  --------  --------   -------
>    1    15,547    17,381     1.12x
>    2    15,910    34,172     2.15x
>    4    15,014    57,567     3.83x
>    8    15,022    81,947     5.46x
>   16    14,586    99,126     6.80x
>   32    14,047    92,519     6.59x
>
> Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
>
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 73 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6f3886465ce3..aa926e641739 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
> return false;
> }
>
> -/* Is IO overwriting allocated or initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode,
> - loff_t pos, loff_t len, bool *unwritten)
> +/*
> + * Does an unaligned DIO write require partial block zeroing?
> + *
> + * Partial block zeroing is performed only for the head and tail blocks
> + * when they are partially covered by the write and the underlying extent
> + * is a hole or unwritten. Middle blocks (fully covered by the write)
> + * are written as whole blocks without zeroing.
> + *
> + * When zeroing is required, two concurrent unaligned DIO writes to the
> + * same partial block can race and corrupt each other's data, so the
> + * caller must take the exclusive i_rwsem and drain in-flight DIO. When
> + * zeroing is not required, shared lock is safe -- block allocation and
> + * unwritten conversion for middle blocks are protected by i_data_sem
> + * and inode_dio_begin().
> + */
> +static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
> {
> struct ext4_map_blocks map;
> unsigned int blkbits = inode->i_blkbits;
> - int err, blklen;
> + unsigned long blockmask = inode->i_sb->s_blocksize - 1;
> + bool head_partial, tail_partial;
> + ext4_lblk_t head_lblk, tail_lblk;
> + int err;
>
> if (pos + len > i_size_read(inode))
> - return false;
> + return true;
>
> - map.m_lblk = pos >> blkbits;
> - map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
> - blklen = map.m_len;
> + head_partial = (pos & blockmask) != 0;
> + tail_partial = ((pos + len) & blockmask) != 0;
> + head_lblk = pos >> blkbits;
> + tail_lblk = (pos + len - 1) >> blkbits;
> +
> + /* Check the head partial block. */
> + if (head_partial) {
> + map.m_lblk = head_lblk;
> + map.m_len = tail_lblk - head_lblk + 1;
> + err = ext4_map_blocks(NULL, inode, &map, 0);
> + if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> + return true;
> + /* If this mapping already covers the tail block, we're done. */
> + if (!tail_partial || map.m_lblk + err > tail_lblk)
> + return false;
> + }
>
> - err = ext4_map_blocks(NULL, inode, &map, 0);
> - if (err != blklen)
> - return false;
> - /*
> - * 'err==len' means that all of the blocks have been preallocated,
> - * regardless of whether they have been initialized or not. We need to
> - * check m_flags to distinguish the unwritten extents.
> - */
> - *unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
> - return true;
> + /* Check the tail partial block. */
> + if (tail_partial) {
> + map.m_lblk = tail_lblk;
> + map.m_len = 1;
> + err = ext4_map_blocks(NULL, inode, &map, 0);
> + if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> + return true;
> + }
> +
> + return false;
> }
>
> static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
> @@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> * i_data_sem serializes concurrent extent tree modifications.
> *
> * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> - * only safe for pure written-extent overwrites. Unwritten extents or
> - * holes require exclusive lock because concurrent partial block zeroing
> - * in the DIO layer could corrupt data.
> + * safe unless the DIO layer needs to perform partial block zeroing --
> + * i.e. the head or tail partial block sits on a hole or unwritten
> + * extent. In that case upgrade to the exclusive lock and drain
> + * in-flight DIO to avoid races with concurrent partial block zeroing.
> */
> static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> bool *ilock_shared, bool *extend,
> @@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> loff_t offset;
> size_t count;
> ssize_t ret;
> - bool overwrite = true, unaligned_io, unwritten = false;
> + bool needs_zeroing = false;
>
> restart:
> ret = ext4_generic_write_checks(iocb, from);
> @@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> offset = iocb->ki_pos;
> count = ret;
>
> - unaligned_io = ext4_unaligned_io(inode, from, offset);
> *extend = ext4_extending_io(inode, offset, count);
>
> /*
> - * For unaligned writes we need to know the extent state to determine
> - * whether shared lock is safe. For aligned writes we skip this check
> - * entirely since allocation under shared lock is safe.
> + * For unaligned writes, check whether partial block zeroing will be
> + * needed. If so, exclusive lock is required to serialize against
> + * concurrent DIO that could race with the zeroing.
> + *
> + * For aligned writes we skip this check entirely since allocation
> + * under shared lock is safe.
> */
> - if (unaligned_io)
> - overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> + if (ext4_unaligned_io(inode, from, offset))
> + needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
>
> /* Determine whether we need to upgrade to an exclusive lock. */
> if (*ilock_shared &&
> - ((!IS_NOSEC(inode) || *extend ||
> - (unaligned_io && (!overwrite || unwritten))))) {
> + (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> ret = -EAGAIN;
> goto out;
> @@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> /*
> * Now that locking is settled, determine dio flags and exclusivity
> * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
> - * behavior already. The inode lock is already held exclusive if the
> - * write is unaligned non-overwrite or extending, so drain all
> - * outstanding dio and set the force wait dio flag.
> + * behavior already. When holding the exclusive lock for a write that
> + * needs partial block zeroing or is extending the file, we must wait
> + * for the I/O to complete synchronously:
> + *
> + * - needs_zeroing: drain in-flight DIO whose end_io could race with
> + * our partial block zeroing, and force synchronous completion so we
> + * don't leave in-flight zeroing bios for the next writer to drain.
> + *
> + * - extend: the caller must update i_disksize after I/O completion,
> + * which requires the data to be on disk first.
> */
> - if (!*ilock_shared && (unaligned_io || *extend)) {
> + if (!*ilock_shared && (needs_zeroing || *extend)) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> ret = -EAGAIN;
> goto out;
> }
> - if (unaligned_io && (!overwrite || unwritten))
> + if (needs_zeroing)
> inode_dio_wait(inode);
> *dio_flags = IOMAP_DIO_FORCE_WAIT;
> }
> --
> 2.43.7
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
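The head/tail arithmetic used by ext4_dio_needs_zeroing() in the quoted patch can be tried in isolation. The following is a minimal userspace sketch, not the kernel function: `struct dio_range`, `classify_range()`, `enum blk_state`, and `needs_zeroing()` are hypothetical names, and the per-block state array replaces the ext4_map_blocks() lookups. It also omits the patch's early return for writes that end beyond i_size. The 4 KiB block size in the example is an assumption; the message says only "ext4 default mkfs".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block state, standing in for an extent lookup. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

struct dio_range {
	bool head_partial;	/* write starts in the middle of a block */
	bool tail_partial;	/* write ends in the middle of a block */
	uint64_t head_lblk;	/* first logical block touched */
	uint64_t tail_lblk;	/* last logical block touched */
};

/* Same arithmetic as the patch; blkbits is log2(block size), len > 0. */
struct dio_range classify_range(uint64_t pos, uint64_t len,
				unsigned int blkbits)
{
	uint64_t blockmask = ((uint64_t)1 << blkbits) - 1;
	struct dio_range r;

	assert(len > 0);
	r.head_partial = (pos & blockmask) != 0;
	r.tail_partial = ((pos + len) & blockmask) != 0;
	r.head_lblk = pos >> blkbits;
	r.tail_lblk = (pos + len - 1) >> blkbits;
	return r;
}

/*
 * Zeroing is needed only if a partially covered head or tail block is
 * not a written block. Middle blocks are ignored. state[] is indexed by
 * logical block number and must cover tail_lblk.
 */
bool needs_zeroing(const enum blk_state *state, uint64_t pos, uint64_t len,
		   unsigned int blkbits)
{
	struct dio_range r = classify_range(pos, len, blkbits);

	if (r.head_partial && state[r.head_lblk] != BLK_WRITTEN)
		return true;
	if (r.tail_partial && state[r.tail_lblk] != BLK_WRITTEN)
		return true;
	return false;
}
```

With 4 KiB blocks (`blkbits = 12`), the benchmark's write of 14336 bytes at offset 512 gives head block 0 and tail block 3, both partial. For the layout [written][unwritten][unwritten][written], `needs_zeroing()` returns false, which is the case the patch moves to the shared lock.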
* Re: [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Jan Kara @ 2026-06-12 12:46 UTC
To: Baokun Li
Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
ritesh.list, peng_wang
In-Reply-To: <20260611163441.2431805-2-libaokun@linux.alibaba.com>
On Fri 12-06-26 00:34:40, Baokun Li wrote:
> Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
> determine if a write is a pure overwrite, and upgrades to exclusive
> i_rwsem if not. However, ext4_overwrite_io() uses a single
> ext4_map_blocks() call which only returns the first contiguous extent of
> the same type. A write spanning multiple pre-allocated extents (e.g.
> written + unwritten, or two physically discontiguous written extents)
> produces a false negative, forcing an unnecessary exclusive lock upgrade.
>
> After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
> writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
> useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
> accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
> journal transaction. The iomap iteration naturally handles multi-extent
> ranges: each call returns the mapping for the current segment, and
> unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
> This means the common case of mixed written/unwritten extents never
> reaches ext4_iomap_alloc() at all.
>
> Even for the less common case where the range contains a hole and
> ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
> aligned non-extending writes:
>
> - truncate/punch_hole are kept out: they require exclusive i_rwsem
> (blocked by our shared lock during allocation), and inode_dio_begin()
> keeps their inode_dio_wait() blocked until in-flight bios complete.
> - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
> extent tree modifications (parallel writers to the same hole).
> - The journal handle is per-thread and does not require i_rwsem
> exclusion.
> - i_disksize and orphan list are not involved in non-extending writes.
>
> Skip the ext4_overwrite_io() check entirely for aligned writes by
> initializing overwrite to true and only calling ext4_overwrite_io() for
> unaligned writes. Unaligned writes still need the extent state check
> because concurrent partial block zeroing in the DIO layer requires
> exclusive serialization unless the range is a pure written-extent
> overwrite.
>
> Performance:
>
> Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
> Filesystem: ext4 default mkfs
>
> Aligned 8K DIO writes spanning written+unwritten extent boundaries.
> Each thread writes its own 1G region sequentially; the file is rebuilt
> between runs so every block is written exactly once. Metric: IOPS.
>
> JOBS    Before     After   Speedup
> ----  --------  --------   -------
>    1    42,322    43,329     1.02x
>    2    68,516    70,677     1.03x
>    4    62,489    97,072     1.55x
>    8    58,701   110,819     1.89x
>   16    58,569   116,392     1.99x
>   32    60,860   117,244     1.93x
>
> Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
>
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Uh, that's a significant change. I have to say I feel slightly uneasy :). But
I don't see a hole in your justification and the patch looks good. Nice
find and feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
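The locking change described in the quoted commit message comes down to one boolean condition. The following is a minimal sketch comparing that condition before and after the patch, using the expressions from the quoted diff. `struct dio_write` and both function names are hypothetical, and the `overwrite`/`unwritten` fields model what ext4_overwrite_io() would report if it were called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the inputs to the lock decision. */
struct dio_write {
	bool nosec;	/* IS_NOSEC(inode): no security update needed */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block aligned */
	bool overwrite;	/* ext4_overwrite_io() result, if it were called */
	bool unwritten;	/* mapped range is an unwritten extent */
};

/* Before the patch: any non-overwrite forces the exclusive lock. */
bool needs_exclusive_before(const struct dio_write *w)
{
	assert(w != NULL);
	return !w->nosec || w->extend || !w->overwrite ||
	       (w->unaligned && w->unwritten);
}

/*
 * After the patch: the overwrite check is consulted only for unaligned
 * writes; aligned writes keep the defaults overwrite = true and
 * unwritten = false.
 */
bool needs_exclusive_after(const struct dio_write *w)
{
	bool overwrite = true, unwritten = false;

	assert(w != NULL);
	if (w->unaligned) {
		overwrite = w->overwrite;
		unwritten = w->unwritten;
	}
	return !w->nosec || w->extend ||
	       (w->unaligned && (!overwrite || unwritten));
}
```

An aligned, non-extending write for which the single-extent check reports a non-overwrite (for example, one spanning a written and an unwritten extent) yields true from `needs_exclusive_before()` and false from `needs_exclusive_after()`. Extending writes and writes that need a security update behave the same in both versions.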
> ---
> fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
> 1 file changed, 30 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..6f3886465ce3 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> * condition requires an exclusive inode lock. If yes, then we restart the
> * whole operation by releasing the shared lock and acquiring exclusive lock.
> *
> - * - For unaligned_io we never take shared lock as it may cause data corruption
> - * when two unaligned IO tries to modify the same block e.g. while zeroing.
> + * The decision is layered, evaluated in this order:
> *
> - * - For extending writes case we don't take the shared lock, since it requires
> - * updating inode i_disksize and/or orphan handling with exclusive lock.
> + * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
> + * to the exclusive lock -- the security update itself requires it,
> + * regardless of whether the write extends the file or is aligned.
> *
> - * - shared locking will only be true mostly with overwrites, including
> - * initialized blocks and unwritten blocks.
> + * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
> + * lock to safely update i_disksize and the orphan list, regardless of
> + * alignment.
> *
> - * - Otherwise we will switch to exclusive i_rwsem lock.
> + * 3. Otherwise, for aligned non-extending writes, shared lock is always
> + * sufficient regardless of extent state (written, unwritten, or hole).
> + * truncate/punch_hole cannot run while we hold the shared i_rwsem
> + * (they need it exclusively); after we release it, inode_dio_begin()
> + * keeps their inode_dio_wait() blocked until in-flight bios complete.
> + * i_data_sem serializes concurrent extent tree modifications.
> + *
> + * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
> + * only safe for pure written-extent overwrites. Unwritten extents or
> + * holes require exclusive lock because concurrent partial block zeroing
> + * in the DIO layer could corrupt data.
> */
> static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> bool *ilock_shared, bool *extend,
> @@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> loff_t offset;
> size_t count;
> ssize_t ret;
> - bool overwrite, unaligned_io, unwritten;
> + bool overwrite = true, unaligned_io, unwritten = false;
>
> restart:
> ret = ext4_generic_write_checks(iocb, from);
> @@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>
> unaligned_io = ext4_unaligned_io(inode, from, offset);
> *extend = ext4_extending_io(inode, offset, count);
> - overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
>
> /*
> - * Determine whether we need to upgrade to an exclusive lock. This is
> - * required to change security info in file_modified(), for extending
> - * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
> - * extents (as partial block zeroing may be required).
> - *
> - * Note that unaligned writes are allowed under shared lock so long as
> - * they are pure overwrites. Otherwise, concurrent unaligned writes risk
> - * data corruption due to partial block zeroing in the dio layer, and so
> - * the I/O must occur exclusively.
> + * For unaligned writes we need to know the extent state to determine
> + * whether shared lock is safe. For aligned writes we skip this check
> + * entirely since allocation under shared lock is safe.
> */
> + if (unaligned_io)
> + overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
> +
> + /* Determine whether we need to upgrade to an exclusive lock. */
> if (*ilock_shared &&
> - ((!IS_NOSEC(inode) || *extend || !overwrite ||
> - (unaligned_io && unwritten)))) {
> + ((!IS_NOSEC(inode) || *extend ||
> + (unaligned_io && (!overwrite || unwritten))))) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> ret = -EAGAIN;
> goto out;
> @@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> * Now that locking is settled, determine dio flags and exclusivity
> * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
> * behavior already. The inode lock is already held exclusive if the
> - * write is non-overwrite or extending, so drain all outstanding dio and
> - * set the force wait dio flag.
> + * write is unaligned non-overwrite or extending, so drain all
> + * outstanding dio and set the force wait dio flag.
> */
> if (!*ilock_shared && (unaligned_io || *extend)) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> --
> 2.43.7
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [PATCH v2] ext4: defer iput() in ext4_xattr_block_set() to avoid deadlock with writepages
From: Yun Zhou @ 2026-06-12 9:58 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang, ebiggers, yun.zhou
Cc: linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>
ext4_xattr_block_set() calls iput() on ea_inode while its callers hold
xattr_sem. If this iput() drops the last reference, it can trigger
write_inode_now() -> ext4_writepages() -> s_writepages_rwsem, which
violates the lock ordering since ext4_writepages() already establishes
s_writepages_rwsem -> jbd2_handle ordering:

  CPU0 (writeback worker)            CPU1 (file create)
  ----                               ----
  ext4_writepages()
    s_writepages_rwsem (read)        ext4_create()
    ext4_do_writepages()               __ext4_new_inode()
      ext4_journal_start()               [holds jbd2 handle]
        wait_transaction_locked()        ext4_xattr_set_handle()
        [WAIT for jbd2_handle]             xattr_sem (write)

  CPU2 (xattr set or isize expand)
  ----
  ext4_xattr_set_handle() or ext4_try_to_expand_extra_isize()
    xattr_sem (write)
      ext4_xattr_block_set()
        iput(ea_inode)
          write_inode_now()
            ext4_writepages()
              s_writepages_rwsem (read) [DEADLOCK]

This forms a circular dependency on lock classes:
s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
Fix by deferring iput() calls inside ext4_xattr_block_set() via the
existing ext4_xattr_inode_array mechanism. The array is threaded
through the call chain and freed by callers after releasing xattr_sem.
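
The deferral described above can be illustrated with a minimal user-space sketch. It is not the ext4 code: struct obj, struct defer_array, defer_put(), defer_array_free() and the hand-rolled lock flag are hypothetical stand-ins for the inode, ext4_xattr_inode_array, the queueing helper and xattr_sem. The only point it makes is that the final reference drop is queued inside the critical section and performed after the lock is released.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_MAX 8	/* arbitrary capacity chosen for the sketch */

/* Hypothetical stand-in for an inode with a reference count. */
struct obj {
	int refs;
	int final_put_under_lock;	/* 1 if the last put ran with the lock held */
};

/* Hypothetical stand-in for struct ext4_xattr_inode_array. */
struct defer_array {
	size_t count;
	struct obj *items[DEFER_MAX];
};

/* Single-threaded model of xattr_sem: a flag is enough to observe ordering. */
static int xattr_lock_held;

static void xattr_lock(void)
{
	assert(!xattr_lock_held);
	xattr_lock_held = 1;
}

static void xattr_unlock(void)
{
	assert(xattr_lock_held);
	xattr_lock_held = 0;
}

/* Drop one reference; the final put is the step that may take other locks. */
static void obj_put(struct obj *o)
{
	if (--o->refs == 0)
		o->final_put_under_lock = xattr_lock_held;
}

/* Queue a put instead of doing it now. Returns 0, or -1 if the array is full. */
static int defer_put(struct defer_array *a, struct obj *o)
{
	if (a->count == DEFER_MAX)
		return -1;
	a->items[a->count++] = o;
	return 0;
}

/* Drop every queued reference. Must be called without the lock held. */
static void defer_array_free(struct defer_array *a)
{
	assert(!xattr_lock_held);
	for (size_t i = 0; i < a->count; i++)
		obj_put(a->items[i]);
	a->count = 0;
}

/* Problematic shape: the final put runs inside the critical section. */
static void update_put_inline(struct obj *o)
{
	xattr_lock();
	obj_put(o);
	xattr_unlock();
}

/* Deferred shape: the put is queued under the lock and done after unlock. */
static int update_put_deferred(struct obj *o)
{
	struct defer_array deferred = { 0 };
	int err;

	xattr_lock();
	err = defer_put(&deferred, o);
	xattr_unlock();

	defer_array_free(&deferred);
	return err;
}
```

In this model update_put_inline() records that the last reference went away with the lock held, while update_put_deferred() records that it went away after the unlock, which is the ordering the patch aims for.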
Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2: Defer iput() in ext4_xattr_block_set() via ea_inode_array,
freed after xattr_sem is released. Fixes the root cause.
v1: Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
on inodes being deleted. Only fixes the syzbot reproducer, not
the underlying lock ordering violation.
fs/ext4/inode.c | 15 +++++++++++----
fs/ext4/xattr.c | 40 +++++++++++++++++++++++++---------------
fs/ext4/xattr.h | 3 ++-
3 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..c6448a9eb1e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6408,7 +6408,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
static int __ext4_expand_extra_isize(struct inode *inode,
unsigned int new_extra_isize,
struct ext4_iloc *iloc,
- handle_t *handle, int *no_expand)
+ handle_t *handle, int *no_expand,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_inode *raw_inode;
struct ext4_xattr_ibody_header *header;
@@ -6453,7 +6454,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
/* try to expand with EAs present */
error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
- raw_inode, handle);
+ raw_inode, handle, ea_inode_array);
if (error) {
/*
* Inode size expansion failed; don't try again
@@ -6475,6 +6476,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
{
int no_expand;
int error;
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
return -EOVERFLOW;
@@ -6496,8 +6498,10 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
return -EBUSY;
error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
- handle, &no_expand);
+ handle, &no_expand,
+ &ea_inode_array);
ext4_write_unlock_xattr(inode, &no_expand);
+ ext4_xattr_inode_array_free(ea_inode_array);
return error;
}
@@ -6509,6 +6513,7 @@ int ext4_expand_extra_isize(struct inode *inode,
handle_t *handle;
int no_expand;
int error, rc;
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
brelse(iloc->bh);
@@ -6534,7 +6539,8 @@ int ext4_expand_extra_isize(struct inode *inode,
}
error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
- handle, &no_expand);
+ handle, &no_expand,
+ &ea_inode_array);
rc = ext4_mark_iloc_dirty(handle, inode, iloc);
if (!error)
@@ -6542,6 +6548,7 @@ int ext4_expand_extra_isize(struct inode *inode,
out_unlock:
ext4_write_unlock_xattr(inode, &no_expand);
+ ext4_xattr_inode_array_free(ea_inode_array);
ext4_journal_stop(handle);
return error;
}
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e91af66db7a7..bf8424927383 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1906,7 +1906,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
static int
ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
- struct ext4_xattr_block_find *bs)
+ struct ext4_xattr_block_find *bs,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct super_block *sb = inode->i_sb;
struct buffer_head *new_bh = NULL;
@@ -2158,7 +2159,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
ext4_warning_inode(ea_inode,
"dec ref error=%d",
error);
- iput(ea_inode);
+ ext4_expand_inode_array(ea_inode_array,
+ ea_inode);
ea_inode = NULL;
}
@@ -2190,12 +2192,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
/* Drop the previous xattr block. */
if (bs->bh && bs->bh != new_bh) {
- struct ext4_xattr_inode_array *ea_inode_array = NULL;
+ struct ext4_xattr_inode_array *old_ea_inode_array = NULL;
ext4_xattr_release_block(handle, inode, bs->bh,
- &ea_inode_array,
+ &old_ea_inode_array,
0 /* extra_credits */);
- ext4_xattr_inode_array_free(ea_inode_array);
+ ext4_xattr_inode_array_free(old_ea_inode_array);
}
error = 0;
@@ -2211,7 +2213,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
ext4_xattr_inode_free_quota(inode, ea_inode,
i_size_read(ea_inode));
}
- iput(ea_inode);
+ ext4_expand_inode_array(ea_inode_array, ea_inode);
}
if (ce)
mb_cache_entry_put(ea_block_cache, ce);
@@ -2371,6 +2373,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
struct ext4_xattr_block_find bs = {
.s = { .not_found = -ENODATA, },
};
+ struct ext4_xattr_inode_array *ea_inode_array = NULL;
int no_expand;
int error;
@@ -2438,7 +2441,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
if (!is.s.not_found)
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
else if (!bs.s.not_found)
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ &ea_inode_array);
} else {
error = 0;
/* Xattr value did not change? Save us some work and bail out */
@@ -2455,7 +2459,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (!error && !bs.s.not_found) {
i.value = NULL;
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ &ea_inode_array);
} else if (error == -ENOSPC) {
if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
brelse(bs.bh);
@@ -2464,7 +2469,8 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
if (error)
goto cleanup;
}
- error = ext4_xattr_block_set(handle, inode, &i, &bs);
+ error = ext4_xattr_block_set(handle, inode, &i, &bs,
+ &ea_inode_array);
if (!error && !is.s.not_found) {
i.value = NULL;
error = ext4_xattr_ibody_set(handle, inode, &i,
@@ -2503,6 +2509,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
brelse(is.iloc.bh);
brelse(bs.bh);
ext4_write_unlock_xattr(inode, &no_expand);
+ ext4_xattr_inode_array_free(ea_inode_array);
return error;
}
@@ -2612,7 +2619,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
*/
static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
struct ext4_inode *raw_inode,
- struct ext4_xattr_entry *entry)
+ struct ext4_xattr_entry *entry,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_find *is = NULL;
struct ext4_xattr_block_find *bs = NULL;
@@ -2676,7 +2684,7 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
goto out;
/* Move ea entry from the inode into the block */
- error = ext4_xattr_block_set(handle, inode, &i, bs);
+ error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
if (error)
goto out;
@@ -2702,7 +2710,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
struct ext4_inode *raw_inode,
int isize_diff, size_t ifree,
- size_t bfree, int *total_ino)
+ size_t bfree, int *total_ino,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
struct ext4_xattr_entry *small_entry;
@@ -2752,7 +2761,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
total_size += EXT4_XATTR_SIZE(
le32_to_cpu(entry->e_value_size));
error = ext4_xattr_move_to_block(handle, inode, raw_inode,
- entry);
+ entry, ea_inode_array);
if (error)
return error;
@@ -2769,7 +2778,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
* Returns 0 on success or negative error number on failure.
*/
int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
- struct ext4_inode *raw_inode, handle_t *handle)
+ struct ext4_inode *raw_inode, handle_t *handle,
+ struct ext4_xattr_inode_array **ea_inode_array)
{
struct ext4_xattr_ibody_header *header;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2841,7 +2851,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
isize_diff, ifree, bfree,
- &total_ino);
+ &total_ino, ea_inode_array);
if (error) {
if (error == -ENOSPC && !tried_min_extra_isize &&
s_min_extra_isize) {
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..02a172515193 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -192,7 +192,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
- struct ext4_inode *raw_inode, handle_t *handle);
+ struct ext4_inode *raw_inode, handle_t *handle,
+ struct ext4_xattr_inode_array **ea_inode_array);
extern void ext4_evict_ea_inode(struct inode *inode);
extern const struct xattr_handler * const ext4_xattr_handlers[];
--
2.43.0
* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-12 5:28 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Christoph Hellwig, Keith Busch, brauner, linux-block,
linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
Martin K. Petersen, Jens Axboe
In-Reply-To: <airX6BmMQ14Rvjcb@nidhogg.toxiclabs.cc>
On Thu, Jun 11, 2026 at 05:47:07PM +0200, Carlos Maiolino wrote:
> On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > > It's entirely possible a device supports byte aligned addresses. The
> > > block layer just doesn't let a driver report that. So either it really
> > > was successful because you found a bug that skips the alignment checks,
> > > or your device silently corrupted your payload.
>
> I tried this on different hardware, I find it hard to say all those
> devices were corrupting the payload.
I think in the other thread we agreed that we are currently missing
the alignment check for fast-path bios not hitting the splitting code,
so maybe that is something you see. Additionally we're missing the
checks for purely bio based drivers not calling the splitting helper
at all, but I don't think that applies here.
> > > > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > > though, in not taking the optimization when it was possible. So here's
> > > an alternative suggestion that should get things working as expected:
> >
> > The fix below looks like it is addressing a real bug. I'm not sure if
> > Carlos is hitting it, but we were missing the alignment checks for
> > single-bvec fast path bios so far indeed.
>
> You left context out so I'm assuming by the fix you meant Keith's patch.
Yes.
* [PATCH v3] ext4: fix circular lock dependency in ext4_ext_migrate
From: Yun Zhou @ 2026-06-12 0:53 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang, ebiggers, yun.zhou
Cc: linux-ext4, linux-kernel
In-Reply-To: <20260609084007.3432061-1-yun.zhou@windriver.com>
Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
circular lock dependency between s_writepages_rwsem and sb_internal
(freeze protection).
The deadlock scenario:

  CPU0 (EXT4_IOC_MIGRATE)              CPU1 (orphan cleanup during mount)
  ----                                 ----
  ext4_ext_migrate()
    ext4_writepages_down_write()
      s_writepages_rwsem (write)
                                       ext4_evict_inode()
                                         sb_start_intwrite() [sb_internal]
                                         ...
                                         ext4_writepages()
                                           s_writepages_rwsem (read) [BLOCKED]
    iput(tmp_inode)
      ext4_evict_inode()
        sb_start_intwrite() [BLOCKED]

The tmp_inode is a temporary inode with nlink=0 created solely for
building the extent tree. Its eviction does not require
s_writepages_rwsem protection, so deferring iput() until after
releasing the rwsem is safe.
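
As a rough illustration of why moving the iput() breaks the cycle, the two lock classes can be modelled as a tiny dependency graph, loosely in the spirit of what lockdep tracks. This is a user-space sketch under stated assumptions: struct lock_graph, add_dependency(), has_cycle() and build_graph() are hypothetical names, and the edges are taken only from the scenario described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The two lock classes named in the commit message. */
enum { S_WRITEPAGES_RWSEM, SB_INTERNAL, NR_CLASSES };

/* edge[a][b] means "b was acquired while a was held". */
struct lock_graph {
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void add_dependency(struct lock_graph *g, int held, int acquired)
{
	g->edge[held][acquired] = true;
}

/* Depth-first search: can 'target' be reached from 'from' along edges? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *seen)
{
	for (int next = 0; next < NR_CLASSES; next++) {
		if (!g->edge[from][next])
			continue;
		if (next == target)
			return true;
		if (!seen[next]) {
			seen[next] = true;
			if (reaches(g, next, target, seen))
				return true;
		}
	}
	return false;
}

/* A class that can reach itself is part of a circular dependency. */
static bool has_cycle(const struct lock_graph *g)
{
	for (int c = 0; c < NR_CLASSES; c++) {
		bool seen[NR_CLASSES] = { false };

		if (reaches(g, c, c, seen))
			return true;
	}
	return false;
}

/* Build the graph for the scenario, with or without iput() under the rwsem. */
static struct lock_graph build_graph(bool iput_under_rwsem)
{
	struct lock_graph g;

	memset(&g, 0, sizeof(g));
	/* Eviction path: sb_internal is held, then writepages takes the rwsem. */
	add_dependency(&g, SB_INTERNAL, S_WRITEPAGES_RWSEM);
	/* Migrate path before the fix: iput() evicts while the rwsem is held. */
	if (iput_under_rwsem)
		add_dependency(&g, S_WRITEPAGES_RWSEM, SB_INTERNAL);
	return g;
}
```

With iput_under_rwsem set the graph contains both edges and has_cycle() reports a cycle; with the iput() moved after the unlock only the eviction edge remains and no cycle is found.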
Reported-by: syzbot+212e8f62790f8e0bc63b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
v3: fixes Reported-by tag and Closes tag.
v2: remove redundant null pointer check for iput(tmp_inode).
fs/ext4/migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 477d43d7e294..5d60ef10fe11 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -464,6 +464,7 @@ int ext4_ext_migrate(struct inode *inode)
if (IS_ERR(tmp_inode)) {
retval = PTR_ERR(tmp_inode);
ext4_journal_stop(handle);
+ tmp_inode = NULL;
goto out_unlock;
}
/*
@@ -591,9 +592,9 @@ int ext4_ext_migrate(struct inode *inode)
ext4_journal_stop(handle);
out_tmp_inode:
unlock_new_inode(tmp_inode);
- iput(tmp_inode);
out_unlock:
ext4_writepages_up_write(inode->i_sb, alloc_ctx);
+ iput(tmp_inode);
return retval;
}
--
2.43.0
* [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>
Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.
After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.
Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:
- truncate/punch_hole are kept out: they require exclusive i_rwsem
(blocked by our shared lock during allocation), and inode_dio_begin()
keeps their inode_dio_wait() blocked until in-flight bios complete.
- i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
extent tree modifications (parallel writers to the same hole).
- The journal handle is per-thread and does not require i_rwsem
exclusion.
- i_disksize and orphan list are not involved in non-extending writes.
Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
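
The change in the lock decision can be sketched as two boolean predicates transcribed from the old and new conditions in the diff below. struct dio_write_state and its field names are hypothetical; they only mirror the local variables of ext4_dio_write_checks() (nosec stands for IS_NOSEC(inode)). This is an illustration of the logic, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the inputs to the lock decision. */
struct dio_write_state {
	bool nosec;	/* IS_NOSEC(inode): no security update needed */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block aligned */
	bool overwrite;	/* ext4_overwrite_io() result */
	bool unwritten;	/* range reported as unwritten */
};

/* Old condition: any non-overwrite forces the exclusive lock. */
static bool needs_exclusive_before(const struct dio_write_state *s)
{
	return !s->nosec || s->extend || !s->overwrite ||
	       (s->unaligned && s->unwritten);
}

/* New condition: extent state only matters for unaligned writes. */
static bool needs_exclusive_after(const struct dio_write_state *s)
{
	return !s->nosec || s->extend ||
	       (s->unaligned && (!s->overwrite || s->unwritten));
}
```

For an aligned, non-extending write where the single-lookup check reports a non-overwrite (for example a range spanning two extents), the old predicate asks for the exclusive lock and the new one does not. Unaligned and extending writes are treated the same by both.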
Performance:
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS    Before      After   speedup
  ----  --------  ---------   -------
     1    42,322     43,329     1.02x
     2    68,516     70,677     1.03x
     4    62,489     97,072     1.55x
     8    58,701    110,819     1.89x
    16    58,569    116,392     1.99x
    32    60,860    117,244     1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..6f3886465ce3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* condition requires an exclusive inode lock. If yes, then we restart the
* whole operation by releasing the shared lock and acquiring exclusive lock.
*
- * - For unaligned_io we never take shared lock as it may cause data corruption
- * when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
*
- * - For extending writes case we don't take the shared lock, since it requires
- * updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ * to the exclusive lock -- the security update itself requires it,
+ * regardless of whether the write extends the file or is aligned.
*
- * - shared locking will only be true mostly with overwrites, including
- * initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ * lock to safely update i_disksize and the orphan list, regardless of
+ * alignment.
*
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ * sufficient regardless of extent state (written, unwritten, or hole).
+ * truncate/punch_hole cannot run while we hold the shared i_rwsem
+ * (they need it exclusively); after we release it, inode_dio_begin()
+ * keeps their inode_dio_wait() blocked until in-flight bios complete.
+ * i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ * only safe for pure written-extent overwrites. Unwritten extents or
+ * holes require exclusive lock because concurrent partial block zeroing
+ * in the DIO layer could corrupt data.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
bool *ilock_shared, bool *extend,
@@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
loff_t offset;
size_t count;
ssize_t ret;
- bool overwrite, unaligned_io, unwritten;
+ bool overwrite = true, unaligned_io, unwritten = false;
restart:
ret = ext4_generic_write_checks(iocb, from);
@@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
- overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
/*
- * Determine whether we need to upgrade to an exclusive lock. This is
- * required to change security info in file_modified(), for extending
- * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
- * extents (as partial block zeroing may be required).
- *
- * Note that unaligned writes are allowed under shared lock so long as
- * they are pure overwrites. Otherwise, concurrent unaligned writes risk
- * data corruption due to partial block zeroing in the dio layer, and so
- * the I/O must occur exclusively.
+ * For unaligned writes we need to know the extent state to determine
+ * whether shared lock is safe. For aligned writes we skip this check
+ * entirely since allocation under shared lock is safe.
*/
+ if (unaligned_io)
+ overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+ /* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
- ((!IS_NOSEC(inode) || *extend || !overwrite ||
- (unaligned_io && unwritten)))) {
+ ((!IS_NOSEC(inode) || *extend ||
+ (unaligned_io && (!overwrite || unwritten))))) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
@@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
* Now that locking is settled, determine dio flags and exclusivity
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
* behavior already. The inode lock is already held exclusive if the
- * write is non-overwrite or extending, so drain all outstanding dio and
- * set the force wait dio flag.
+ * write is unaligned non-overwrite or extending, so drain all
+ * outstanding dio and set the force wait dio flag.
*/
if (!*ilock_shared && (unaligned_io || *extend)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
--
2.43.7
* [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>
For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent. This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.
Therefore exclusive lock is only required when partial block zeroing
will actually happen:
- The head partial block (if any) lands on a hole or unwritten extent.
- The tail partial block (if any) lands on a hole or unwritten extent.
Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).
Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision. It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.
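
The head/tail check can be modelled in user space with a toy file whose blocks are tagged as hole, unwritten or written. The sketch below follows the structure of ext4_dio_needs_zeroing() in the diff, but struct toy_file, toy_map_blocks() and toy_needs_zeroing() are hypothetical: toy_map_blocks() only imitates the "first contiguous run of one type" behaviour that the commit message attributes to ext4_map_blocks(), and the early return for a zero-length write is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/* Toy file: one state per block, size is nr_blocks << blkbits. */
struct toy_file {
	const enum blk_state *blocks;
	size_t nr_blocks;
	unsigned int blkbits;
};

/*
 * Return the length of the run of blocks that share the state of 'lblk',
 * capped at max_len, and report that state.
 */
static size_t toy_map_blocks(const struct toy_file *f, size_t lblk,
			     size_t max_len, enum blk_state *state)
{
	size_t n = 0;

	*state = f->blocks[lblk];
	while (n < max_len && lblk + n < f->nr_blocks &&
	       f->blocks[lblk + n] == *state)
		n++;
	return n;
}

/* Does a write of 'len' bytes at 'pos' touch a partial block that is not written? */
static bool toy_needs_zeroing(const struct toy_file *f,
			      unsigned long long pos, unsigned long long len)
{
	unsigned long long blockmask = (1ULL << f->blkbits) - 1;
	unsigned long long size = (unsigned long long)f->nr_blocks << f->blkbits;
	bool head_partial, tail_partial;
	size_t head_lblk, tail_lblk, run;
	enum blk_state state;

	if (len == 0)		/* assumption of the sketch: nothing to zero */
		return false;
	if (pos + len > size)	/* beyond EOF: stay conservative */
		return true;

	head_partial = (pos & blockmask) != 0;
	tail_partial = ((pos + len) & blockmask) != 0;
	head_lblk = pos >> f->blkbits;
	tail_lblk = (pos + len - 1) >> f->blkbits;

	if (head_partial) {
		run = toy_map_blocks(f, head_lblk, tail_lblk - head_lblk + 1,
				     &state);
		if (state != BLK_WRITTEN)
			return true;
		/* The written run already covers the tail block. */
		if (!tail_partial || head_lblk + run > tail_lblk)
			return false;
	}

	if (tail_partial) {
		toy_map_blocks(f, tail_lblk, 1, &state);
		if (state != BLK_WRITTEN)
			return true;
	}

	return false;
}
```

With 4K blocks and a [written][unwritten][unwritten][written] stripe, a 14336-byte write at offset 512 has its head in block 0 and its tail in block 3, both written, so the model reports that no zeroing is needed even though the middle blocks are unwritten.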
This enables shared lock for previously-rejected scenarios such as:
- Unaligned write spanning written extent + mid-range hole + written
extent at the tail.
- Unaligned write where the partial blocks land on written extents but
the middle has unwritten extents.
Performance:
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten. Metric: IOPS.

  JOBS    Before      After   speedup
  ----  --------  ---------   -------
     1    15,547     17,381     1.12x
     2    15,910     34,172     2.15x
     4    15,014     57,567     3.83x
     8    15,022     81,947     5.46x
    16    14,586     99,126     6.80x
    32    14,047     92,519     6.59x

Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
1 file changed, 73 insertions(+), 35 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f3886465ce3..aa926e641739 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
return false;
}
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
- loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
{
struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
- int err, blklen;
+ unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+ bool head_partial, tail_partial;
+ ext4_lblk_t head_lblk, tail_lblk;
+ int err;
if (pos + len > i_size_read(inode))
- return false;
+ return true;
- map.m_lblk = pos >> blkbits;
- map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
- blklen = map.m_len;
+ head_partial = (pos & blockmask) != 0;
+ tail_partial = ((pos + len) & blockmask) != 0;
+ head_lblk = pos >> blkbits;
+ tail_lblk = (pos + len - 1) >> blkbits;
+
+ /* Check the head partial block. */
+ if (head_partial) {
+ map.m_lblk = head_lblk;
+ map.m_len = tail_lblk - head_lblk + 1;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+ return true;
+ /* If this mapping already covers the tail block, we're done. */
+ if (!tail_partial || map.m_lblk + err > tail_lblk)
+ return false;
+ }
- err = ext4_map_blocks(NULL, inode, &map, 0);
- if (err != blklen)
- return false;
- /*
- * 'err==len' means that all of the blocks have been preallocated,
- * regardless of whether they have been initialized or not. We need to
- * check m_flags to distinguish the unwritten extents.
- */
- *unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
- return true;
+ /* Check the tail partial block. */
+ if (tail_partial) {
+ map.m_lblk = tail_lblk;
+ map.m_len = 1;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+ return true;
+ }
+
+ return false;
}
static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* i_data_sem serializes concurrent extent tree modifications.
*
* 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- * only safe for pure written-extent overwrites. Unwritten extents or
- * holes require exclusive lock because concurrent partial block zeroing
- * in the DIO layer could corrupt data.
+ * safe unless the DIO layer needs to perform partial block zeroing --
+ * i.e. the head or tail partial block sits on a hole or unwritten
+ * extent. In that case upgrade to the exclusive lock and drain
+ * in-flight DIO to avoid races with concurrent partial block zeroing.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
bool *ilock_shared, bool *extend,
@@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
loff_t offset;
size_t count;
ssize_t ret;
- bool overwrite = true, unaligned_io, unwritten = false;
+ bool needs_zeroing = false;
restart:
ret = ext4_generic_write_checks(iocb, from);
@@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
offset = iocb->ki_pos;
count = ret;
- unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
/*
- * For unaligned writes we need to know the extent state to determine
- * whether shared lock is safe. For aligned writes we skip this check
- * entirely since allocation under shared lock is safe.
+ * For unaligned writes, check whether partial block zeroing will be
+ * needed. If so, exclusive lock is required to serialize against
+ * concurrent DIO that could race with the zeroing.
+ *
+ * For aligned writes we skip this check entirely since allocation
+ * under shared lock is safe.
*/
- if (unaligned_io)
- overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+ if (ext4_unaligned_io(inode, from, offset))
+ needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
/* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
- ((!IS_NOSEC(inode) || *extend ||
- (unaligned_io && (!overwrite || unwritten))))) {
+ (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
@@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
/*
* Now that locking is settled, determine dio flags and exclusivity
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
- * behavior already. The inode lock is already held exclusive if the
- * write is unaligned non-overwrite or extending, so drain all
- * outstanding dio and set the force wait dio flag.
+ * behavior already. When holding the exclusive lock for a write that
+ * needs partial block zeroing or is extending the file, we must wait
+ * for the I/O to complete synchronously:
+ *
+ * - needs_zeroing: drain in-flight DIO whose end_io could race with
+ * our partial block zeroing, and force synchronous completion so we
+ * don't leave in-flight zeroing bios for the next writer to drain.
+ *
+ * - extend: the caller must update i_disksize after I/O completion,
+ * which requires the data to be on disk first.
*/
- if (!*ilock_shared && (unaligned_io || *extend)) {
+ if (!*ilock_shared && (needs_zeroing || *extend)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
}
- if (unaligned_io && (!overwrite || unwritten))
+ if (needs_zeroing)
inode_dio_wait(inode);
*dio_flags = IOMAP_DIO_FORCE_WAIT;
}
--
2.43.7
* [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem
From: Baokun Li @ 2026-06-11 16:34 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
Hi all,
This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.
It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.
ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):
1. Aligned writes spanning multiple already-allocated extents (e.g.
written + unwritten, or two discontiguous written extents).
2. Unaligned writes whose head/tail partial blocks land on written
extents but the fully-covered middle blocks include hole or
unwritten extents.
Patch 1 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.
Patch 2 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.
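To make the head/tail check concrete, here is a minimal userspace sketch in
C. It is not the code from patch 2: find_partial_blocks(),
write_needs_zeroing(), struct partial_blocks and stripe_block_is_written()
are hypothetical names, a block-state callback stands in for the
ext4_map_blocks() lookups, and the 4K block size is an assumption chosen to
match the 16K stripe layout used in test 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which partial blocks a write touches. */
struct partial_blocks {
	bool has_head;		/* write starts in the middle of a block */
	bool has_tail;		/* write ends in the middle of a block */
	uint64_t head_blk;	/* logical index of the head partial block */
	uint64_t tail_blk;	/* logical index of the tail partial block */
};

/* Locate the (at most two) partially covered blocks of a write. */
static struct partial_blocks find_partial_blocks(uint64_t offset,
						 uint64_t count,
						 uint32_t blocksize)
{
	struct partial_blocks pb = { false, false, 0, 0 };
	uint64_t mask = (uint64_t)blocksize - 1;
	uint64_t end = offset + count;

	/* The mask arithmetic only works for power-of-two block sizes. */
	assert(blocksize != 0 && (blocksize & mask) == 0);
	assert(count > 0);

	if (offset & mask) {
		pb.has_head = true;
		pb.head_blk = offset / blocksize;
	}
	if (end & mask) {
		pb.has_tail = true;
		pb.tail_blk = end / blocksize;
	}
	return pb;
}

/*
 * Model of the lock decision input: zeroing is needed only if a partial
 * head or tail block is not already written. Fully covered middle blocks
 * are never inspected. is_written() is a stand-in for an extent lookup.
 */
static bool write_needs_zeroing(uint64_t offset, uint64_t count,
				uint32_t blocksize,
				bool (*is_written)(uint64_t blk))
{
	struct partial_blocks pb = find_partial_blocks(offset, count, blocksize);

	if (pb.has_head && !is_written(pb.head_blk))
		return true;
	if (pb.has_tail && !is_written(pb.tail_blk))
		return true;
	return false;
}

/* Assumed layout: [written][unwritten][unwritten][written] per 4 blocks. */
static bool stripe_block_is_written(uint64_t blk)
{
	return (blk % 4) == 0 || (blk % 4) == 3;
}
```

Under these assumptions, a 14336-byte write at offset 512 has its head in
block 0 and its tail in block 3, both written, so
write_needs_zeroing(512, 14336, 4096, stripe_block_is_written) is false even
though blocks 1 and 2 are unwritten.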
Testing
=======
"kvm-xfstests -c ext4/all -g auto" passes with no new failures.
Performance
===========
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.
JOBS       base   +patch 1   +patch 1+2   speedup
----  ---------   --------   ----------   -------
   1     42,322     43,329       43,087     1.02x
   2     68,516     70,677       66,958     1.03x
   4     62,489     97,072      101,468     1.62x
   8     58,701    110,819      113,679     1.94x
  16     58,569    116,392      115,272     1.97x
  32     60,860    117,244      119,621     1.97x
Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.
Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.
JOBS       base   +patch 1   +patch 1+2   speedup
----  ---------   --------   ----------   -------
   1     15,547     15,975       17,381     1.12x
   2     15,910     14,808       34,172     2.15x
   4     15,014     14,828       57,567     3.83x
   8     15,022     14,648       81,947     5.46x
  16     14,586     14,262       99,126     6.80x
  32     14,047     13,809       92,519     6.59x
Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.
In test 2, patch 1 alone has no effect (slight noise) because patch 1
only touches the aligned write path. Patch 2 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.
Comments and questions are, as always, welcome.
Thanks,
Baokun
[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com
Baokun Li (2):
ext4: skip overwrite check for aligned non-extending DIO writes
ext4: base unaligned DIO lock decision on partial block zeroing
fs/ext4/file.c | 132 +++++++++++++++++++++++++++++++++----------------
1 file changed, 89 insertions(+), 43 deletions(-)
--
2.43.7
* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 15:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611133833.GA14645@lst.de>
On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > It's entirely possible a device supports byte aligned addresses. The
> > block layer just doesn't let a driver report that. So either it really
> > was successful because you found a bug that skips the alignment checks,
> > or your device silently corrupted your payload.
I tried this on different hardware, and I find it hard to believe that all
of those devices were corrupting the payload.
> >
> > Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> > though, in not taking the optimization when it was possible. So here's
> > an alternative suggestion that should get things working as expected:
>
> The fix below looks like it is addressing a real bug. I'm not sure if
> Carlos is hitting it, but we were missing the alignment checks for
> single-bvec fast path bios so far indeed.
You left context out so I'm assuming by the fix you meant Keith's patch.
I can give it a spin and see if it fixes the behavior I'm talking
about. Give me some time as I have a bunch of stuff to do tonight so
likely I will only manage to try this tomorrow.
* [tytso-ext4:dev] BUILD SUCCESS c143957520c6c9b5cd72e0de8b52b814f0c576fe
From: kernel test robot @ 2026-06-11 14:52 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: c143957520c6c9b5cd72e0de8b52b814f0c576fe ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
elapsed time: 772m
configs tested: 195
configs skipped: 2
The following configs have been built successfully.
More configs may be tested in the coming days.
tested configs:
alpha allnoconfig gcc-16.1.0
alpha allyesconfig gcc-16.1.0
alpha defconfig gcc-16.1.0
arc allmodconfig clang-23
arc allmodconfig gcc-16.1.0
arc allnoconfig gcc-16.1.0
arc allyesconfig clang-23
arc allyesconfig gcc-16.1.0
arc defconfig gcc-16.1.0
arc nsim_700_defconfig gcc-16.1.0
arc randconfig-001 gcc-14.3.0
arc randconfig-001-20260611 gcc-14.3.0
arc randconfig-002 gcc-14.3.0
arc randconfig-002-20260611 gcc-14.3.0
arm allnoconfig clang-23
arm allnoconfig gcc-16.1.0
arm allyesconfig clang-23
arm allyesconfig gcc-16.1.0
arm defconfig gcc-16.1.0
arm pxa910_defconfig gcc-16.1.0
arm randconfig-001 gcc-14.3.0
arm randconfig-001-20260611 gcc-14.3.0
arm randconfig-002 gcc-14.3.0
arm randconfig-002-20260611 gcc-14.3.0
arm randconfig-003 gcc-14.3.0
arm randconfig-003-20260611 gcc-14.3.0
arm randconfig-004 gcc-14.3.0
arm randconfig-004-20260611 gcc-14.3.0
arm64 allmodconfig clang-23
arm64 allnoconfig gcc-16.1.0
arm64 defconfig gcc-16.1.0
csky allmodconfig gcc-16.1.0
csky allnoconfig gcc-16.1.0
csky defconfig gcc-16.1.0
hexagon allmodconfig clang-23
hexagon allmodconfig gcc-16.1.0
hexagon allnoconfig clang-23
hexagon allnoconfig gcc-16.1.0
hexagon defconfig gcc-16.1.0
hexagon randconfig-001-20260611 clang-16
hexagon randconfig-002-20260611 clang-16
i386 allmodconfig gcc-14
i386 allnoconfig gcc-14
i386 allnoconfig gcc-16.1.0
i386 allyesconfig gcc-14
i386 buildonly-randconfig-001 clang-22
i386 buildonly-randconfig-001-20260611 clang-22
i386 buildonly-randconfig-002 clang-22
i386 buildonly-randconfig-002-20260611 clang-22
i386 buildonly-randconfig-003 clang-22
i386 buildonly-randconfig-003-20260611 clang-22
i386 buildonly-randconfig-004 clang-22
i386 buildonly-randconfig-004-20260611 clang-22
i386 buildonly-randconfig-005 clang-22
i386 buildonly-randconfig-005-20260611 clang-22
i386 buildonly-randconfig-006 clang-22
i386 buildonly-randconfig-006-20260611 clang-22
i386 defconfig gcc-16.1.0
i386 randconfig-001-20260611 gcc-14
i386 randconfig-002-20260611 gcc-14
i386 randconfig-003-20260611 gcc-14
i386 randconfig-004-20260611 gcc-14
i386 randconfig-005-20260611 gcc-14
i386 randconfig-006-20260611 gcc-14
i386 randconfig-007-20260611 gcc-14
i386 randconfig-011-20260611 gcc-14
i386 randconfig-012-20260611 gcc-14
i386 randconfig-013-20260611 gcc-14
i386 randconfig-014-20260611 gcc-14
i386 randconfig-015-20260611 gcc-14
i386 randconfig-016-20260611 gcc-14
i386 randconfig-017-20260611 gcc-14
loongarch allmodconfig clang-19
loongarch allmodconfig clang-23
loongarch allnoconfig clang-20
loongarch allnoconfig gcc-16.1.0
loongarch defconfig clang-23
loongarch randconfig-001-20260611 clang-16
loongarch randconfig-002-20260611 clang-16
m68k allmodconfig gcc-16.1.0
m68k allnoconfig gcc-16.1.0
m68k allyesconfig clang-23
m68k allyesconfig gcc-16.1.0
m68k defconfig clang-23
microblaze allnoconfig gcc-16.1.0
microblaze allyesconfig gcc-16.1.0
microblaze defconfig clang-23
mips allmodconfig gcc-16.1.0
mips allnoconfig gcc-16.1.0
mips allyesconfig gcc-16.1.0
nios2 allmodconfig clang-20
nios2 allmodconfig gcc-11.5.0
nios2 allnoconfig clang-23
nios2 allnoconfig gcc-11.5.0
nios2 defconfig clang-23
nios2 randconfig-001-20260611 clang-16
nios2 randconfig-002-20260611 clang-16
openrisc allmodconfig clang-20
openrisc allmodconfig gcc-16.1.0
openrisc allnoconfig clang-23
openrisc allnoconfig gcc-16.1.0
openrisc defconfig gcc-16.1.0
parisc allmodconfig gcc-16.1.0
parisc allnoconfig clang-23
parisc allnoconfig gcc-16.1.0
parisc allyesconfig clang-23
parisc allyesconfig gcc-16.1.0
parisc defconfig gcc-16.1.0
parisc64 defconfig clang-23
powerpc allmodconfig gcc-16.1.0
powerpc allnoconfig clang-23
powerpc allnoconfig gcc-16.1.0
powerpc tqm8540_defconfig gcc-16.1.0
riscv allmodconfig clang-23
riscv allnoconfig clang-23
riscv allnoconfig gcc-16.1.0
riscv allyesconfig clang-23
riscv defconfig gcc-16.1.0
riscv randconfig-001-20260611 gcc-12.5.0
riscv randconfig-002-20260611 gcc-12.5.0
s390 allmodconfig clang-23
s390 allnoconfig clang-23
s390 allyesconfig gcc-16.1.0
s390 defconfig gcc-16.1.0
s390 randconfig-001-20260611 gcc-12.5.0
s390 randconfig-002-20260611 gcc-12.5.0
sh allmodconfig gcc-16.1.0
sh allnoconfig clang-23
sh allnoconfig gcc-16.1.0
sh allyesconfig clang-23
sh allyesconfig gcc-16.1.0
sh defconfig gcc-14
sh randconfig-001-20260611 gcc-12.5.0
sh randconfig-002-20260611 gcc-12.5.0
sparc allnoconfig clang-23
sparc allnoconfig gcc-16.1.0
sparc defconfig gcc-16.1.0
sparc randconfig-001-20260611 gcc-15.2.0
sparc randconfig-002-20260611 gcc-15.2.0
sparc64 allmodconfig clang-20
sparc64 defconfig gcc-14
sparc64 randconfig-001-20260611 gcc-15.2.0
sparc64 randconfig-002-20260611 gcc-15.2.0
um allmodconfig clang-23
um allnoconfig clang-16
um allnoconfig clang-23
um allyesconfig gcc-14
um allyesconfig gcc-16.1.0
um defconfig gcc-14
um i386_defconfig gcc-14
um randconfig-001-20260611 gcc-15.2.0
um randconfig-002-20260611 gcc-15.2.0
um x86_64_defconfig gcc-14
x86_64 allmodconfig clang-22
x86_64 allnoconfig clang-22
x86_64 allnoconfig clang-23
x86_64 allyesconfig clang-22
x86_64 buildonly-randconfig-001-20260611 gcc-14
x86_64 buildonly-randconfig-002-20260611 gcc-14
x86_64 buildonly-randconfig-003-20260611 gcc-14
x86_64 buildonly-randconfig-004-20260611 gcc-14
x86_64 buildonly-randconfig-005-20260611 gcc-14
x86_64 buildonly-randconfig-006-20260611 gcc-14
x86_64 defconfig gcc-14
x86_64 kexec clang-22
x86_64 randconfig-001-20260611 gcc-14
x86_64 randconfig-002-20260611 gcc-14
x86_64 randconfig-003-20260611 gcc-14
x86_64 randconfig-004-20260611 gcc-14
x86_64 randconfig-005-20260611 gcc-14
x86_64 randconfig-006-20260611 gcc-14
x86_64 randconfig-011-20260611 gcc-14
x86_64 randconfig-012-20260611 gcc-14
x86_64 randconfig-013-20260611 gcc-14
x86_64 randconfig-014-20260611 gcc-14
x86_64 randconfig-015-20260611 gcc-14
x86_64 randconfig-016-20260611 gcc-14
x86_64 randconfig-071-20260611 clang-22
x86_64 randconfig-072-20260611 clang-22
x86_64 randconfig-073-20260611 clang-22
x86_64 randconfig-074-20260611 clang-22
x86_64 randconfig-075-20260611 clang-22
x86_64 randconfig-076-20260611 clang-22
x86_64 rhel-9.4 clang-22
x86_64 rhel-9.4-bpf gcc-14
x86_64 rhel-9.4-func clang-22
x86_64 rhel-9.4-kselftests clang-22
x86_64 rhel-9.4-kunit gcc-14
x86_64 rhel-9.4-ltp gcc-14
x86_64 rhel-9.4-rust clang-22
xtensa allnoconfig clang-23
xtensa allnoconfig gcc-16.1.0
xtensa allyesconfig clang-20
xtensa randconfig-001-20260611 gcc-15.2.0
xtensa randconfig-002-20260611 gcc-15.2.0
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Jan Kara @ 2026-06-11 14:00 UTC (permalink / raw)
To: Yun Zhou
Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260611124555.1541195-1-yun.zhou@windriver.com>
On Thu 11-06-26 20:45:55, Yun Zhou wrote:
> Expanding extra isize on an inode that is being evicted is pointless
> since the inode is about to be deleted. Skip it by setting
> EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
> eviction path.
>
> This also breaks a circular lock dependency reported by lockdep during
> orphan cleanup at mount time:
>
>   CPU0 (writeback worker)          CPU1 (open)
>   ----                             ----
>   ext4_writepages()
>   s_writepages_rwsem (read)        ext4_create()
>   ext4_do_writepages()             __ext4_new_inode()
>   ext4_journal_start()             [holds jbd2 handle]
>   wait_transaction_locked()        ext4_xattr_set_handle()
>   [WAIT for jbd2_handle]           xattr_sem (write)
>
>   CPU2 (mount / orphan cleanup)
>   ----
>   ext4_evict_inode()
>   __ext4_mark_inode_dirty()
>   ext4_try_to_expand_extra_isize()
>   xattr_sem (write)
>   ext4_expand_extra_isize_ea()
>   ext4_xattr_block_set()
>   iput(ea_inode)
>   write_inode_now()
>   ext4_writepages()
>   s_writepages_rwsem (read)
>   [WAIT for s_writepages_rwsem -- if blocked by write lock holder]
>
> This forms a circular dependency on lock classes:
>
> s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
>
> The iput() inside ext4_xattr_block_set() triggers write_inode_now()
> because SB_ACTIVE is not yet set during mount, so iput_final() cannot
> cache the inode in the LRU and must flush it synchronously.
>
> Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
> from executing, which eliminates the xattr_sem --> s_writepages_rwsem
> edge and breaks the cycle.
>
> Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Thanks for the patch! I have no problem with setting EXT4_STATE_NO_EXPAND
in ext4_evict_inode(); as you correctly point out, expansion is pointless in
that case. But your patch doesn't fix the real problem; it only deals with
the particular syzbot reproducer. The real problem is that
ext4_xattr_block_set(), which runs inside a transaction, can end up
acquiring s_writepages_rwsem, which violates the lock ordering rules. That
is what really needs to be fixed.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-11 13:38 UTC (permalink / raw)
To: Keith Busch
Cc: Carlos Maiolino, Christoph Hellwig, brauner, linux-block,
linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
Martin K. Petersen, Jens Axboe
In-Reply-To: <aiqwy5DfHI79KXuZ@kbusch-mbp>
On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> It's entirely possible a device supports byte aligned addresses. The
> block layer just doesn't let a driver report that. So either it really
> was successful because you found a bug that skips the alignment checks,
> or your device silently corrupted your payload.
>
> Anyway, my earlier suggestion should work. Ming thinks it may go too far,
> though, in not taking the optimization when it was possible. So here's
> an alternative suggestion that should get things working as expected:
The fix below looks like it is addressing a real bug. I'm not sure if
Carlos is hitting it, but we were missing the alignment checks for
single-bvec fast path bios so far indeed.
* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Keith Busch @ 2026-06-11 12:57 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Christoph Hellwig, brauner, linux-block, linux-fsdevel,
linux-ext4, linux-xfs, Hannes Reinecke, Martin K. Petersen,
Jens Axboe
In-Reply-To: <aiqBvF93P4NjfaDR@nidhogg.toxiclabs.cc>
On Thu, Jun 11, 2026 at 12:05:22PM +0200, Carlos Maiolino wrote:
> The passed in address 0x1003af80001 is one byte misaligned and shouldn't
> (at least in theory) ever be accepted no? Or am I missing something
> else?
It's entirely possible a device supports byte aligned addresses. The
block layer just doesn't let a driver report that. So either it really
was successful because you found a bug that skips the alignment checks,
or your device silently corrupted your payload.
Anyway, my earlier suggestion should work. Ming thinks it may go too far,
though, in not taking the optimization when it was possible. So here's
an alternative suggestion that should get things working as expected:
---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..4c31762d6fb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,6 +404,9 @@ static inline bool bio_may_need_split(struct bio *bio,
bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
return true;
+
+ if ((bv->bv_offset | bv->bv_len) & lim->dma_alignment)
+ return true;
return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
}
--
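As an illustration of the check added above, here is a minimal userspace
model in C. It is a sketch, not block-layer code: segment_is_misaligned() is
a hypothetical name, and it assumes that lim->dma_alignment holds a mask of
the form 2^n - 1 (for example 511 for 512-byte alignment), so that OR-ing
the offset and length and AND-ing with the mask catches a misaligned start
or a misaligned length in one test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a single segment violates the alignment mask, either
 * because it starts at a misaligned offset or because its length is not
 * a multiple of the alignment.
 */
static bool segment_is_misaligned(uint32_t bv_offset, uint32_t bv_len,
				  uint32_t dma_alignment_mask)
{
	/* Assumption: the mask is 2^n - 1, i.e. all low bits set. */
	assert((dma_alignment_mask & (dma_alignment_mask + 1)) == 0);

	return ((bv_offset | bv_len) & dma_alignment_mask) != 0;
}
```

With a mask of 511, a buffer that starts one byte into a page (offset 1, as
with the address quoted earlier in this thread) is reported as misaligned,
while an offset of 512 with a length of 4096 is not. In the patch, a
misaligned segment makes bio_may_need_split() return true, so the bio leaves
the single-bvec fast path.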
* [PATCH] ext4: skip extra isize expansion on inode eviction to avoid deadlock
From: Yun Zhou @ 2026-06-11 12:45 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang, ebiggers, yun.zhou
Cc: linux-ext4, linux-kernel
Expanding extra isize on an inode that is being evicted is pointless
since the inode is about to be deleted. Skip it by setting
EXT4_STATE_NO_EXPAND before calling ext4_mark_inode_dirty() in the
eviction path.
This also breaks a circular lock dependency reported by lockdep during
orphan cleanup at mount time:
  CPU0 (writeback worker)          CPU1 (open)
  ----                             ----
  ext4_writepages()
  s_writepages_rwsem (read)        ext4_create()
  ext4_do_writepages()             __ext4_new_inode()
  ext4_journal_start()             [holds jbd2 handle]
  wait_transaction_locked()        ext4_xattr_set_handle()
  [WAIT for jbd2_handle]           xattr_sem (write)

  CPU2 (mount / orphan cleanup)
  ----
  ext4_evict_inode()
  __ext4_mark_inode_dirty()
  ext4_try_to_expand_extra_isize()
  xattr_sem (write)
  ext4_expand_extra_isize_ea()
  ext4_xattr_block_set()
  iput(ea_inode)
  write_inode_now()
  ext4_writepages()
  s_writepages_rwsem (read)
  [WAIT for s_writepages_rwsem -- if blocked by write lock holder]
This forms a circular dependency on lock classes:
s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
The iput() inside ext4_xattr_block_set() triggers write_inode_now()
because SB_ACTIVE is not yet set during mount, so iput_final() cannot
cache the inode in the LRU and must flush it synchronously.
Setting EXT4_STATE_NO_EXPAND prevents ext4_try_to_expand_extra_isize()
from executing, which eliminates the xattr_sem --> s_writepages_rwsem
edge and breaks the cycle.
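The cycle argument can be modelled with a small userspace sketch in C. This
is not how lockdep is implemented; it only encodes the three lock classes
and the three edges listed above as a directed graph and searches it for a
cycle. All names (enum lock_class, struct lock_graph, add_dependency(),
has_cycle(), eviction_graph_has_cycle()) are hypothetical, and the model
covers only the eviction path described in this report.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_class {
	S_WRITEPAGES_RWSEM,
	JBD2_HANDLE,
	XATTR_SEM,
	NR_LOCK_CLASSES
};

/* edge[a][b] means: b was acquired (or waited on) while a was held. */
struct lock_graph {
	bool edge[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

static void add_dependency(struct lock_graph *g, enum lock_class held,
			   enum lock_class wanted)
{
	assert(held < NR_LOCK_CLASSES && wanted < NR_LOCK_CLASSES);
	g->edge[held][wanted] = true;
}

/* Depth-first search: can we get from 'from' back to 'target'? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *visited)
{
	for (int next = 0; next < NR_LOCK_CLASSES; next++) {
		if (!g->edge[from][next])
			continue;
		if (next == target)
			return true;
		if (!visited[next]) {
			visited[next] = true;
			if (reaches(g, next, target, visited))
				return true;
		}
	}
	return false;
}

static bool has_cycle(const struct lock_graph *g)
{
	for (int start = 0; start < NR_LOCK_CLASSES; start++) {
		bool visited[NR_LOCK_CLASSES] = { false };

		visited[start] = true;
		if (reaches(g, start, start, visited))
			return true;
	}
	return false;
}

/*
 * Build the graph from the report. The xattr_sem -> s_writepages_rwsem
 * edge is added only if extra isize expansion runs during eviction.
 */
static bool eviction_graph_has_cycle(bool expansion_allowed)
{
	struct lock_graph g = { { { false } } };

	add_dependency(&g, S_WRITEPAGES_RWSEM, JBD2_HANDLE);	/* CPU0 */
	add_dependency(&g, JBD2_HANDLE, XATTR_SEM);		/* CPU1 */
	if (expansion_allowed)
		add_dependency(&g, XATTR_SEM, S_WRITEPAGES_RWSEM); /* CPU2 */

	return has_cycle(&g);
}
```

In this model eviction_graph_has_cycle(true) finds the cycle, and
eviction_graph_has_cycle(false), which corresponds to setting
EXT4_STATE_NO_EXPAND before ext4_mark_inode_dirty(), does not.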
Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
fs/ext4/inode.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..cbfd1d1282e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,12 @@ void ext4_evict_inode(struct inode *inode)
if (ext4_inode_is_fast_symlink(inode))
memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
inode->i_size = 0;
+ /*
+ * Skip extra isize expansion on inodes being deleted -- it is
+ * pointless and can trigger a circular lock dependency:
+ * xattr_sem -> ext4_xattr_block_set -> iput -> s_writepages_rwsem
+ */
+ ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
err = ext4_mark_inode_dirty(handle, inode);
if (err) {
ext4_warning(inode->i_sb,
--
2.43.0
* Re: [PATCH v2] jbd2: Remove special jbd2 slabs
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, Matthew Wilcox (Oracle)
Cc: Theodore Ts'o, Jan Kara, linux-fsdevel,
Mike Rapoport (Microsoft), Vlastimil Babka, Tal Zussman, Jan Kara
In-Reply-To: <20260528171413.1088143-1-willy@infradead.org>
On Thu, 28 May 2026 18:14:11 +0100, Matthew Wilcox (Oracle) wrote:
> When jbd2 was originally written, kmalloc() would not guarantee memory
> alignment for the requested objects. Since commit 59bb47985c1d in 2019,
> kmalloc has guaranteed natural alignment for power-of-two allocations.
> We can now remove the jbd2 special slabs and just use kmalloc() directly.
Applied, thanks!
[1/1] jbd2: Remove special jbd2 slabs
commit: bbe9015f23432bd4f5b8590eb178b3b5b7c29f02
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, Andreas Dilger, Aditya Prakash Srivastava
Cc: Theodore Ts'o, Jan Kara, Baokun Li, Ojaswin Mujoo,
Ritesh Harjani, Zhang Yi, linux-kernel,
syzbot+0c89d865531d053abb2d
In-Reply-To: <20260608065227.3018-1-aditya.ansh182@gmail.com>
On Mon, 08 Jun 2026 06:52:27 +0000, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
>
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
>
> [...]
Applied, thanks!
[1/1] ext4: fix kernel BUG in ext4_write_inline_data_end
commit: ad09aa45965d3fafaf9963bc78109b73c0f9ac8d
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, adilger.kernel, libaokun, jack, ojaswin,
ritesh.list, yi.zhang, dmonakhov, Yun Zhou
Cc: Theodore Ts'o, linux-kernel
In-Reply-To: <20260608152521.1292656-1-yun.zhou@windriver.com>
On Mon, 08 Jun 2026 23:25:21 +0800, Yun Zhou wrote:
> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> belong to the same superblock as the original file. Currently, this
> validation is performed inside ext4_move_extents() by
> mext_check_validity(), but only after lock_two_nondirectories() has
> already acquired the inode locks. When the donor fd refers to a file
> on a different filesystem (e.g., overlayfs), this late validation
> creates a circular lock dependency:
>
> [...]
Applied, thanks!
[1/1] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
commit: c143957520c6c9b5cd72e0de8b52b814f0c576fe
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH] ext4: Remove mention of PageWriteback
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, Matthew Wilcox (Oracle)
Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-kernel
In-Reply-To: <20260526190805.341676-1-willy@infradead.org>
On Tue, 26 May 2026 20:08:02 +0100, Matthew Wilcox (Oracle) wrote:
> Update a comment to refer to the concept of writeback instead of the
> (now obsolete) detail of how it's implemented.
Applied, thanks!
[1/1] ext4: Remove mention of PageWriteback
commit: 4e3a55f44b42c2aabd4c1cc3bdb6a01a7107121d
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH v2] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, adilger.kernel, libaokun, jack, ojaswin,
ritesh.list, yi.zhang, neil, brauner, jlayton, Hongling Zeng
Cc: Theodore Ts'o, linux-kernel, zhongling0719
In-Reply-To: <20260604073647.211279-1-zenghongling@kylinos.cn>
On Thu, 04 Jun 2026 15:36:47 +0800, Hongling Zeng wrote:
> When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
> It should return NULL instead for success and ERR_PTR() only with
> negative error codes for failure.
Applied, thanks!
[1/1] ext4: Fix ERR_PTR(0) in ext4_mkdir()
commit: 8e1c43af7cf5091d99db38b7c8129e394d7f45b5
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH v4] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Theodore Ts'o @ 2026-06-11 12:35 UTC (permalink / raw)
To: Ext4 Developers List, Andreas Dilger, Aditya Prakash Srivastava
Cc: Theodore Ts'o, Jan Kara, Baokun Li, Ojaswin Mujoo,
Ritesh Harjani, Zhang Yi, sashiko-reviews, linux-kernel,
syzbot+0c89d865531d053abb2d
In-Reply-To: <20260609062005.1702-1-aditya.ansh182@gmail.com>
On Tue, 09 Jun 2026 06:20:05 +0000, Aditya Prakash Srivastava wrote:
> When the data=journal mount option is used, the ext4_journalled_write_end()
> function incorrectly calls ext4_write_inline_data_end() without checking
> if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.
>
> If a previous attempt to convert the inline data to an extent failed (e.g.
> due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
> the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
> call to ext4_write_begin() will not prepare the inline data xattr for
> writing, but ext4_journalled_write_end() will incorrectly attempt to write
> to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
> ext4_write_inline_data() since i_inline_size was not expanded.
>
> [...]
Applied, thanks!
[1/1] ext4: fix kernel BUG in ext4_write_inline_data_end
commit: ad09aa45965d3fafaf9963bc78109b73c0f9ac8d
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* Re: [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan @ 2026-06-11 12:04 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
linux-ext4, linux-kernel, lidiangang, p.raghav
In-Reply-To: <mmbe4kdeqg6zlblhysi27qno22dtkaahv7bzslaqopsg4k3qs7@nofv525nnl6c>
On 2026/6/11 17:36, Pankaj Raghav (Samsung) wrote:
>> +static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
>> + struct bio *bio)
>> +{
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + ssize_t ret;
>> +
>> + WRITE_ONCE(iocb->private, NULL);
>> +
>> + ret = iomap_dio_simple_read_finish(iocb, bio,
>> + blk_status_to_errno(bio->bi_status));
>> +
>> + inode_dio_end(inode);
>> + trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
> Shouldn't the second parameter here be
> blk_status_to_errno(bio->bi_status)?
>
> I think that will be more meaningful for tracing here.
> trace_iomap_dio_complete(iocb, blk_status_to_errno(bio->bi_status), ret);
Makes sense. I’ll update it in the next version.
>
> <snip>
>> + return ret;
>> +}
>> +
>> + sr->iocb = iocb;
>> + sr->dio_flags = dio_flags;
>> +
>> + bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
>> + bio->bi_ioprio = iocb->ki_ioprio;
>> + bio->bi_private = sr;
>> + bio->bi_end_io = iomap_dio_simple_read_end_io;
>> +
>> + if (dio_flags & IOMAP_DIO_BOUNCE)
>> + ret = bio_iov_iter_bounce(bio, iter, count);
>> + else
>> + ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
>> + if (unlikely(ret))
>> + goto out_bio_put;
>> +
>> + if (bio->bi_iter.bi_size != count) {
>> + iov_iter_revert(iter, bio->bi_iter.bi_size);
>> + ret = -ENOTBLK;
>> + goto out_bio_release_pages;
>> + }
>> +
>> + sr->size = bio->bi_iter.bi_size;
>> +
>> + if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
>> + !(dio_flags & IOMAP_DIO_BOUNCE))
>> + bio_set_pages_dirty(bio);
>> +
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + bio->bi_opf |= REQ_NOWAIT;
>> + if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
>> + bio->bi_opf |= REQ_POLLED;
>> + bio_set_polled(bio, iocb);
> This results in build failure as the following patch removed this call:
> https://lore.kernel.org/linux-block/20260518062917.506483-1-hch@lst.de/
>
> I think this call can just be removed as you are setting REQ_POLLED
> anyway.
You’re right. I’ll update that in the next version too.
Thanks.
>
>> + WRITE_ONCE(iocb->private, bio);
>> + }
>> +
>> + if (wait_for_completion) {
>> + sr->waiter = current;
>> + blk_crypto_submit_bio(bio);
>> + } else {
>> + atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
>> + sr->waiter = NULL;
>> + blk_crypto_submit_bio(bio);
>> + ret = -EIOCBQUEUED;
>> + }
>> +
> --
> Pankaj
* Re: [PATCH v2 0/4] show orphan file inode detail info
From: yebin @ 2026-06-11 11:42 UTC (permalink / raw)
To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4
In-Reply-To: <a5v57ie6feotxznmhrf3i22gzplw2ucotlnw3y7hmjhkalbb26@bx2lzoil75ks>
On 2026/6/9 19:13, Jan Kara wrote:
> On Mon 08-06-26 19:44:20, yebin wrote:
>> On 2026/4/16 1:59, Jan Kara wrote:
>>> On Wed 15-04-26 18:55:01, Ye Bin wrote:
>>>> From: Ye Bin <yebin10@huawei.com>
>>>>
>>>> Diffs v2 vs v1:
>>>> (1) Fix sashiko review issues:
>>>> https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
>>>> (2) Change "orphan_list" file mode from 0444 to 0400;
>>>> (3) The display format of the "orphan_list" file is modified according
>>>> to Andreas' suggestions.
>>>> Fault injection tests have been conducted to address the issues raised
>>>> in the sashiko review. There is no UAF issue in the ext4_seq_orphan_release()
>>>> function. The reason for this has already been explained in the code comments.
>>>> In addition to the fault injection tests, we also performed a stress test by
>>>> observing the /proc/fs/ext4/XX/orphan_list and the concurrent processes of
>>>> adding and removing orphan nodes, and no issues were found so far.
>>>>
>>>>
>>>> In actual production environments, the issue of inconsistency between
>>>> df and du is frequently encountered. In many cases, the cause of the
>>>> problem can be identified through the use of lsof. However, when
>>>> overlayfs is combined with project quota configuration, the issue becomes
>>>> more complex and troublesome to diagnose. First, to determine the project
>>>> ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
>>>> then retrieve file information through `debugfs`. However, the file names
>>>> cannot always be obtained, and it is often unclear which files they are.
>>>> To identify which files these are, one would need to use crash for online
>>>> debugging or use kprobe to gather information incrementally. However, some
>>>> customers in production environments do not agree to upload any tools, and
>>>> online debugging might impact the business. There are also scenarios where
>>>> files are opened in kernel mode, which do not generate file descriptors (fds),
>>>> making it impossible to identify which files were deleted but still have
>>>> references through lsof. This patchset adds a procfs interface to query
>>>> information about orphaned nodes, which can assist in the analysis and
>>>> localization of such issues.
>>>
>>> Ye, did you read my comments to the v1 of the patchset [1]? I didn't see
>>> any reply from you. I don't think this is a good way to expose orphan
>>> information for a filesystem for reasons I've outlined in that email.
>>>
>>
>> Hi Jan
>>
>> I thought about how to prevent resource exhaustion caused by creating too many
>> FDs in a single application. My idea is that the IOCTL should only obtain one FD
>> at a time, and the next call should continue with the orphan inode
>> after the previous one. Each time an fd is obtained, the previous fd
>> should be closed. I expect that after traversing all the fds from the beginning,
>> they will all be closed and there will be no need for user space to close them
>> manually. I wonder if this approach is feasible? Or do you have any better
>> suggestions?
>
> Hum, I think you've misunderstood my suggestion in [1]. What I suggested
> is:
>
> 1) Provide ioctl GET_ORPHAN_FILES that will return one "virtual" fd that
> tracks state of iteration over orphan entries of a superblock
>
> 2) Reading from this fd will return file *handles* (as struct
> file_handle) describing the orphan inodes. A struct file_handle occupies
> no kernel resources. It is essentially just a filesystem agnostic
> container for inode number and inode generation number. Userspace can
> then use the open_by_handle_at() syscall to convert a struct file_handle
> into a normal file descriptor, but that is up to userspace and what it
> wants the orphan information for.
>
> Is the design clearer now?
>
Thank you for your patient explanation. I have implemented it according to
your suggestion and am currently testing it locally. Once testing is
complete, I will post it. I hope I have understood you correctly
this time.
> Honza
>
> [1] https://lore.kernel.org/all/n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq/
>
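For readers following the design above, here is a minimal userspace sketch of
how a buffer read from such a "virtual" fd might be walked. It is not taken
from any posted patch. It assumes (hypothetically) that the fd returns
back-to-back records, each a struct file_handle header followed by
handle_bytes of opaque data; the names fh_hdr and count_file_handles() are
made up for illustration, and the local struct only mirrors the fixed header
of struct file_handle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the fixed header of struct file_handle from <fcntl.h>. */
struct fh_hdr {
	unsigned int handle_bytes;	/* size of the opaque handle that follows */
	int handle_type;		/* filesystem-specific handle type */
};

/*
 * Walk a buffer ASSUMED to hold back-to-back file handle records.
 * Returns the number of complete records, or -1 if one is truncated.
 */
int count_file_handles(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL || len == 0);
	while (off < len) {
		struct fh_hdr hdr;

		/* Never read a header that does not fit in the buffer. */
		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		/* The opaque payload must fit as well. */
		if (hdr.handle_bytes > len - off - sizeof(hdr))
			return -1;
		/* A real consumer could hand this record to open_by_handle_at(). */
		off += sizeof(hdr) + hdr.handle_bytes;
		n++;
	}
	return n;
}
```

Copying the header out with memcpy() avoids unaligned reads if a record does
not start on a 4-byte boundary. Whether and when to call open_by_handle_at()
on a record stays a userspace decision, as described in the mail.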
^ permalink raw reply
* [syzbot ci] Re: Data in direntry (dirdata) feature
From: syzbot ci @ 2026-06-11 10:29 UTC (permalink / raw)
To: adilger.kernel, adilger, adilger, artem.blagodarenko, linux-ext4,
pravin.shelar
Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260610152417.13576-1-ablagodarenko@thelustrecollective.com>
syzbot ci has tested the following series
[v2] Data in direntry (dirdata) feature
https://lore.kernel.org/all/20260610152417.13576-1-ablagodarenko@thelustrecollective.com
* [PATCH v2 01/10] ext4: replace ext4_dir_entry with ext4_dir_entry_2
* [PATCH v2 02/10] ext4: add ext4_dir_entry_is_tail()
* [PATCH v2 03/10] ext4: refactor dx_root to support variable dirent sizes
* [PATCH v2 04/10] ext4: add dirdata format definitions and access helpers
* [PATCH v2 05/10] ext4: preserve dirdata bits in get_dtype()
* [PATCH v2 06/10] ext4: add ext4_dir_entry_len() and harden dirdata parsing
* [PATCH v2 07/10] ext4: rename ext4_dir_rec_len() and clarify dirdata usage
* [PATCH v2 08/10] ext4: dirdata feature
* [PATCH v2 09/10] ext4: add dirdata set/get helpers
* [PATCH v2 10/10] ext4: Add EXT4_IOC_SET_LUFID ioctl for setting LUFID on directory entries
and found the following issues:
* KASAN: slab-out-of-bounds Read in __ext4_check_dir_entry
* KASAN: slab-out-of-bounds Read in ext4_inlinedir_to_tree
* KASAN: slab-use-after-free Read in __ext4_check_dir_entry
* KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree
* KASAN: use-after-free Read in __ext4_check_dir_entry
Full report is available here:
https://ci.syzbot.org/series/5bf0e2fa-2e68-4532-8396-4568879b2788
***
KASAN: slab-out-of-bounds Read in __ext4_check_dir_entry
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/b0854918-13f9-49dd-ab30-12154f0debe2/syz_repro
loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-out-of-bounds in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff8881022db7f5 by task syz.0.23/5815
CPU: 1 UID: 0 PID: 5815 Comm: syz.0.23 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description+0x55/0x1e0 mm/kasan/report.c:378
print_report+0x58/0x70 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
__ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
ext4_check_all_de+0x66/0x150 fs/ext4/dir.c:657
ext4_convert_inline_data_nolock+0x1b7/0x990 fs/ext4/inline.c:1121
ext4_try_add_inline_entry+0x604/0x8e0 fs/ext4/inline.c:1247
__ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
ext4_add_entry fs/ext4/namei.c:2613 [inline]
ext4_mkdir+0x5e5/0xce0 fs/ext4/namei.c:3175
vfs_mkdir+0x413/0x630 fs/namei.c:5271
filename_mkdirat+0x285/0x510 fs/namei.c:5304
__do_sys_mkdirat fs/namei.c:5325 [inline]
__se_sys_mkdirat+0x35/0x150 fs/namei.c:5322
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f669359bcc7
Code: 00 66 90 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 db f7 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 b8 02 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd42381d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000102
RAX: ffffffffffffffda RBX: 00007ffd42381dc0 RCX: 00007f669359bcc7
RDX: 00000000000001ff RSI: 0000200000001200 RDI: 00000000ffffff9c
RBP: 00002000000024c0 R08: 0000200000000240 R09: 0000000000000000
R10: 00002000000024c0 R11: 0000000000000246 R12: 0000200000001200
R13: 00007ffd42381d80 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Allocated by task 5066:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__kmalloc_cache_noprof+0x31c/0x660 mm/slub.c:5420
kmalloc_noprof include/linux/slab.h:950 [inline]
kzalloc_noprof include/linux/slab.h:1188 [inline]
kernfs_get_open_node fs/kernfs/file.c:543 [inline]
kernfs_fop_open+0x862/0xda0 fs/kernfs/file.c:718
do_dentry_open+0x822/0x13a0 fs/open.c:947
vfs_open+0x3b/0x340 fs/open.c:1079
do_open fs/namei.c:4699 [inline]
path_openat+0x2e08/0x3860 fs/namei.c:4858
do_file_open+0x23e/0x4a0 fs/namei.c:4887
do_sys_openat2+0x113/0x200 fs/open.c:1364
do_sys_open fs/open.c:1370 [inline]
__do_sys_openat fs/open.c:1386 [inline]
__se_sys_openat fs/open.c:1381 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1381
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Last potentially related work creation:
kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
kvfree_call_rcu+0x100/0x430 mm/slab_common.c:1970
kernfs_unlink_open_file+0x3fe/0x4b0 fs/kernfs/file.c:604
kernfs_fop_release+0x2eb/0x440 fs/kernfs/file.c:783
__fput+0x44f/0xa60 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1507 [inline]
__se_sys_close fs/open.c:1492 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1492
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff8881022db700
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 117 bytes to the right of
allocated 128-byte region [ffff8881022db700, ffff8881022db780)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1022db
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000000 ffff888100041a00 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2000(__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 0, tgid 0 (swapper/0), ts 2408938923, free_ts 0
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
prep_new_page mm/page_alloc.c:1861 [inline]
get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
alloc_slab_page mm/slub.c:3278 [inline]
allocate_slab+0x77/0x660 mm/slub.c:3467
new_slab mm/slub.c:3525 [inline]
refill_objects+0x339/0x3d0 mm/slub.c:7272
refill_sheaf mm/slub.c:2816 [inline]
__pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
alloc_from_pcs mm/slub.c:4750 [inline]
slab_alloc_node mm/slub.c:4884 [inline]
__do_kmalloc_node mm/slub.c:5295 [inline]
__kmalloc_noprof+0x474/0x760 mm/slub.c:5308
kmalloc_noprof include/linux/slab.h:954 [inline]
kzalloc_noprof include/linux/slab.h:1188 [inline]
__alloc_empty_sheaf mm/slub.c:2768 [inline]
alloc_empty_sheaf mm/slub.c:2783 [inline]
__pcs_replace_empty_main+0x2df/0x720 mm/slub.c:4647
alloc_from_pcs mm/slub.c:4750 [inline]
slab_alloc_node mm/slub.c:4884 [inline]
kmem_cache_alloc_noprof+0x37d/0x650 mm/slub.c:4906
dup_fd+0x55/0xb40 fs/file.c:390
copy_files+0xc8/0x120 kernel/fork.c:1639
copy_process+0x1d94/0x4440 kernel/fork.c:2252
kernel_clone+0x2d7/0x940 kernel/fork.c:2722
user_mode_thread+0x110/0x180 kernel/fork.c:2798
rest_init+0x23/0x300 init/main.c:727
start_kernel+0x38a/0x3e0 init/main.c:1220
page_owner free stack trace missing
Memory state around the buggy address:
ffff8881022db680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8881022db700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881022db780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff8881022db800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8881022db880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
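As a reading aid for the report above, the "117 bytes to the right" figure can
be reproduced from the faulting address and the half-open region bounds that
KASAN prints. The helper name bytes_right_of() is made up for this sketch; the
addresses in the comments are the ones from the report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Distance of an access from the end of a half-open region [start, end).
 * Only meaningful for accesses at or beyond the end of the region.
 */
uint64_t bytes_right_of(uint64_t addr, uint64_t start, uint64_t end)
{
	assert(start <= end && addr >= end);
	return addr - end;
}

/*
 * Values from the report:
 *   addr  = 0xffff8881022db7f5
 *   start = 0xffff8881022db700
 *   end   = 0xffff8881022db780   (end - start = 0x80 = 128 bytes)
 *   addr - end = 0x75 = 117
 */
```

The read therefore lands well past the reported kmalloc-128 object, in memory
whose shadow bytes are marked fc in the dump.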
***
KASAN: slab-out-of-bounds Read in ext4_inlinedir_to_tree
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/2dff870b-f382-4c93-8d8d-b2291d921224/syz_repro
loop1: lost filesystem error report for type 5 error -117
EXT4-fs (loop1): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_dir_entry_len fs/ext4/ext4.h:4095 [inline]
BUG: KASAN: slab-out-of-bounds in ext4_inlinedir_to_tree+0xda5/0x10d0 fs/ext4/inline.c:1335
Read of size 2 at addr ffff888115a3183c by task syz.1.18/5839
CPU: 1 UID: 0 PID: 5839 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description+0x55/0x1e0 mm/kasan/report.c:378
print_report+0x58/0x70 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
ext4_dir_entry_len fs/ext4/ext4.h:4095 [inline]
ext4_inlinedir_to_tree+0xda5/0x10d0 fs/ext4/inline.c:1335
ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
ext4_dx_readdir fs/ext4/dir.c:600 [inline]
ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
iterate_dir+0x399/0x570 fs/readdir.c:110
__do_sys_getdents64 fs/readdir.c:399 [inline]
__se_sys_getdents64+0xf1/0x280 fs/readdir.c:384
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3e02b9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3e03ad5028 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
RAX: ffffffffffffffda RBX: 00007f3e02e15fa0 RCX: 00007f3e02b9ce59
RDX: 0000000000001000 RSI: 0000200000000f80 RDI: 0000000000000004
RBP: 00007f3e02c32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f3e02e16038 R14: 00007f3e02e15fa0 R15: 00007ffcaa902298
</TASK>
Allocated by task 5839:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__do_kmalloc_node mm/slub.c:5296 [inline]
__kmalloc_noprof+0x35c/0x760 mm/slub.c:5308
kmalloc_noprof include/linux/slab.h:954 [inline]
ext4_inlinedir_to_tree+0x312/0x10d0 fs/ext4/inline.c:1292
ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
ext4_dx_readdir fs/ext4/dir.c:600 [inline]
ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
iterate_dir+0x399/0x570 fs/readdir.c:110
__do_sys_getdents64 fs/readdir.c:399 [inline]
__se_sys_getdents64+0xf1/0x280 fs/readdir.c:384
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff888115a31800
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes to the right of
allocated 60-byte region [ffff888115a31800, ffff888115a3183c)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x115a31
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000000 ffff8881000418c0 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5051, tgid 5051 (acpid), ts 27203740677, free_ts 27201732767
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
prep_new_page mm/page_alloc.c:1861 [inline]
get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
alloc_slab_page mm/slub.c:3278 [inline]
allocate_slab+0x77/0x660 mm/slub.c:3467
new_slab mm/slub.c:3525 [inline]
refill_objects+0x339/0x3d0 mm/slub.c:7272
refill_sheaf mm/slub.c:2816 [inline]
__pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
alloc_from_pcs mm/slub.c:4750 [inline]
slab_alloc_node mm/slub.c:4884 [inline]
__do_kmalloc_node mm/slub.c:5295 [inline]
__kmalloc_noprof+0x474/0x760 mm/slub.c:5308
kmalloc_noprof include/linux/slab.h:954 [inline]
kzalloc_noprof include/linux/slab.h:1188 [inline]
tomoyo_get_name+0x20c/0x590 security/tomoyo/memory.c:173
tomoyo_parse_name_union+0xd9/0x130 security/tomoyo/util.c:260
tomoyo_update_path_acl security/tomoyo/file.c:399 [inline]
tomoyo_write_file+0x3a6/0xc50 security/tomoyo/file.c:1027
tomoyo_write_domain2 security/tomoyo/common.c:1160 [inline]
tomoyo_add_entry security/tomoyo/common.c:2177 [inline]
tomoyo_supervisor+0x1208/0x1570 security/tomoyo/common.c:2238
tomoyo_audit_path_log security/tomoyo/file.c:169 [inline]
tomoyo_path_permission+0x25a/0x380 security/tomoyo/file.c:592
tomoyo_check_open_permission+0x2b2/0x470 security/tomoyo/file.c:782
security_file_open+0xa9/0x240 security/security.c:2739
do_dentry_open+0x4a8/0x13a0 fs/open.c:924
vfs_open+0x3b/0x340 fs/open.c:1079
page last free pid 15 tgid 15 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
__free_pages_prepare mm/page_alloc.c:1397 [inline]
__free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
__tlb_remove_table_free mm/mmu_gather.c:228 [inline]
tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
rcu_do_batch kernel/rcu/tree.c:2617 [inline]
rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
handle_softirqs+0x22a/0x840 kernel/softirq.c:622
run_ksoftirqd+0x36/0x60 kernel/softirq.c:1076
smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
kthread+0x389/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Memory state around the buggy address:
ffff888115a31700: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff888115a31780: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
>ffff888115a31800: 00 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc
^
ffff888115a31880: 00 00 00 00 00 00 02 fc fc fc fc fc fc fc fc fc
ffff888115a31900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
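The "Memory state" rows above can be cross-checked against the reported
60-byte region. In generic KASAN each shadow byte describes 8 bytes of
memory: 00 means all 8 are accessible, 01..07 means only the first N are, and
other values (fc, fa, fb in these dumps) mark poisoned memory. A small sketch,
with the hypothetical helper name accessible_prefix(), that sums the
accessible bytes at the start of a row:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE 8	/* bytes of memory described by one shadow byte */

/*
 * Count the accessible bytes at the start of a shadow row, stopping at the
 * first partially accessible or poisoned granule.
 */
size_t accessible_prefix(const uint8_t *shadow, size_t n)
{
	size_t total = 0;

	assert(shadow != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (shadow[i] == 0) {
			total += GRANULE;	/* whole granule accessible */
			continue;
		}
		if (shadow[i] < GRANULE)
			total += shadow[i];	/* only the first N bytes */
		break;				/* partial or poisoned: stop */
	}
	return total;
}
```

For the marked row, "00 00 00 00 00 00 00 04 fc ...", this gives
7 * 8 + 4 = 60 bytes, matching the allocated 60-byte region
[ffff888115a31800, ffff888115a3183c). The 2-byte read at ...183c starts
exactly at the first inaccessible byte.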
***
KASAN: slab-use-after-free Read in __ext4_check_dir_entry
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/f1d48ea1-6e87-4d64-9c13-8bf8aed109fc/syz_repro
loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff888114d8c045 by task syz.0.20/5821
CPU: 1 UID: 0 PID: 5821 Comm: syz.0.20 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description+0x55/0x1e0 mm/kasan/report.c:378
print_report+0x58/0x70 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
__ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
ext4_find_dest_de+0x136/0x770 fs/ext4/namei.c:2203
ext4_add_dirent_to_inline+0xcf/0x430 fs/ext4/inline.c:984
ext4_try_add_inline_entry+0x235/0x8e0 fs/ext4/inline.c:1213
__ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
ext4_add_entry fs/ext4/namei.c:2613 [inline]
ext4_add_nondir+0x111/0x310 fs/ext4/namei.c:2936
ext4_create+0x2e9/0x470 fs/ext4/namei.c:2982
lookup_open fs/namei.c:4511 [inline]
open_last_lookups fs/namei.c:4611 [inline]
path_openat+0x1395/0x3860 fs/namei.c:4855
do_file_open+0x23e/0x4a0 fs/namei.c:4887
do_sys_openat2+0x113/0x200 fs/open.c:1364
do_sys_open fs/open.c:1370 [inline]
__do_sys_openat fs/open.c:1386 [inline]
__se_sys_openat fs/open.c:1381 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1381
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f922219ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9223137028 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f9222415fa0 RCX: 00007f922219ce59
RDX: 0000000000042042 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007f9222232d6f R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000014a R11: 0000000000000246 R12: 0000000000000000
R13: 00007f9222416038 R14: 00007f9222415fa0 R15: 00007ffd01a2d448
</TASK>
Allocated by task 5484:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
unpoison_slab_object mm/kasan/common.c:340 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4570 [inline]
slab_alloc_node mm/slub.c:4899 [inline]
kmem_cache_alloc_node_noprof+0x384/0x690 mm/slub.c:4951
kmalloc_reserve net/core/skbuff.c:613 [inline]
__alloc_skb+0x27d/0x7d0 net/core/skbuff.c:713
alloc_skb include/linux/skbuff.h:1385 [inline]
nlmsg_new include/net/netlink.h:1055 [inline]
mpls_netconf_notify_devconf+0x46/0x100 net/mpls/af_mpls.c:1217
mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
call_netdevice_notifiers net/core/dev.c:2301 [inline]
unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
kthread+0x389/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Freed by task 5484:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2689 [inline]
slab_free mm/slub.c:6251 [inline]
kfree+0x1c5/0x640 mm/slub.c:6566
skb_kfree_head net/core/skbuff.c:1075 [inline]
skb_free_head net/core/skbuff.c:1087 [inline]
skb_release_data+0x828/0xa60 net/core/skbuff.c:1114
skb_release_all net/core/skbuff.c:1189 [inline]
__kfree_skb+0x5d/0x210 net/core/skbuff.c:1203
netlink_broadcast_filtered+0xe18/0xf20 net/netlink/af_netlink.c:1540
nlmsg_multicast_filtered include/net/netlink.h:1165 [inline]
nlmsg_multicast include/net/netlink.h:1184 [inline]
nlmsg_notify+0xf0/0x1a0 net/netlink/af_netlink.c:2598
mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
call_netdevice_notifiers net/core/dev.c:2301 [inline]
unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
kthread+0x389/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
The buggy address belongs to the object at ffff888114d8c000
which belongs to the cache skbuff_small_head of size 704
The buggy address is located 69 bytes inside of
freed 704-byte region [ffff888114d8c000, ffff888114d8c2c0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x114d8c
head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000040 ffff888160416b40 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800120012 00000000f5000000 0000000000000000
head: 017ff00000000040 ffff888160416b40 dead000000000100 dead000000000122
head: 0000000000000000 0000000800120012 00000000f5000000 0000000000000000
head: 017ff00000000002 ffffffffffffff01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000004
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5484, tgid 5484 (kworker/u8:2), ts 72573003529, free_ts 72546506446
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
prep_new_page mm/page_alloc.c:1861 [inline]
get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
alloc_slab_page mm/slub.c:3278 [inline]
allocate_slab+0x77/0x660 mm/slub.c:3467
new_slab mm/slub.c:3525 [inline]
refill_objects+0x339/0x3d0 mm/slub.c:7272
refill_sheaf mm/slub.c:2816 [inline]
__pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
alloc_from_pcs mm/slub.c:4750 [inline]
slab_alloc_node mm/slub.c:4884 [inline]
kmem_cache_alloc_node_noprof+0x441/0x690 mm/slub.c:4951
kmalloc_reserve net/core/skbuff.c:613 [inline]
__alloc_skb+0x27d/0x7d0 net/core/skbuff.c:713
alloc_skb include/linux/skbuff.h:1385 [inline]
nlmsg_new include/net/netlink.h:1055 [inline]
mpls_netconf_notify_devconf+0x46/0x100 net/mpls/af_mpls.c:1217
mpls_dev_notify+0xb2d/0xd10 net/mpls/af_mpls.c:1691
notifier_call_chain+0x1ad/0x3d0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2287 [inline]
call_netdevice_notifiers net/core/dev.c:2301 [inline]
unregister_netdevice_many_notify+0x17a5/0x22c0 net/core/dev.c:12421
ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
page last free pid 5484 tgid 5484 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
__free_pages_prepare mm/page_alloc.c:1397 [inline]
__free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
stack_depot_save_flags+0x40e/0x810 lib/stackdepot.c:735
kasan_save_stack mm/kasan/common.c:58 [inline]
kasan_save_track+0x4f/0x80 mm/kasan/common.c:78
unpoison_slab_object mm/kasan/common.c:340 [inline]
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366
kasan_slab_alloc include/linux/kasan.h:253 [inline]
slab_post_alloc_hook mm/slub.c:4570 [inline]
slab_alloc_node mm/slub.c:4899 [inline]
kmem_cache_alloc_noprof+0x2bc/0x650 mm/slub.c:4906
kmem_alloc_batch lib/debugobjects.c:371 [inline]
fill_pool+0x156/0x580 lib/debugobjects.c:420
debug_objects_fill_pool lib/debugobjects.c:752 [inline]
debug_object_activate+0x4a3/0x580 lib/debugobjects.c:841
debug_rcu_head_queue kernel/rcu/rcu.h:236 [inline]
__call_rcu_common kernel/rcu/tree.c:3116 [inline]
call_rcu+0x43/0x890 kernel/rcu/tree.c:3251
kernfs_put+0x259/0x520 fs/kernfs/dir.c:618
kernfs_remove_by_name_ns+0xc8/0x140 fs/kernfs/dir.c:1799
device_remove_class_symlinks+0x178/0x190 drivers/base/core.c:3479
device_del+0x400/0x8f0 drivers/base/core.c:3881
unregister_netdevice_many_notify+0x1d5f/0x22c0 net/core/dev.c:12456
ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
ops_undo_list+0x3d3/0x940 net/core/net_namespace.c:248
cleanup_net+0x56b/0x800 net/core/net_namespace.c:702
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
Memory state around the buggy address:
ffff888114d8bf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888114d8bf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888114d8c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888114d8c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888114d8c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
***
KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/f42da242-e16e-4f10-bf25-0bd7e192d989/syz_repro
loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: slab-use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: slab-use-after-free in ext4_inlinedir_to_tree+0x94c/0x10d0 fs/ext4/inline.c:1335
Read of size 1 at addr ffff88816fee8825 by task syz.0.20/5867
CPU: 1 UID: 0 PID: 5867 Comm: syz.0.20 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description+0x55/0x1e0 mm/kasan/report.c:378
print_report+0x58/0x70 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
ext4_inlinedir_to_tree+0x94c/0x10d0 fs/ext4/inline.c:1335
ext4_htree_fill_tree+0x517/0x1230 fs/ext4/namei.c:1182
ext4_dx_readdir fs/ext4/dir.c:600 [inline]
ext4_readdir+0x2db4/0x3640 fs/ext4/dir.c:146
iterate_dir+0x399/0x570 fs/readdir.c:110
__do_sys_getdents fs/readdir.c:319 [inline]
__se_sys_getdents+0xf1/0x270 fs/readdir.c:304
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f010ad9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f010bc0f028 EFLAGS: 00000246 ORIG_RAX: 000000000000004e
RAX: ffffffffffffffda RBX: 00007f010b015fa0 RCX: 00007f010ad9ce59
RDX: 0000000000000054 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f010ae32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f010b016038 R14: 00007f010b015fa0 R15: 00007ffd93577348
</TASK>
Allocated by task 5064:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__do_kmalloc_node mm/slub.c:5296 [inline]
__kmalloc_noprof+0x35c/0x760 mm/slub.c:5308
kmalloc_noprof include/linux/slab.h:954 [inline]
kzalloc_noprof include/linux/slab.h:1188 [inline]
tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
tomoyo_encode+0x28b/0x550 security/tomoyo/realpath.c:80
tomoyo_realpath_from_path+0x58d/0x5d0 security/tomoyo/realpath.c:283
tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
tomoyo_path_perm+0x283/0x560 security/tomoyo/file.c:827
security_inode_getattr+0x12b/0x310 security/security.c:1895
vfs_getattr fs/stat.c:259 [inline]
vfs_fstat fs/stat.c:281 [inline]
vfs_fstatat+0xb4/0x170 fs/stat.c:371
__do_sys_newfstatat fs/stat.c:538 [inline]
__se_sys_newfstatat fs/stat.c:532 [inline]
__x64_sys_newfstatat+0x151/0x200 fs/stat.c:532
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 5064:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2689 [inline]
slab_free mm/slub.c:6251 [inline]
kfree+0x1c5/0x640 mm/slub.c:6566
tomoyo_path_perm+0x403/0x560 security/tomoyo/file.c:847
security_inode_getattr+0x12b/0x310 security/security.c:1895
vfs_getattr fs/stat.c:259 [inline]
vfs_fstat fs/stat.c:281 [inline]
vfs_fstatat+0xb4/0x170 fs/stat.c:371
__do_sys_newfstatat fs/stat.c:538 [inline]
__se_sys_newfstatat fs/stat.c:532 [inline]
__x64_sys_newfstatat+0x151/0x200 fs/stat.c:532
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88816fee8800
which belongs to the cache kmalloc-64 of size 64
The buggy address is located 37 bytes inside of
freed 64-byte region [ffff88816fee8800, ffff88816fee8840)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x16fee8
flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 057ff00000000000 ffff8881000418c0 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 21294026082, free_ts 0
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
prep_new_page mm/page_alloc.c:1861 [inline]
get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
alloc_slab_page mm/slub.c:3278 [inline]
allocate_slab+0x77/0x660 mm/slub.c:3467
new_slab mm/slub.c:3525 [inline]
refill_objects+0x339/0x3d0 mm/slub.c:7272
refill_sheaf mm/slub.c:2816 [inline]
__pcs_replace_empty_main+0x321/0x720 mm/slub.c:4652
alloc_from_pcs mm/slub.c:4750 [inline]
slab_alloc_node mm/slub.c:4884 [inline]
__do_kmalloc_node mm/slub.c:5295 [inline]
__kmalloc_noprof+0x474/0x760 mm/slub.c:5308
kmalloc_noprof include/linux/slab.h:954 [inline]
kzalloc_noprof include/linux/slab.h:1188 [inline]
handler_new_ref+0x261/0x9c0 drivers/media/v4l2-core/v4l2-ctrls-core.c:1882
v4l2_ctrl_add_handler+0x19f/0x290 drivers/media/v4l2-core/v4l2-ctrls-core.c:2443
vivid_create_controls+0x332d/0x3bd0 drivers/media/test-drivers/vivid/vivid-ctrls.c:2072
vivid_create_instance drivers/media/test-drivers/vivid/vivid-core.c:1933 [inline]
vivid_probe+0x4261/0x72b0 drivers/media/test-drivers/vivid/vivid-core.c:2095
platform_probe+0xf9/0x190 drivers/base/platform.c:1432
call_driver_probe drivers/base/dd.c:-1 [inline]
really_probe+0x267/0xaf0 drivers/base/dd.c:709
__driver_probe_device+0x1ef/0x380 drivers/base/dd.c:871
driver_probe_device+0x4f/0x240 drivers/base/dd.c:901
__driver_attach+0x34c/0x640 drivers/base/dd.c:1295
page_owner free stack trace missing
Memory state around the buggy address:
ffff88816fee8700: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
ffff88816fee8780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
>ffff88816fee8800: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                               ^
ffff88816fee8880: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88816fee8900: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
***
KASAN: use-after-free Read in __ext4_check_dir_entry
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/ddf6ee7c-dfa8-4383-b004-10140edc081c/config
syz repro: https://ci.syzbot.org/findings/57c0b75a-8922-4dc1-9a20-ca947564792b/syz_repro
==================================================================
BUG: KASAN: use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
BUG: KASAN: use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
BUG: KASAN: use-after-free in __ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
Read of size 1 at addr ffff88816be85045 by task syz.2.21/5880
CPU: 1 UID: 0 PID: 5880 Comm: syz.2.21 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description+0x55/0x1e0 mm/kasan/report.c:378
print_report+0x58/0x70 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
ext4_dirent_get_data_len fs/ext4/ext4.h:4069 [inline]
ext4_dir_entry_len fs/ext4/ext4.h:4096 [inline]
__ext4_check_dir_entry+0x65a/0xc40 fs/ext4/dir.c:96
ext4_find_dest_de+0x136/0x770 fs/ext4/namei.c:2203
ext4_add_dirent_to_inline+0xcf/0x430 fs/ext4/inline.c:984
ext4_try_add_inline_entry+0x235/0x8e0 fs/ext4/inline.c:1213
__ext4_add_entry+0x390/0x1f40 fs/ext4/namei.c:2529
ext4_add_entry fs/ext4/namei.c:2613 [inline]
ext4_add_nondir+0x111/0x310 fs/ext4/namei.c:2936
ext4_create+0x2e9/0x470 fs/ext4/namei.c:2982
lookup_open fs/namei.c:4511 [inline]
open_last_lookups fs/namei.c:4611 [inline]
path_openat+0x1395/0x3860 fs/namei.c:4855
do_file_open+0x23e/0x4a0 fs/namei.c:4887
do_sys_openat2+0x113/0x200 fs/open.c:1364
do_sys_open fs/open.c:1370 [inline]
__do_sys_openat fs/open.c:1386 [inline]
__se_sys_openat fs/open.c:1381 [inline]
__x64_sys_openat+0x138/0x170 fs/open.c:1381
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5713b9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff672b25f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f5713e15fa0 RCX: 00007f5713b9ce59
RDX: 0000000000042042 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007f5713c32d6f R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000014a R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5713e15fac R14: 00007f5713e15fa0 R15: 00007f5713e15fa0
</TASK>
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x16be85
flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f0(buddy)
raw: 057ff00000000000 ffffea0005afa0c8 ffffea0005afa1c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000f0000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 5630, tgid 5630 (syz-executor), ts 67290853657, free_ts 69321168948
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
prep_new_page mm/page_alloc.c:1861 [inline]
get_page_from_freelist+0x2593/0x2610 mm/page_alloc.c:3941
__alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
__alloc_pages_noprof+0x10/0x100 mm/page_alloc.c:5255
alloc_pages_bulk_noprof+0x5ff/0x7c0 mm/page_alloc.c:5175
___alloc_pages_bulk mm/kasan/shadow.c:345 [inline]
__kasan_populate_vmalloc_do mm/kasan/shadow.c:370 [inline]
__kasan_populate_vmalloc+0xc1/0x1d0 mm/kasan/shadow.c:424
kasan_populate_vmalloc include/linux/kasan.h:580 [inline]
alloc_vmap_area+0xd47/0x1480 mm/vmalloc.c:2123
__get_vm_area_node+0x1f8/0x300 mm/vmalloc.c:3226
__vmalloc_node_range_noprof+0x36a/0x1750 mm/vmalloc.c:4024
vmalloc_user_noprof+0xad/0xe0 mm/vmalloc.c:4218
kcov_ioctl+0x55/0x620 kernel/kcov.c:726
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5693 tgid 5693 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
__free_pages_prepare mm/page_alloc.c:1397 [inline]
__free_frozen_pages+0xc1c/0xd30 mm/page_alloc.c:2938
kasan_depopulate_vmalloc_pte+0x6d/0x90 mm/kasan/shadow.c:484
apply_to_pte_range mm/memory.c:3338 [inline]
apply_to_pmd_range mm/memory.c:3382 [inline]
apply_to_pud_range mm/memory.c:3418 [inline]
apply_to_p4d_range mm/memory.c:3454 [inline]
__apply_to_page_range+0xbdc/0x1420 mm/memory.c:3490
__kasan_release_vmalloc+0xa2/0xd0 mm/kasan/shadow.c:602
kasan_release_vmalloc include/linux/kasan.h:593 [inline]
kasan_release_vmalloc_node mm/vmalloc.c:2284 [inline]
purge_vmap_node+0x220/0x960 mm/vmalloc.c:2306
__purge_vmap_area_lazy+0x779/0xb40 mm/vmalloc.c:2396
drain_vmap_area_work+0x27/0x40 mm/vmalloc.c:2430
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
kthread+0x389/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Memory state around the buggy address:
ffff88816be84f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88816be84f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88816be85000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                           ^
ffff88816be85080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88816be85100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.