Linux EXT4 FS development

Linux EXT4 FS development
 help / color / mirror / Atom feed

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Jan Kara @ 2026-06-16 12:28 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-15-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> infrastructure for ext4.
> 
> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> out either a mapped partial block or a dirty, unwritten partial block.
> 
> Important constraints:
> 
> Zeroing out under an active journal handle can cause deadlock, because
> the order of acquiring the folio lock and starting a handle is
> inconsistent with the iomap writeback path.
> 
> Therefore, ext4_iomap_block_zero_range():
> - Must NOT be called under an active handle.
> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>   before updating i_disksize (for the cases of post-EOF append write,
>   post-EOF fallocate, and truncate up). In subsequent patches, we will
>   address this by synchronizing commit I/O but doesn't waiting for
>   completion, and updating i_disksize to i_size only after the zeroed
>   data has been written back.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6fe42d012fc..e0dae2501292 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_zero_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);

This looks like a layering violation to me. I don't think you can safely
assume the iomap you're passed is a part of iomap_iter...

> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	unsigned int iomap_flags = 0;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> +		return -EINVAL;
> +
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> +
> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (start >> blkbits) - map.m_lblk;
> +	}

... and you need access to iter only for this which seems to be really a
hack that's trying to outsmart the iomap code. I have to admit I don't
fully understand what you are trying to achieve here. Are you trying to
avoid flushing of the range that will be zeroed out?

> +	ret = iomap_zero_range(inode, from, length, did_zero,
> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> +			       NULL);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * TODO: The iomap does not distinguish between different types of
> +	 * zeroing and always sets zero_written if a zeroing operation is
> +	 * performed, which may result in unnecessary order operations.
> +	 */

Is this still true after your fix to did_zero handling?

> +	if (did_zero && zero_written)
> +		*zero_written = *did_zero;
> +
> +	return 0;
> +}
> +
>  /*
>   * Zeros out a mapping of length 'length' starting from file offset
>   * 'from'.  The range to be zero'd must be contained with in one block.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 10/23] ext4: implement mmap path using iomap
From: Jan Kara @ 2026-06-16 11:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-11-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:30, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path
> for ext4. The heavy lifting is delegated to iomap_page_mkwrite(), which
> only requires ext4_iomap_buffered_write_ops and
> ext4_iomap_buffered_da_write_ops to allocate and map blocks.
> 
> Note that the lock ordering between folio lock and transaction start in
> this path is reversed compared to the buffer_head buffered write path.
> The lock ordering documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  8 ++++++--
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a80195bd6f20..c6fe42d012fc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4020,7 +4020,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  		return -ERANGE;
>  	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>  		return -EINVAL;
> -	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +	if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
>  		return -EINVAL;
>  
>  	if (delalloc)
> @@ -4080,6 +4080,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>  		return 0;
>  
> +	/*
> +	 * iomap_page_mkwrite() will never fail in a way that requires delalloc
> +	 * extents that it allocated to be revoked.  Hence never try to release
> +	 * them here.
> +	 */
> +	if (flags & IOMAP_FAULT)
> +		return 0;
> +
>  	/* Nothing to do if we've written the entire delalloc extent */
>  	start_byte = iomap_last_written_block(inode, offset, written);
>  	end_byte = round_up(offset + length, i_blocksize(inode));
> @@ -7191,6 +7199,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>  	return ret;
>  }
>  
> +static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	const struct iomap_ops *iomap_ops;
> +
> +	/*
> +	 * ext4_nonda_switch() could writeback this folio, so have to
> +	 * call it before lock folio.
> +	 */
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_page_mkwrite(vmf, iomap_ops, NULL);
> +}
> +
>  vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -7213,6 +7238,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(mapping);
>  
> +	if (ext4_inode_buffered_iomap(inode)) {
> +		ret = ext4_iomap_page_mkwrite(vmf);
> +		goto out;
> +	}
> +
>  	err = ext4_convert_inline_data(inode);
>  	if (err)
>  		goto out_ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 51d87db53543..62bfe05a64bc 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   * Lock ordering
>   *
>   * page fault path:
> - * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
> - *   -> page lock -> i_data_sem (rw)
> + * - buffer_head path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     transaction start -> folio lock -> i_data_sem (rw)
> + * - iomap path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     folio lock -> transaction start -> i_data_sem (rw)
>   *
>   * buffered write path:
>   * sb_start_write -> i_rwsem (w) -> mmap_lock
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Jan Kara @ 2026-06-16 11:47 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-10-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:29, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> 
>  - ext4_iomap_writepages(): the main writeback entry point.
>  - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>    block mapping and I/O submission.
>  - A new end I/O worker for converting unwritten extents, updating file
>    size, and handling DATA_ERR_ABORT after I/O completion.
> 
> Core implementation details:
> 
>  - ->writeback_range() callback
>    Calls ext4_iomap_map_writeback_range() to query the longest range of
>    existing mapped extents. For performance, when a block range is not
>    yet allocated, it allocates based on the writeback length and delalloc
>    extent length, rather than allocating for a single folio at a time.
>    The folio is then added to an iomap_ioend instance.
> 
>  - ->writeback_submit() callback
>    Registers ext4_iomap_end_bio() as the end bio callback. This callback
>    schedules a worker to handle:
>    - Unwritten extent conversion.
>    - i_disksize update after data is written back.
>    - Journal abort on writeback I/O failure.
> 
> Key changes and considerations:
> 
> - Append write and unwritten extents
>   Since data=ordered mode is not used to prevent stale data exposure
>   during append writebacks, new blocks are always allocated as unwritten
>   extents (i.e. always enable dioread_nolock), and i_disksize update is
>   postponed until I/O completion. Additionally, the deadlock that the
>   reserve handle was expected to resolve does not occur anymore.
>   Therefore, the end I/O worker can start a normal journal handle
>   instead of a reserve handle when converting unwritten extents.
> 
> - Lock ordering
>   The ->writeback_range() callback runs under the folio lock, requiring
>   the journal handle to be started under that same lock. This reverses
>   the order compared to the buffer_head writeback path. The lock ordering
>   documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h        |   4 +
>  fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>  fs/ext4/super.c       |   7 +-
>  fs/iomap/ioend.c      |   3 +-
>  include/linux/iomap.h |   1 +
>  6 files changed, 346 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4832e7f7db82..078feda47e36 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>  	 */
>  	struct list_head i_rsv_conversion_list;
>  	struct work_struct i_rsv_conversion_work;
> +	struct list_head i_iomap_ioend_list;
> +	struct work_struct i_iomap_ioend_work;

Ugh, this adds 48 bytes to ext4 inode. That's pretty heavy. Cannot we reuse
i_rsv_conversion_list / work for this? For each inode only one of them
should be used AFAICS.

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ae7d3f4a1c8..a80195bd6f20 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -44,6 +44,7 @@
>  #include <linux/iversion.h>
>  
>  #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
>  #include "xattr.h"
>  #include "acl.h"
>  #include "truncate.h"
> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>  	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>  }
>  
> +static int ext4_iomap_map_one_extent(struct inode *inode,
> +				     struct ext4_map_blocks *map)
> +{
> +	struct extent_status es;
> +	handle_t *handle = NULL;
> +	int credits, map_flags;
> +	int retval;
> +
> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	map->m_flags = 0;
> +	/*
> +	 * It is necessary to look up extent and map blocks under i_data_sem
> +	 * in write mode, otherwise, the delalloc extent may become stale
> +	 * during concurrent truncate operations.
> +	 */
> +	ext4_fc_track_inode(handle, inode);
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> +
> +		if (ext4_es_is_delayed(&es)) {
> +			map->m_flags |= EXT4_MAP_DELAYED;
> +			trace_ext4_da_write_pages_extent(inode, map);
> +			/*
> +			 * Call ext4_map_create_blocks() to allocate any
> +			 * delayed allocation blocks. It is possible that
> +			 * we're going to need more metadata blocks, however
> +			 * we must not fail because we're in writeback and
> +			 * there is nothing we can do so it might result in
> +			 * data loss. So use reserved blocks to allocate
> +			 * metadata if possible.
> +			 */
> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> +				    EXT4_EX_NOCACHE;
> +
> +			retval = ext4_map_create_blocks(handle, inode, map,
> +							map_flags);
> +			if (retval > 0)
> +				ext4_fc_track_range(handle, inode, map->m_lblk,
> +						map->m_lblk + map->m_len - 1);
> +			goto out;
> +		} else if (unlikely(ext4_es_is_hole(&es)))
> +			goto out;
> +
> +		/* Found written or unwritten extent. */
> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> +		map->m_flags = ext4_es_is_written(&es) ?
> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> +		goto out;
> +	}
> +
> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> +out:
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_journal_stop(handle);
> +	return retval < 0 ? retval : 0;
> +}
> +
> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> +					  loff_t offset, unsigned int dirty_len)
> +{
> +	struct inode *inode = wpc->inode;
> +	struct super_block *sb = inode->i_sb;
> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> +	struct ext4_map_blocks map;
> +	unsigned int blkbits = inode->i_blkbits;
> +	unsigned int index = offset >> blkbits;
> +	unsigned int blk_end, blk_len;
> +	int ret;
> +
> +	ret = ext4_emergency_state(sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Check validity of the cached writeback mapping. */
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> +	    ext4_iomap_valid(inode, &wpc->iomap))
> +		return 0;
> +
> +	blk_len = dirty_len >> blkbits;
> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> +				      (UINT_MAX - 1));
> +	if (blk_end > index + blk_len)
> +		blk_len = blk_end - index + 1;
> +
> +retry:
> +	map.m_lblk = index;
> +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> +	ret = ext4_map_blocks(NULL, inode, &map,
> +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * The map is not a delalloc extent, it must either be a hole
> +	 * or an extent which have already been allocated.
> +	 */
> +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> +		goto out;
> +
> +	/* Map one delalloc extent. */
> +	ret = ext4_iomap_map_one_extent(inode, &map);

So it looks somewhat strange that here we call ext4_map_blocks() (which
consults extent status tree and then possibly on-disk extent tree) and then
we call ext4_iomap_map_one_extent() which manipulates with the extent
status tree and possibly extent tree as well. Is all this complexity to
avoid starting a jbd2 handle unless really needed? If yes, is that really
worth it? Given iomap code caches the extent we'd start the transaction
only once per mapped extent which shouldn't be that bad?

If you have some benchmark showing this is really worth it, then I'd
probably prefer coming up with an ext4_get_blocks flag which tells it to
start a transaction on its own if we need to allocate blocks... That would
be much simpler than opencoding all this.

> +	if (ret < 0) {
> +		if (ext4_emergency_state(sb))
> +			return ret;
> +
> +		/*
> +		 * Retry transient ENOSPC errors, if
> +		 * ext4_count_free_blocks() is non-zero, a commit
> +		 * should free up blocks.
> +		 */
> +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> +			jbd2_journal_force_commit_nested(journal);
> +			goto retry;
> +		}
> +
> +		ext4_msg(sb, KERN_CRIT,
> +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> +			 inode->i_ino, (unsigned long long)map.m_lblk,
> +			 (unsigned int)map.m_len, -ret);
> +		ext4_msg(sb, KERN_CRIT,
> +			 "This should not happen!! Data will be lost\n");
> +		if (ret == -ENOSPC)
> +			ext4_print_free_blocks(inode);
> +		return ret;
> +	}
> +out:
> +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> +	return 0;
> +}
> +

...

> +void ext4_iomap_end_bio(struct bio *bio)
> +{
> +	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> +	struct inode *inode = ioend->io_inode;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	unsigned long flags;
> +
> +	/* Needs to convert unwritten extents or update the i_disksize. */
> +	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
> +	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> +		goto defer;
> +
> +	/* Needs to abort the journal on data_err=abort.  */
> +	if (unlikely(ioend->io_bio.bi_status))
> +		goto defer;
> +
> +	iomap_finish_ioend(ioend, 0);
> +	return;
> +defer:
> +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> +	if (list_empty(&ei->i_iomap_ioend_list))
> +		queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
> +	list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
> +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> +}

For now, I'd prefer to do what XFS does and offload everything. Then you
don't have to export iomap_finish_ioend() (which would need to be in a
separate patch and acked by iomap maintainers) and the code is more
standard. There's a patchset in the works which adds general ioend offloading
infrastructure into iomap [1] and when that lands we should get all these
bells and whistles (even better ones with percpu work queues, batching,
etc.) for free.

[1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Jan Kara @ 2026-06-16 10:45 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-9-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:28, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce two new iomap_ops instances for ext4 buffered writes:
> 
>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>    ext4_da_map_blocks() to map delalloc extents.
>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>    ext4_iomap_get_blocks() to directly allocate blocks.
> 
> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> validity.
> 
> Key changes and considerations:
> 
>  - Unwritten extents for new blocks (dioread_nolock always on)
>    Since data=ordered mode is not used to prevent stale data exposure in
>    the non-delayed allocation path, new blocks are always allocated as
>    unwritten extents.
> 
>  - Short write and write failure handling
>    a. Delalloc path: On short write or failure, the stale delalloc range
>       must be dropped and its space reservation released. Otherwise, a
>       clean folio may cover leftover delalloc extents, causing
>       inaccurate space reservation accounting.
>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>       short write.
> 
>  - Lock ordering reversal
>    The folio lock and transaction start ordering is reversed compared to
>    the buffer_head buffered write path. To handle this, the journal
>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>    documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good to me - besides the IOMAP_F_NEW bugs Ojaswin found. One
observation I have here is that since the old indirect block based on-disk
format doesn't support unwritten extents we can never transition it to the
iomap scheme used here. So we'll have to figure out some way to avoid
maintaining two (actually three if we count data=journal) buffered write /
writeback paths in the long term. But let's address that once things settle
for the common paths.

								Honza

> ---
>  fs/ext4/ext4.h  |   4 ++
>  fs/ext4/file.c  |  20 +++++-
>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  10 ++-
>  4 files changed, 192 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1e27d73d7427..4832e7f7db82 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
>  void ext4_set_inode_mapping_order(struct inode *inode);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2
>  
> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>  
>  extern const struct iomap_ops ext4_iomap_ops;
>  extern const struct iomap_ops ext4_iomap_report_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> +extern const struct iomap_write_ops ext4_iomap_write_ops;
>  
>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>  {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..7f9bfbbc4a4e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return count;
>  }
>  
> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> +					 struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	const struct iomap_ops *iomap_ops;
> +
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_file_buffered_write(iocb, from, iomap_ops,
> +					 &ext4_iomap_write_ops, NULL);
> +}
> +
>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  					struct iov_iter *from)
>  {
> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  	if (ret <= 0)
>  		goto out;
>  
> -	ret = generic_perform_write(iocb, from);
> +	if (ext4_inode_buffered_iomap(inode))
> +		ret = ext4_iomap_buffered_write(iocb, from);
> +	else
> +		ret = generic_perform_write(iocb, from);
>  
>  out:
>  	inode_unlock(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 39577a6b65b9..1ae7d3f4a1c8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> -static int ext4_nonda_switch(struct super_block *sb)
> +int ext4_nonda_switch(struct super_block *sb)
>  {
>  	s64 free_clusters, dirty_clusters;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
>  }
>  
> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> +{
> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> +}
> +
> +const struct iomap_write_ops ext4_iomap_write_ops = {
> +	.iomap_valid = ext4_iomap_valid,
> +};
> +
>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  			   struct ext4_map_blocks *map, loff_t offset,
>  			   loff_t length, unsigned int flags)
> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		iomap->flags |= IOMAP_F_MERGED;
>  
> +	iomap->validity_cookie = map->m_seq;
> +
>  	/*
>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> +/* Map blocks */
> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> +
>  static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> -		loff_t length, struct ext4_map_blocks *map)
> +		loff_t length, ext4_get_blocks_t get_blocks,
> +		struct ext4_map_blocks *map)
>  {
>  	u8 blkbits = inode->i_blkbits;
>  
> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
>  	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
>  
> +	if (get_blocks)
> +		return get_blocks(inode, map);
> +
>  	return ext4_map_blocks(NULL, inode, map, 0);
>  }
>  
> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>  		return -ERANGE;
>  
> -	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>  	if (ret < 0)
>  		return ret;
>  
> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_get_blocks(struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	loff_t i_size = i_size_read(inode);
> +	handle_t *handle;
> +	int ret;
> +
> +	/*
> +	 * Check if the blocks have already been allocated, this could
> +	 * avoid initiating a new journal transaction and return the
> +	 * mapping information directly.
> +	 */
> +	if ((map->m_lblk + map->m_len) <=
> +	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> +		if (ret < 0)
> +			return ret;
> +		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> +				    EXT4_MAP_DELAYED))
> +			return 0;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> +			ext4_chunk_trans_blocks(inode, map->m_len));
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	ret = ext4_map_blocks(handle, inode, map,
> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> +	/*
> +	 * Stop handle here following the lock ordering of the folio lock
> +	 * and the transaction start.
> +	 */
> +	ext4_journal_stop(handle);
> +
> +	return ret;
> +}
> +
> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> +{
> +	int ret, retries = 0;
> +	struct ext4_map_blocks map;
> +	ext4_get_blocks_t *get_blocks;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Inline data and non-extent are not supported. */
> +	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> +		return -ERANGE;
> +	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> +		return -EINVAL;
> +	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +		return -EINVAL;
> +
> +	if (delalloc)
> +		get_blocks = ext4_da_map_blocks;
> +	else
> +		get_blocks = ext4_iomap_get_blocks;
> +retry:
> +	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +	if (ret < 0)
> +		return ret;
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	return 0;
> +}
> +
> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, false);
> +}
> +
> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> +						  iomap, srcmap, true);
> +}
> +
> +/*
> + * On write failure, drop the stale delayed allocation range and release
> + * its reserved space for both start and end blocks. Otherwise, we may
> + * leave a range of delayed extents covered by a clean folio, which can
> + * result in inaccurate space reservation accounting.
> + */
> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length, struct iomap *iomap)
> +{
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> +					    loff_t length, ssize_t written,
> +					    unsigned int flags,
> +					    struct iomap *iomap)
> +{
> +	loff_t start_byte, end_byte;
> +
> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> +		return 0;
> +
> +	/* Nothing to do if we've written the entire delalloc extent */
> +	start_byte = iomap_last_written_block(inode, offset, written);
> +	end_byte = round_up(offset + length, i_blocksize(inode));
> +	if (start_byte >= end_byte)
> +		return 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> +				     iomap, ext4_iomap_punch_delalloc);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	return 0;
> +}
> +
> +/*
> + * Since we always allocate unwritten extents, there is no need for
> + * iomap_end to clean up allocated blocks on a short write.
> + */
> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_write_begin,
> +};
> +
> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> +	.iomap_begin = ext4_iomap_buffered_da_write_begin,
> +	.iomap_end = ext4_iomap_buffered_da_write_end,
> +};
> +
>  const struct iomap_ops ext4_iomap_buffered_read_ops = {
>  	.iomap_begin = ext4_iomap_buffered_read_begin,
>  };
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..9bc294b769db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   *   -> page lock -> i_data_sem (rw)
>   *
>   * buffered write path:
> - * sb_start_write -> i_mutex -> mmap_lock
> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> - *   i_data_sem (rw)
> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> + * - buffer_head path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> + *     i_data_sem (rw)
> + * - iomap path:
> + *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> + *   sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
>   *
>   * truncate:
>   * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Jan Kara @ 2026-06-16 10:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-7-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:26, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The iomap buffered write path does not hold any locks between querying
> inode extent mapping information and performing buffered writes. It
> relies on the sequence counter saved in the inode to detect stale
> mappings.

Now that I'm looking at it again, I've got a bit confused here. Buffered
write path is holding i_rwsem between mapping blocks and using them so
there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
then ext4_da_map_blocks() should not ever get called in the writeback path
because it is allocating delayed blocks... So this change looks unnecessary
to me now. Am I missing something?

								Honza

> 
> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> blocks") added the m_seq field to ext4_map_blocks to pass out extent
> sequence numbers, but it missed two callsites within
> ext4_da_map_blocks(). These callsites are on the delayed allocation
> path, which is also used in the iomap buffered write path. Pass out the
> sequence counter to ensure stale mappings can be detected.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c4d9137b279..39577a6b65b9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	ext4_check_map_extents_env(inode);
>  
>  	/* Lookup extent status tree firstly */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>  	 * is held in write mode, before inserting a new da entry in
>  	 * the extent status tree.
>  	 */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>  		map->m_len = min_t(unsigned int, map->m_len,
>  				   es.es_len - (map->m_lblk - es.es_lblk));
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
From: Jan Kara @ 2026-06-16 10:01 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-8-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:27, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The data=ordered mode introduces two fundamental conflicts with the
> iomap buffered write path, leading to potential deadlocks.
> 
> 1) Lock ordering conflict
>    In the iomap writeback path, each folio is processed sequentially:
>    the folio lock is acquired first, followed by starting a transaction
>    to create block mappings. In data=ordered mode, writeback triggered
>    by the journal commit process may attempt to acquire a folio lock
>    that is already held by iomap. Meanwhile, iomap, under that same
>    folio lock, may start a new transaction and wait for the currently
>    committing transaction to finish, resulting in a deadlock.
> 
> 2) Partial folio submission not supported
>    When block size is smaller than folio size, a folio may contain both
>    mapped and unmapped blocks. In data=ordered mode, if the journal
>    waits for such a folio to be written back while the regular writeback
>    process has already started committing it (with the writeback flag
>    set), mapping the remaining unmapped blocks can deadlock. This is
>    because the writeback flag is cleared only after the entire folio is
>    processed and committed.
> 
> To support data=ordered mode, the iomap core would need two invasive
> changes:
>  - Acquire the transaction handle before locking any folio for
>    writeback.
>  - Support partial folio submission.
> 
> Both changes are complicated and risk performance regressions.
> Therefore, we must avoid using data=ordered mode when converting to the
> iomap path.
> 
> Currently, data=ordered mode is used in three scenarios:
>  - Append write
>  - Post-EOF partial block truncate-up followed by append write
>  - Online defragmentation
> 
> We can address the first two without data=ordered mode:
>  - For append write: always allocate unwritten blocks (i.e. always
>    enable dioread_nolock), preserving the behavior of current
>    extent-type inodes.
>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>    until after the zeroed partial block has been written back.
> 
> Online defragmentation does not yet support iomap; this can be resolved
> separately in the future.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..26999f173870 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>  
>  static inline int ext4_should_order_data(struct inode *inode)
>  {
> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> +	/*
> +	 * inodes using the iomap buffered I/O path do not use the
> +	 * data=ordered mode.
> +	 */
> +	return !ext4_inode_buffered_iomap(inode) &&
> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>  }
>  
>  static inline int ext4_should_writeback_data(struct inode *inode)
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3 0/3] f2fs: support encrypted inline data
From: LiaoYuanhong-vivo @ 2026-06-16  9:46 UTC (permalink / raw)
  To: ebiggers
  Cc: chao, corbet, jaegeuk, liaoyuanhong, linux-doc, linux-ext4,
	linux-f2fs-devel, linux-fscrypt, linux-kernel, skhan, tytso
In-Reply-To: <20260615193728.GA1764@quark>

Hi Eric,

Thanks for the explanation.

I understand the concern about deriving software contents keys from
sw_secret for hardware-wrapped-key files. I agree this is not the right
security model, and I will stop pursuing this direction for now.

Could you share more about the direction you have in mind for simplifying
f2fs/ext4 contents encryption around blk-crypto?

For f2fs inline_data, there is still a real space-saving benefit on phones,
since many encrypted files are smaller than 4K. Is there any acceptable
future direction to support this kind of inode-resident data with
blk-crypto or hardware-wrapped keys?

Thanks,
Liao Yuanhong

^ permalink raw reply

* Re: [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr()
From: Jan Kara @ 2026-06-16  9:36 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-4-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:23, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor the error handling in ext4_setattr() for better clarity:
> 
>  - Return directly on ext4_break_layouts() failure.
>  - Propagate ext4_truncate() errors using the existing error variable
>    and jump to the common 'err_out' label.
>  - Propagate posix_acl_chmod() errors also through the error variable,
>    as it theoretically does not return a non-fatal error.
> 
> With these changes, every error path either returns immediately or jumps
> to err_out. Consequently, the "if (!error)" condition guarding
> setattr_copy() and mark_inode_dirty() becomes unreachable for error
> cases. Remove this redundant check and the unused rc variable can be
> removed as well.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 35e958f89bd5..b1ef706987c3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5989,7 +5989,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		 struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> -	int error, rc = 0;
> +	int error;
>  	int orphan = 0;
>  	const unsigned int ia_valid = attr->ia_valid;
>  	bool inc_ivers = true;
> @@ -6102,10 +6102,10 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  
>  		filemap_invalidate_lock(inode->i_mapping);
>  
> -		rc = ext4_break_layouts(inode);
> -		if (rc) {
> +		error = ext4_break_layouts(inode);
> +		if (error) {
>  			filemap_invalidate_unlock(inode->i_mapping);
> -			goto err_out;
> +			return error;
>  		}
>  
>  		if (attr->ia_size > oldsize)
> @@ -6117,15 +6117,19 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  		}
>  
>  		filemap_invalidate_unlock(inode->i_mapping);
> +		if (error)
> +			goto err_out;
>  	}
>  
> -	if (!error) {
> -		if (inc_ivers)
> -			inode_inc_iversion(inode);
> -		setattr_copy(idmap, inode, attr);
> -		mark_inode_dirty(inode);
> -	}
> +	if (inc_ivers)
> +		inode_inc_iversion(inode);
> +	setattr_copy(idmap, inode, attr);
> +	mark_inode_dirty(inode);
>  
> +	if (ia_valid & ATTR_MODE)
> +		error = posix_acl_chmod(idmap, dentry, inode->i_mode);
> +
> +err_out:
>  	/*
>  	 * If the call to ext4_truncate failed to get a transaction handle at
>  	 * all, we need to clean up the in-core orphan list manually.
> @@ -6133,14 +6137,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (orphan && inode->i_nlink)
>  		ext4_orphan_del(NULL, inode);
>  
> -	if (!error && (ia_valid & ATTR_MODE))
> -		rc = posix_acl_chmod(idmap, dentry, inode->i_mode);
> -
> -err_out:
> -	if  (error)
> +	if (error)
>  		ext4_std_error(inode->i_sb, error);
> -	if (!error)
> -		error = rc;
>  	return error;
>  }
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]()
From: Jan Kara @ 2026-06-16  9:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-3-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:22, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Refactor ext4_setattr() by introducing two helper functions,
> ext4_truncate_up() and ext4_truncate_down(), to handle size changes. The
> current ATTR_SIZE processing consolidates checks for both shrinking and
> non-shrinking cases, leading to cluttered code. Separating the
> truncation paths improves readability.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Nice! Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 199 +++++++++++++++++++++++++++---------------------
>  1 file changed, 112 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0751dc55e94f..35e958f89bd5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5855,6 +5855,112 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
>  	}
>  }
>  
> +/*
> + * Set i_size and i_disksize to 'newsize'.
> + *
> + * Both i_rwsem and i_data_sem are required here to avoid races between
> + * generic append writeback and concurrent truncate that also modify
> + * i_size and i_disksize.
> + */
> +static inline void ext4_set_inode_size(struct inode *inode, loff_t newsize)
> +{
> +	WARN_ON_ONCE(S_ISREG(inode->i_mode) && !inode_is_locked(inode));
> +
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	i_size_write(inode, newsize);
> +	EXT4_I(inode)->i_disksize = newsize;
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +}
> +
> +static int ext4_truncate_up(struct inode *inode, loff_t oldsize, loff_t newsize)
> +{
> +	ext4_lblk_t old_lblk, new_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	if (!IS_ALIGNED(oldsize | newsize, i_blocksize(inode))) {
> +		ret = ext4_inode_attach_jinode(inode);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> +	if (!IS_ALIGNED(oldsize, i_blocksize(inode))) {
> +		ret = ext4_block_zero_eof(inode, oldsize, LLONG_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	old_lblk = oldsize > 0 ? (oldsize - 1) >> inode->i_blkbits : 0;
> +	new_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, old_lblk, new_lblk);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * isize extend must be called outside an active handle due to
> +	 * the lock ordering of transaction start and folio lock in the
> +	 * iomap buffered I/O path (folio lock -> transaction start).
> +	 */
> +	pagecache_isize_extended(inode, oldsize, newsize);
> +	return 0;
> +}
> +
> +static int ext4_truncate_down(struct inode *inode, loff_t oldsize,
> +			      loff_t newsize, int *orphan)
> +{
> +	ext4_lblk_t start_lblk;
> +	handle_t *handle;
> +	int ret;
> +
> +	/* Do not change i_size. */
> +	if (newsize == oldsize)
> +		goto truncate;
> +
> +	/* Shrink. */
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	if (ext4_handle_valid(handle)) {
> +		ret = ext4_orphan_add(handle, inode);
> +		*orphan = 1;
> +		if (ret) {
> +			ext4_journal_stop(handle);
> +			return ret;
> +		}
> +	}
> +
> +	start_lblk = newsize > 0 ? (newsize - 1) >> inode->i_blkbits : 0;
> +	ext4_fc_track_range(handle, inode, start_lblk, EXT_MAX_BLOCKS - 1);
> +
> +	ext4_set_inode_size(inode, newsize);
> +
> +	ret = ext4_mark_inode_dirty(handle, inode);
> +	ext4_journal_stop(handle);
> +	if (ret)
> +		return ret;
> +
> +	if (ext4_should_journal_data(inode))
> +		ext4_wait_for_tail_page_commit(inode);
> +truncate:
> +	/*
> +	 * Truncate pagecache after we've waited for commit in data=journal
> +	 * mode to make pages freeable.  Call ext4_truncate() even if
> +	 * i_size didn't change to truncatea possible preallocated blocks.
> +	 */
> +	truncate_pagecache(inode, newsize);
> +	return ext4_truncate(inode);
> +}
> +
>  /*
>   * ext4_setattr()
>   *
> @@ -5951,7 +6057,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  	}
>  
>  	if (attr->ia_valid & ATTR_SIZE) {
> -		handle_t *handle;
>  		loff_t oldsize = inode->i_size;
>  		int shrink = (attr->ia_size < inode->i_size);
>  
> @@ -6003,94 +6108,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  			goto err_out;
>  		}
>  
> -		if (attr->ia_size != inode->i_size) {
> -			/* attach jbd2 jinode for EOF folio tail zeroing */
> -			if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
> -			    oldsize & (inode->i_sb->s_blocksize - 1)) {
> -				error = ext4_inode_attach_jinode(inode);
> -				if (error)
> -					goto out_mmap_sem;
> -			}
> -
> -			/*
> -			 * Update c/mtime and tail zero the EOF folio on
> -			 * truncate up. ext4_truncate() handles the shrink case
> -			 * below.
> -			 */
> -			if (!shrink) {
> -				inode_set_mtime_to_ts(inode,
> -						      inode_set_ctime_current(inode));
> -				if (oldsize & (inode->i_sb->s_blocksize - 1)) {
> -					error = ext4_block_zero_eof(inode,
> -							oldsize, LLONG_MAX);
> -					if (error)
> -						goto out_mmap_sem;
> -				}
> -			}
> -
> -			handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
> -			if (IS_ERR(handle)) {
> -				error = PTR_ERR(handle);
> -				goto out_mmap_sem;
> -			}
> -			if (ext4_handle_valid(handle) && shrink) {
> -				error = ext4_orphan_add(handle, inode);
> -				orphan = 1;
> -				if (error)
> -					goto out_handle;
> -			}
> -
> -			if (shrink)
> -				ext4_fc_track_range(handle, inode,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits,
> -					EXT_MAX_BLOCKS - 1);
> -			else
> -				ext4_fc_track_range(
> -					handle, inode,
> -					(oldsize > 0 ? oldsize - 1 : oldsize) >>
> -					inode->i_sb->s_blocksize_bits,
> -					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> -					inode->i_sb->s_blocksize_bits);
> -
> -			/*
> -			 * We have to update i_size under i_data_sem together
> -			 * with i_disksize to avoid races with writeback code
> -			 * updating disksize in mpage_map_and_submit_extent().
> -			 */
> -			down_write(&EXT4_I(inode)->i_data_sem);
> -			i_size_write(inode, attr->ia_size);
> -			EXT4_I(inode)->i_disksize = attr->ia_size;
> -			up_write(&EXT4_I(inode)->i_data_sem);
> -
> -			error = ext4_mark_inode_dirty(handle, inode);
> -out_handle:
> -			ext4_journal_stop(handle);
> -			if (error)
> -				goto out_mmap_sem;
> -			if (!shrink) {
> -				pagecache_isize_extended(inode, oldsize,
> -							 inode->i_size);
> -			} else if (ext4_should_journal_data(inode)) {
> -				ext4_wait_for_tail_page_commit(inode);
> -			}
> +		if (attr->ia_size > oldsize)
> +			error = ext4_truncate_up(inode, oldsize, attr->ia_size);
> +		else {
> +			/* Shrink or do not change i_size. */
> +			error = ext4_truncate_down(inode, oldsize,
> +						   attr->ia_size, &orphan);
>  		}
>  
> -		/*
> -		 * Truncate pagecache after we've waited for commit
> -		 * in data=journal mode to make pages freeable.
> -		 */
> -		truncate_pagecache(inode, inode->i_size);
> -		/*
> -		 * Call ext4_truncate() even if i_size didn't change to
> -		 * truncate possible preallocated blocks.
> -		 */
> -		if (attr->ia_size <= oldsize) {
> -			rc = ext4_truncate(inode);
> -			if (rc)
> -				error = rc;
> -		}
> -out_mmap_sem:
>  		filemap_invalidate_unlock(inode->i_mapping);
>  	}
>  
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3] ext4: fix circular lock dependency in ext4_ext_migrate
From: Jan Kara @ 2026-06-16  9:07 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: jack, linux-ext4, linux-kernel, tytso, adilger.kernel, libaokun,
	ojaswin, ritesh.list, yi.zhang, ebiggers
In-Reply-To: <00673f65-cfd4-4042-93cf-cb04ad1d92fb@windriver.com>

Hello Yun!

On Tue 16-06-26 15:51:13, Zhou, Yun wrote:
> > Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
> > circular lock dependency between s_writepages_rwsem and sb_internal
> > (freeze protection).
> > 
> > The deadlock scenario:
> > 
> >    CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
> >    ----                           ----
> >    ext4_ext_migrate()
> >      ext4_writepages_down_write()
> >        s_writepages_rwsem (write)
> >                                   ext4_evict_inode()
> >                                     sb_start_intwrite()   [sb_internal]
> >                                     ...
> >                                       ext4_writepages()
> >                                         s_writepages_rwsem (read) [BLOCKED]
> >      iput(tmp_inode)
> >        ext4_evict_inode()
> >          sb_start_intwrite()         [BLOCKED]
> > 
> > The tmp_inode is a temporary inode with nlink=0 created solely for
> > building the extent tree.  Its eviction does not require
> > s_writepages_rwsem protection, so deferring iput() until after
> > releasing the rwsem is safe.
> > 
> > Reported-by: syzbot+212e8f62790f8e0bc63b@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
> > Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
> > Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---
> > v3: fixes Reported-by tag and Closes tag.
> > 
> > v2: remove redundant null pointer check for iput(tmp_inode).
> > 
> >   fs/ext4/migrate.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Thank you very much for taking the time to review these patches and
> providing your valuable suggestions. I am eager to solve these long-standing
> deadlock issues on Syzkaller, but I do not have much community experience.
> I'd like to know, regarding this patch, should I launch a new RR thread or
> continue waiting? BR, Yun

Please keep waiting. On Sunday the merge window for 7.2-rc1 has opened.
That means that about week before and at least for the next week,
maintainers often aren't taking any patches in their trees. I expect Ted to
pick your patch later when he collects fixes to send for 7.2-rc2 or so - he
sends email about that as a reply to the patch. If nothing happens for next
two weeks, I suggest you send a ping asking whether the patch didn't get
lost as a reply to your patch submission. Thanks for your fixes!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3] ext4: fix circular lock dependency in ext4_ext_migrate
From: Zhou, Yun @ 2026-06-16  7:51 UTC (permalink / raw)
  To: jack
  Cc: linux-ext4, linux-kernel, tytso, adilger.kernel, libaokun,
	ojaswin, ritesh.list, yi.zhang, ebiggers
In-Reply-To: <20260612005330.1930804-1-yun.zhou@windriver.com>

> Move iput(tmp_inode) after ext4_writepages_up_write() to avoid a
> circular lock dependency between s_writepages_rwsem and sb_internal
> (freeze protection).
>
> The deadlock scenario:
>
>    CPU0 (EXT4_IOC_MIGRATE)        CPU1 (orphan cleanup during mount)
>    ----                           ----
>    ext4_ext_migrate()
>      ext4_writepages_down_write()
>        s_writepages_rwsem (write)
>                                   ext4_evict_inode()
>                                     sb_start_intwrite()   [sb_internal]
>                                     ...
>                                       ext4_writepages()
>                                         s_writepages_rwsem (read) [BLOCKED]
>      iput(tmp_inode)
>        ext4_evict_inode()
>          sb_start_intwrite()         [BLOCKED]
>
> The tmp_inode is a temporary inode with nlink=0 created solely for
> building the extent tree.  Its eviction does not require
> s_writepages_rwsem protection, so deferring iput() until after
> releasing the rwsem is safe.
>
> Reported-by: syzbot+212e8f62790f8e0bc63b@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=212e8f62790f8e0bc63b
> Fixes: cb85f4d23f79 ("ext4: fix race between writepages and enabling EXT4_EXTENTS_FL")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> v3: fixes Reported-by tag and Closes tag.
>
> v2: remove redundant null pointer check for iput(tmp_inode).
>
>   fs/ext4/migrate.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
Hi Honza,

Thank you very much for taking the time to review these patches and 
providing your valuable suggestions. I am eager to solve these 
long-standing deadlock issues on Syzkaller, but I do not have much 
community experience. I'd like to know, regarding this patch, should I 
launch a new RR thread or continue waiting? BR, Yun

^ permalink raw reply

* Re: [GIT PULL] udf, isofs, ext2, quota fixes and cleanups for 7.2-rc1
From: pr-tracker-bot @ 2026-06-16  7:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linus Torvalds, linux-fsdevel, linux-ext4
In-Reply-To: <jpb2wjkfh7j73dpinuvwe5qddkr54px64vmvx6bclzd5ter6hk@mjwwwgohsyxy>

The pull request you sent on Mon, 15 Jun 2026 16:51:01 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fs_for_v7.2-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/974b3dec2bfb4b2726a6194105cb048a9dab0626

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH] ext4: Remove ext4_end_buffer_io_sync()
From: Jan Kara @ 2026-06-15 20:41 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Theodore Ts'o, Harshad Shirwadkar, Andreas Dilger, Baokun Li,
	Jan Kara, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4
In-Reply-To: <20260615182527.2208479-1-willy@infradead.org>

On Mon 15-06-26 19:25:25, Matthew Wilcox (Oracle) wrote:
> There's no need for a custom end_io routine here.  We lose some
> tracing of I/O completions, but we gain better error handling.
> Well, consistent error handling anyway.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Yes, I think this is better. I was also considering that when revieweing
your patch series but decided to leave that as a future cleanup. The future
is now! :) Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/fast_commit.c | 21 +--------------------
>  1 file changed, 1 insertion(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 5773b85e43cb..6b00a7285344 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -184,25 +184,6 @@
>  #include <trace/events/ext4.h>
>  static struct kmem_cache *ext4_fc_dentry_cachep;
>  
> -static void ext4_end_buffer_io_sync(struct bio *bio)
> -{
> -	struct buffer_head *bh;
> -	bool uptodate = bio_endio_bh(bio, &bh);
> -
> -	BUFFER_TRACE(bh, "");
> -	if (uptodate) {
> -		ext4_debug("%s: Block %lld up-to-date",
> -			   __func__, bh->b_blocknr);
> -		set_buffer_uptodate(bh);
> -	} else {
> -		ext4_debug("%s: Block %lld not up-to-date",
> -			   __func__, bh->b_blocknr);
> -		clear_buffer_uptodate(bh);
> -	}
> -
> -	unlock_buffer(bh);
> -}
> -
>  static inline void ext4_fc_reset_inode(struct inode *inode)
>  {
>  	struct ext4_inode_info *ei = EXT4_I(inode);
> @@ -662,7 +643,7 @@ static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
>  	lock_buffer(bh);
>  	set_buffer_dirty(bh);
>  	set_buffer_uptodate(bh);
> -	bh_submit(bh, REQ_OP_WRITE | write_flags, ext4_end_buffer_io_sync);
> +	bh_submit(bh, REQ_OP_WRITE | write_flags, bh_end_write);
>  	EXT4_SB(sb)->s_fc_bh = NULL;
>  }
>  
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH ext4 v4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Jan Kara @ 2026-06-15 20:35 UTC (permalink / raw)
  To: Xiang Mei
  Cc: linux-ext4, Jan Kara, Theodore Ts'o, Andreas Dilger,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <20260615190519.946736-1-xmei5@asu.edu>

On Mon 15-06-26 12:05:19, Xiang Mei wrote:
> ext4_read_inline_dir() can read a dirent header past the end of its inline
> buffer, triggering a slab-out-of-bounds read during getdents64():
> 
>   BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
>   Read of size 2 at addr ffff88800f3dd23c by task exploit/148
>    ...
>    __ext4_check_dir_entry
>    ext4_read_inline_dir
>    iterate_dir
> 
> The dirent payload lives in a buffer of exactly inline_size bytes:
> 
> 	dir_buf = kmalloc(inline_size, GFP_NOFS);
> 
> but iteration runs in a position space extra_offset bytes larger
> (extra_size = extra_offset + inline_size) so the synthetic "." and ".."
> land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
> extra_offset", yet the ext4_check_dir_entry() length argument uses the
> larger extra_size. A position whose dirent header would extend past
> extra_size is therefore accepted, and the rescan loop's rec_len probe and
> ext4_check_dir_entry() dereference de->rec_len before the entry is rejected.
> 
> Reject a position whose minimum-size dirent header would not fit within
> extra_size before forming de, in both the rescan and main loops, and pass
> inline_size rather than extra_size to ext4_check_dir_entry() so the length
> check matches the physical buffer.
> 
> Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Xiang Mei <xmei5@asu.edu>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> v4: use Jan's suggested logical-space bound (i/ctx->pos + rec_len >
>     extra_size), consistent with the loop condition.
> v3: dropped the unreachable ctx->pos < extra_offset underflow check.
> v2: validate against inline_size instead of extra_size.
> 
>  fs/ext4/inline.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..f1f7104d3dac 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1454,6 +1454,8 @@ int ext4_read_inline_dir(struct file *file,
>  			/* for other entry, the real offset in
>  			 * the buf has to be tuned accordingly.
>  			 */
> +			if (i + ext4_dir_rec_len(1, NULL) > extra_size)
> +				break;
>  			de = (struct ext4_dir_entry_2 *)
>  				(dir_buf + i - extra_offset);
>  			/* It's too expensive to do a full
> @@ -1488,10 +1490,17 @@ int ext4_read_inline_dir(struct file *file,
>  			continue;
>  		}
>  
> +		/*
> +		 * de lives at dir_buf + ctx->pos - extra_offset, within the
> +		 * kmalloc(inline_size) buffer.  Make sure its header fits before
> +		 * ext4_check_dir_entry() dereferences de->rec_len.
> +		 */
> +		if (ctx->pos + ext4_dir_rec_len(1, NULL) > extra_size)
> +			goto out;
>  		de = (struct ext4_dir_entry_2 *)
>  			(dir_buf + ctx->pos - extra_offset);
>  		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> -					 extra_size, ctx->pos))
> +					 inline_size, ctx->pos))
>  			goto out;
>  		if (le32_to_cpu(de->inode)) {
>  			if (!dir_emit(ctx, de->name, de->name_len,
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3 0/3] f2fs: support encrypted inline data
From: Eric Biggers @ 2026-06-15 19:37 UTC (permalink / raw)
  To: LiaoYuanhong-vivo
  Cc: Jaegeuk Kim, Chao Yu, Jonathan Corbet, Shuah Khan,
	Theodore Y. Ts'o, open list:F2FS FILE SYSTEM, open list,
	open list:DOCUMENTATION,
	open list:FSCRYPT: FILE SYSTEM LEVEL ENCRYPTION SUPPORT,
	linux-ext4
In-Reply-To: <20260615125517.362294-1-liaoyuanhong@vivo.com>

[+Cc linux-ext4@vger.kernel.org]

On Mon, Jun 15, 2026 at 08:55:12PM +0800, LiaoYuanhong-vivo wrote:
> F2FS currently disables inline data for encrypted regular files because the
> inline payload is stored in the inode block and does not go through the
> regular bio-based fscrypt path.  This wastes space for small encrypted
> files on Android devices using F2FS inlinecrypt.
> 
> This series adds an encrypted_inline_data on-disk feature for F2FS.
> With this feature enabled, encrypted regular files may keep small contents
> in the inode block.  The inline payload is encrypted before being stored in
> the inode and decrypted back into page-cache plaintext on read.
> 
> The fscrypt changes are scoped to filesystem-managed data-unit crypto.
> F2FS first asks fscrypt whether the inode's key/policy supports this path.
> It prepares the software transform only when encrypted inline payloads are
> read or written.  Inlinecrypt support is limited to v2 IV_INO_LBLK_64 and
> IV_INO_LBLK_32 policies, including the hardware-wrapped key configurations
> supported by fscrypt.  Per-file inlinecrypt keys and DIRECT_KEY policies
> are not supported for encrypted inline data.

I still think we should hold off on this, for the reasons I gave at
https://lore.kernel.org/r/20260515184124.GA4903@quark/

As soon as you start using hardware-wrapped keys this will become
irrelevant, as it can't be used in that case.  I see you added "support"
for that case anyway by deriving contents encryption keys from the
sw_secret.  But that bypasses the security model, which isn't okay.

I'm also working to simplify ext4 and f2fs's file contents encryption
implementation by standardizing on blk-crypto.  That aligns well with
what btrfs encryption is going to do as well.  So this isn't a great
time to be making f2fs's file contents encryption implementation even
more complex by going in a different direction.

If there was demand for this feature from the ext4 side for
general-purpose Linux distros as well, that would make it a bit more
appealing, as it would show broader demand.  But with the proposal being
f2fs-specific and effectively just for Android devices that *don't* use
wrapped keys, that feels too narrow for the added complexity.

This proposal also lacks test cases in xfstests and/or Android's
vts_kernel_encryption_test that verify that the inline data is actually
being encrypted correctly.  Those tests are essential, and we *must not*
add new file contents encryption implementations without such tests.

- Eric

^ permalink raw reply

* Re: [PATCH ext4 v3] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-15 19:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <cohsy6t4qa7pxub7knepgd6nckybqx2xt7ps33zh7fwkx6p3ac@dswnh7vxc572>

On Mon, Jun 15, 2026 at 2:27 AM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 13-06-26 15:18:36, Xiang Mei wrote:
> > ext4_read_inline_dir() can read a dirent header past the end of its inline
> > buffer, triggering a slab-out-of-bounds read during getdents64():
> >
> >   BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
> >   Read of size 2 at addr ffff88800f3dd23c by task exploit/148
> >    ...
> >    __ext4_check_dir_entry
> >    ext4_read_inline_dir
> >    iterate_dir
> >
> > The dirent payload lives in a buffer of exactly inline_size bytes:
> >
> >       dir_buf = kmalloc(inline_size, GFP_NOFS);
> >
> > but iteration runs in a position space extra_offset bytes larger
> > (extra_size = extra_offset + inline_size) so the synthetic "." and ".."
> > land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
> > extra_offset", yet the loop bound (ctx->pos < extra_size) and the
> > ext4_check_dir_entry() length argument both use the larger extra_size. A
> > ctx->pos that is valid in extra_size space but whose de lies past
> > inline_size is therefore accepted, and the rescan loop's rec_len probe
> > and ext4_check_dir_entry() dereference de->rec_len before the entry is
> > rejected.
> >
> > Bound the dirent header by inline_size in both loops: break out of the
> > rescan loop once a minimum-size header no longer fits, reject such a
> > position in the main loop before forming de, and pass inline_size rather
> > than extra_size to ext4_check_dir_entry().
> >
> > Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> > Reported-by: Weiming Shi <bestswngs@gmail.com>
> > Assisted-by: Claude:claude-opus-4-8
> > Signed-off-by: Xiang Mei <xmei5@asu.edu>
>
> Looks mostly good, just one simplification suggestion below:
>
> > diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> > index 8045e4ff270c..1266c8827cca 100644
> > --- a/fs/ext4/inline.c
> > +++ b/fs/ext4/inline.c
> > @@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
> >                       /* for other entry, the real offset in
> >                        * the buf has to be tuned accordingly.
> >                        */
> > +                     if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                         inline_size)
> > +                             break;
>
> Since 'i' lives in "dir logical offset space", it might be simpler to check
> this as:
>                         if (i + ext4_dir_rec_len(1, NULL) > extra_size)
>
>
> >                       de = (struct ext4_dir_entry_2 *)
> >                               (dir_buf + i - extra_offset);
> >                       /* It's too expensive to do a full
> > @@ -1488,10 +1491,18 @@ int ext4_read_inline_dir(struct file *file,
> >                       continue;
> >               }
> >
> > +             /*
> > +              * de lives at dir_buf + ctx->pos - extra_offset, within the
> > +              * kmalloc(inline_size) buffer.  Make sure its header fits before
> > +              * ext4_check_dir_entry() dereferences de->rec_len.
> > +              */
> > +             if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> > +                 inline_size)
>
> Similarly here this could be checked as:
>
>                 if (ctx->pos + ext4_dir_rec_len(1, NULL) > extra_size)
>
> which is both more consistent with the loop termination condition and saves
> the conversion from logical offset to physical offset.
>
> Otherwise the patch looks good.
Thanks so much for the tips and your review. V4 has been sent.

Xiang
>
>                                                                 Honza
>
> > +                     goto out;
> >               de = (struct ext4_dir_entry_2 *)
> >                       (dir_buf + ctx->pos - extra_offset);
> >               if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> > -                                      extra_size, ctx->pos))
> > +                                      inline_size, ctx->pos))
> >                       goto out;
> >               if (le32_to_cpu(de->inode)) {
> >                       if (!dir_emit(ctx, de->name, de->name_len,
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* [PATCH ext4 v4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-15 19:05 UTC (permalink / raw)
  To: linux-ext4, Jan Kara
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() can read a dirent header past the end of its inline
buffer, triggering a slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in __ext4_check_dir_entry
  Read of size 2 at addr ffff88800f3dd23c by task exploit/148
   ...
   __ext4_check_dir_entry
   ext4_read_inline_dir
   iterate_dir

The dirent payload lives in a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in a position space extra_offset bytes larger
(extra_size = extra_offset + inline_size) so the synthetic "." and ".."
land at their block-dir offsets. A dirent is formed at "dir_buf + pos -
extra_offset", yet the ext4_check_dir_entry() length argument uses the
larger extra_size. A position whose dirent header would extend past
extra_size is therefore accepted, and the rescan loop's rec_len probe and
ext4_check_dir_entry() dereference de->rec_len before the entry is rejected.

Reject a position whose minimum-size dirent header would not fit within
extra_size before forming de, in both the rescan and main loops, and pass
inline_size rather than extra_size to ext4_check_dir_entry() so the length
check matches the physical buffer.

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
v4: use Jan's suggested logical-space bound (i/ctx->pos + rec_len >
    extra_size), consistent with the loop condition.
v3: dropped the unreachable ctx->pos < extra_offset underflow check.
v2: validate against inline_size instead of extra_size.

 fs/ext4/inline.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..f1f7104d3dac 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,8 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i + ext4_dir_rec_len(1, NULL) > extra_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1490,17 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}

+		/*
+		 * de lives at dir_buf + ctx->pos - extra_offset, within the
+		 * kmalloc(inline_size) buffer.  Make sure its header fits before
+		 * ext4_check_dir_entry() dereferences de->rec_len.
+		 */
+		if (ctx->pos + ext4_dir_rec_len(1, NULL) > extra_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0

^ permalink raw reply related

* [PATCH] ext4: Remove ext4_end_buffer_io_sync()
From: Matthew Wilcox (Oracle) @ 2026-06-15 18:25 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Matthew Wilcox (Oracle), Harshad Shirwadkar, Andreas Dilger,
	Baokun Li, Jan Kara, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	linux-ext4

There's no need for a custom end_io routine here.  We lose some
tracing of I/O completions, but we gain better error handling.
Well, consistent error handling anyway.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/fast_commit.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 5773b85e43cb..6b00a7285344 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -184,25 +184,6 @@
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
 
-static void ext4_end_buffer_io_sync(struct bio *bio)
-{
-	struct buffer_head *bh;
-	bool uptodate = bio_endio_bh(bio, &bh);
-
-	BUFFER_TRACE(bh, "");
-	if (uptodate) {
-		ext4_debug("%s: Block %lld up-to-date",
-			   __func__, bh->b_blocknr);
-		set_buffer_uptodate(bh);
-	} else {
-		ext4_debug("%s: Block %lld not up-to-date",
-			   __func__, bh->b_blocknr);
-		clear_buffer_uptodate(bh);
-	}
-
-	unlock_buffer(bh);
-}
-
 static inline void ext4_fc_reset_inode(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -662,7 +643,7 @@ static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
 	lock_buffer(bh);
 	set_buffer_dirty(bh);
 	set_buffer_uptodate(bh);
-	bh_submit(bh, REQ_OP_WRITE | write_flags, ext4_end_buffer_io_sync);
+	bh_submit(bh, REQ_OP_WRITE | write_flags, bh_end_write);
 	EXT4_SB(sb)->s_fc_bh = NULL;
 }
 
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH v5 3/3] ext4: defer iput() on ea_inodes to reduce lock holding scope
From: Jan Kara @ 2026-06-15 17:02 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-4-yun.zhou@windriver.com>

On Mon 15-06-26 19:53:17, Yun Zhou wrote:
> ext4_xattr_block_set(), ext4_xattr_ibody_set() and ext4_xattr_set_entry()
> call iput() on ea_inodes while xattr_sem and a jbd2 handle are held.
> This creates an unnecessarily wide lock scope and, without the preceding
> !SB_ACTIVE guard, could trigger write_inode_now() -> s_writepages_rwsem
> creating a circular dependency.
> 
> Structurally eliminate iput-under-xattr_sem by collecting ea_inodes into
> a deferred-iput array and releasing them after locks are dropped:
> 
> 1. Convert all iput(ea_inode) calls within ext4_xattr_set_entry(),
>    ext4_xattr_ibody_set(), and ext4_xattr_block_set() to deferred iput
>    via ext4_expand_inode_array().  Thread the ea_inode_array through
>    ext4_xattr_set_handle(), ext4_xattr_move_to_block(), and
>    ext4_expand_extra_isize_ea() as an output parameter.
> 
> 2. Callers that own the journal handle (ext4_xattr_set,
>    ext4_expand_extra_isize) free the array AFTER ext4_journal_stop().
>    For ext4_try_to_expand_extra_isize (called from __ext4_mark_inode_dirty
>    with the caller's handle), free after xattr_sem release -- safe
>    because the preceding patches block the !SB_ACTIVE path.
> 
> 3. Callers that cannot control the handle lifetime (ext4_initxattrs,
>    __ext4_set_acl, ext4_set_context, inline data ops) pass NULL.
>    ext4_expand_inode_array() falls back to synchronous iput() when
>    ea_inode_array is NULL -- safe because these callers only run with
>    SB_ACTIVE where iput cannot trigger write_inode_now().
> 
> 4. Use __GFP_NOFAIL in ext4_expand_inode_array() to guarantee the
>    deferred-iput array never fails to grow, eliminating any fallback
>    to synchronous iput() under locks.
> 
> 5. Pass ea_inode_array directly to ext4_xattr_release_block() in
>    ext4_xattr_block_set() instead of using a local array freed
>    synchronously under xattr_sem.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Frankly, I hate how many places this touches and how this complicates
things, in particular when we don't really need this to fix the deadlock
because the previous two patches addressed the really problematic places.

That being said I fully agree dropping of ea_inodes xattr refs is a mess,
fragile, and there are potentially more deadlocks hiding there. The
architecture that might be more seamless would be something like having a
list of ea inodes to drop anchored in the sb and just queue work to drop
the list from a separate context to avoid the locking dependencies or doing
unexpected modifications in a transaction. That would allow even
simplifying the current code...

								Honza

> ---
>  fs/ext4/acl.c            |  2 +-
>  fs/ext4/crypto.c         |  4 +-
>  fs/ext4/inline.c         |  8 ++--
>  fs/ext4/inode.c          | 16 +++++--
>  fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
>  fs/ext4/xattr.h          | 10 +++--
>  fs/ext4/xattr_security.c |  3 +-
>  7 files changed, 85 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 3bffe862f954..21de8276b558 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
>  	}
>  
>  	error = ext4_xattr_set_handle(handle, inode, name_index, "",
> -				      value, size, xattr_flags);
> +				      value, size, xattr_flags, NULL);
>  
>  	kfree(value);
>  	if (!error)
> diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
> index f41f320f4437..bca760751c1d 100644
> --- a/fs/ext4/crypto.c
> +++ b/fs/ext4/crypto.c
> @@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
>  		res = ext4_xattr_set_handle(handle, inode,
>  					    EXT4_XATTR_INDEX_ENCRYPTION,
>  					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
> -					    ctx, len, XATTR_CREATE);
> +					    ctx, len, XATTR_CREATE, NULL);
>  		if (!res) {
>  			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
>  			ext4_clear_inode_state(inode,
> @@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
>  
>  	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
>  				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
> -				    ctx, len, 0);
> +				    ctx, len, 0, NULL);
>  	if (!res) {
>  		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
>  		/*
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..2bf2b771929d 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -309,7 +309,7 @@ static int ext4_create_inline_data(handle_t *handle,
>  		goto out;
>  	}
>  
> -	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
>  	if (error) {
>  		if (error == -ENOSPC)
>  			ext4_clear_inode_state(inode,
> @@ -386,7 +386,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
>  	i.value = value;
>  	i.value_len = len;
>  
> -	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
>  	if (error)
>  		goto out;
>  
> @@ -469,7 +469,7 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
>  	if (error)
>  		goto out;
>  
> -	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
>  	if (error)
>  		goto out;
>  
> @@ -1917,7 +1917,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
>  			i.value = value;
>  			i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
>  					i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
> -			err = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +			err = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
>  			if (err)
>  				goto out_error;
>  		}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 60c91c098fa0..a1b0f47f8f4f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6409,7 +6409,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
>  static int __ext4_expand_extra_isize(struct inode *inode,
>  				     unsigned int new_extra_isize,
>  				     struct ext4_iloc *iloc,
> -				     handle_t *handle, int *no_expand)
> +				     handle_t *handle, int *no_expand,
> +				     struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_inode *raw_inode;
>  	struct ext4_xattr_ibody_header *header;
> @@ -6454,7 +6455,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
>  
>  	/* try to expand with EAs present */
>  	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
> -					   raw_inode, handle);
> +					   raw_inode, handle, ea_inode_array);
>  	if (error) {
>  		/*
>  		 * Inode size expansion failed; don't try again
> @@ -6476,6 +6477,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
>  {
>  	int no_expand;
>  	int error;
> +	struct ext4_xattr_inode_array *ea_inode_array = NULL;
>  
>  	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
>  		return -EOVERFLOW;
> @@ -6507,8 +6509,11 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
>  		return -EBUSY;
>  
>  	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
> -					  handle, &no_expand);
> +					  handle, &no_expand,
> +					  &ea_inode_array);
>  	ext4_write_unlock_xattr(inode, &no_expand);
> +	/* Safe with caller's handle active: !SB_ACTIVE is blocked above */
> +	ext4_xattr_inode_array_free(ea_inode_array);
>  
>  	return error;
>  }
> @@ -6520,6 +6525,7 @@ int ext4_expand_extra_isize(struct inode *inode,
>  	handle_t *handle;
>  	int no_expand;
>  	int error, rc;
> +	struct ext4_xattr_inode_array *ea_inode_array = NULL;
>  
>  	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
>  		brelse(iloc->bh);
> @@ -6545,7 +6551,8 @@ int ext4_expand_extra_isize(struct inode *inode,
>  	}
>  
>  	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
> -					  handle, &no_expand);
> +					  handle, &no_expand,
> +					  &ea_inode_array);
>  
>  	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
>  	if (!error)
> @@ -6554,6 +6561,7 @@ int ext4_expand_extra_isize(struct inode *inode,
>  out_unlock:
>  	ext4_write_unlock_xattr(inode, &no_expand);
>  	ext4_journal_stop(handle);
> +	ext4_xattr_inode_array_free(ea_inode_array);
>  	return error;
>  }
>  
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..4eb83917e6b4 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1630,7 +1630,8 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
>  				struct ext4_xattr_search *s,
>  				handle_t *handle, struct inode *inode,
>  				struct inode *new_ea_inode,
> -				bool is_block)
> +				bool is_block,
> +				struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_entry *last, *next;
>  	struct ext4_xattr_entry *here = s->here;
> @@ -1848,7 +1849,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
>  
>  	ret = 0;
>  out:
> -	iput(old_ea_inode);
> +	ext4_expand_inode_array(ea_inode_array, old_ea_inode);
>  	return ret;
>  }
>  
> @@ -1898,7 +1899,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
>  static int
>  ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  		     struct ext4_xattr_info *i,
> -		     struct ext4_xattr_block_find *bs)
> +		     struct ext4_xattr_block_find *bs,
> +		     struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct super_block *sb = inode->i_sb;
>  	struct buffer_head *new_bh = NULL;
> @@ -1961,7 +1963,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  			}
>  			ea_bdebug(bs->bh, "modifying in-place");
>  			error = ext4_xattr_set_entry(i, s, handle, inode,
> -					     ea_inode, true /* is_block */);
> +					     ea_inode, true /* is_block */,
> +					     ea_inode_array);
>  			ext4_xattr_block_csum_set(inode, bs->bh);
>  			unlock_buffer(bs->bh);
>  			if (error == -EFSCORRUPTED)
> @@ -2030,7 +2033,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  	}
>  
>  	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
> -				     true /* is_block */);
> +				     true /* is_block */, ea_inode_array);
>  	if (error == -EFSCORRUPTED)
>  		goto bad_block;
>  	if (error)
> @@ -2150,7 +2153,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  					ext4_warning_inode(ea_inode,
>  							   "dec ref error=%d",
>  							   error);
> -				iput(ea_inode);
> +				ext4_expand_inode_array(ea_inode_array,
> +							ea_inode);
>  				ea_inode = NULL;
>  			}
>  
> @@ -2182,12 +2186,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  
>  	/* Drop the previous xattr block. */
>  	if (bs->bh && bs->bh != new_bh) {
> -		struct ext4_xattr_inode_array *ea_inode_array = NULL;
> -
>  		ext4_xattr_release_block(handle, inode, bs->bh,
> -					 &ea_inode_array,
> +					 ea_inode_array,
>  					 0 /* extra_credits */);
> -		ext4_xattr_inode_array_free(ea_inode_array);
>  	}
>  	error = 0;
>  
> @@ -2203,7 +2204,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>  			ext4_xattr_inode_free_quota(inode, ea_inode,
>  						    i_size_read(ea_inode));
>  		}
> -		iput(ea_inode);
> +		ext4_expand_inode_array(ea_inode_array, ea_inode);
>  	}
>  	if (ce)
>  		mb_cache_entry_put(ea_block_cache, ce);
> @@ -2253,7 +2254,8 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
>  
>  int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
>  				struct ext4_xattr_info *i,
> -				struct ext4_xattr_ibody_find *is)
> +				struct ext4_xattr_ibody_find *is,
> +				struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_ibody_header *header;
>  	struct ext4_xattr_search *s = &is->s;
> @@ -2273,7 +2275,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
>  			return PTR_ERR(ea_inode);
>  	}
>  	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
> -				     false /* is_block */);
> +				     false /* is_block */, ea_inode_array);
>  	if (error) {
>  		if (ea_inode) {
>  			int error2;
> @@ -2285,7 +2287,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
>  
>  			ext4_xattr_inode_free_quota(inode, ea_inode,
>  						    i_size_read(ea_inode));
> -			iput(ea_inode);
> +			ext4_expand_inode_array(ea_inode_array, ea_inode);
>  		}
>  		return error;
>  	}
> @@ -2297,7 +2299,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
>  		header->h_magic = cpu_to_le32(0);
>  		ext4_clear_inode_state(inode, EXT4_STATE_XATTR);
>  	}
> -	iput(ea_inode);
> +	ext4_expand_inode_array(ea_inode_array, ea_inode);
>  	return 0;
>  }
>  
> @@ -2348,7 +2350,8 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
>  int
>  ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>  		      const char *name, const void *value, size_t value_len,
> -		      int flags)
> +		      int flags,
> +		      struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_info i = {
>  		.name_index = name_index,
> @@ -2428,9 +2431,11 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>  
>  	if (!value) {
>  		if (!is.s.not_found)
> -			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +			error = ext4_xattr_ibody_set(handle, inode, &i, &is,
> +						     ea_inode_array);
>  		else if (!bs.s.not_found)
> -			error = ext4_xattr_block_set(handle, inode, &i, &bs);
> +			error = ext4_xattr_block_set(handle, inode, &i, &bs,
> +						     ea_inode_array);
>  	} else {
>  		error = 0;
>  		/* Xattr value did not change? Save us some work and bail out */
> @@ -2444,10 +2449,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>  			EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
>  			i.in_inode = 1;
>  retry_inode:
> -		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
> +		error = ext4_xattr_ibody_set(handle, inode, &i, &is,
> +					     ea_inode_array);
>  		if (!error && !bs.s.not_found) {
>  			i.value = NULL;
> -			error = ext4_xattr_block_set(handle, inode, &i, &bs);
> +			error = ext4_xattr_block_set(handle, inode, &i, &bs,
> +						     ea_inode_array);
>  		} else if (error == -ENOSPC) {
>  			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
>  				brelse(bs.bh);
> @@ -2456,11 +2463,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>  				if (error)
>  					goto cleanup;
>  			}
> -			error = ext4_xattr_block_set(handle, inode, &i, &bs);
> +			error = ext4_xattr_block_set(handle, inode, &i, &bs,
> +						     ea_inode_array);
>  			if (!error && !is.s.not_found) {
>  				i.value = NULL;
>  				error = ext4_xattr_ibody_set(handle, inode, &i,
> -							     &is);
> +							     &is, ea_inode_array);
>  			} else if (error == -ENOSPC) {
>  				/*
>  				 * Xattr does not fit in the block, store at
> @@ -2539,6 +2547,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
>  {
>  	handle_t *handle;
>  	struct super_block *sb = inode->i_sb;
> +	struct ext4_xattr_inode_array *ea_inode_array = NULL;
>  	int error, retries = 0;
>  	int credits;
>  
> @@ -2559,10 +2568,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
>  		int error2;
>  
>  		error = ext4_xattr_set_handle(handle, inode, name_index, name,
> -					      value, value_len, flags);
> +					      value, value_len, flags,
> +					      &ea_inode_array);
>  		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
>  					handle);
>  		error2 = ext4_journal_stop(handle);
> +		ext4_xattr_inode_array_free(ea_inode_array);
> +		ea_inode_array = NULL;
>  		if (error == -ENOSPC &&
>  		    ext4_should_retry_alloc(sb, &retries))
>  			goto retry;
> @@ -2604,7 +2616,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
>   */
>  static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
>  				    struct ext4_inode *raw_inode,
> -				    struct ext4_xattr_entry *entry)
> +				    struct ext4_xattr_entry *entry,
> +				    struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_ibody_find *is = NULL;
>  	struct ext4_xattr_block_find *bs = NULL;
> @@ -2668,14 +2681,14 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
>  		goto out;
>  
>  	/* Move ea entry from the inode into the block */
> -	error = ext4_xattr_block_set(handle, inode, &i, bs);
> +	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
>  	if (error)
>  		goto out;
>  
>  	/* Remove the chosen entry from the inode */
>  	i.value = NULL;
>  	i.value_len = 0;
> -	error = ext4_xattr_ibody_set(handle, inode, &i, is);
> +	error = ext4_xattr_ibody_set(handle, inode, &i, is, ea_inode_array);
>  
>  out:
>  	kfree(b_entry_name);
> @@ -2694,7 +2707,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
>  static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
>  				       struct ext4_inode *raw_inode,
>  				       int isize_diff, size_t ifree,
> -				       size_t bfree, int *total_ino)
> +				       size_t bfree, int *total_ino,
> +				       struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
>  	struct ext4_xattr_entry *small_entry;
> @@ -2744,7 +2758,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
>  			total_size += EXT4_XATTR_SIZE(
>  					      le32_to_cpu(entry->e_value_size));
>  		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
> -						 entry);
> +						 entry, ea_inode_array);
>  		if (error)
>  			return error;
>  
> @@ -2761,7 +2775,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
>   * Returns 0 on success or negative error number on failure.
>   */
>  int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> -			       struct ext4_inode *raw_inode, handle_t *handle)
> +			       struct ext4_inode *raw_inode, handle_t *handle,
> +			       struct ext4_xattr_inode_array **ea_inode_array)
>  {
>  	struct ext4_xattr_ibody_header *header;
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> @@ -2833,7 +2848,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>  
>  	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
>  					    isize_diff, ifree, bfree,
> -					    &total_ino);
> +					    &total_ino, ea_inode_array);
>  	if (error) {
>  		if (error == -ENOSPC && !tried_min_extra_isize &&
>  		    s_min_extra_isize) {
> @@ -2869,19 +2884,27 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
>  /* Add the large xattr @inode into @ea_inode_array for deferred iput().
>   * If @ea_inode_array is new or full it will be grown and the old
>   * contents copied over.
> + *
> + * If @inode is NULL this is a no-op.  If @ea_inode_array is NULL the
> + * caller guarantees SB_ACTIVE so synchronous iput is safe.
>   */
>  static int
>  ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
>  			struct inode *inode)
>  {
> +	if (!inode)
> +		return 0;
> +	if (!ea_inode_array) {
> +		iput(inode);
> +		return 0;
> +	}
>  	if (*ea_inode_array == NULL) {
>  		/*
>  		 * Start with 15 inodes, so it fits into a power-of-two size.
>  		 */
>  		(*ea_inode_array) = kmalloc_flex(**ea_inode_array, inodes,
> -						 EIA_MASK, GFP_NOFS);
> -		if (*ea_inode_array == NULL)
> -			return -ENOMEM;
> +						 EIA_MASK,
> +						 GFP_NOFS | __GFP_NOFAIL);
>  		(*ea_inode_array)->count = 0;
>  	} else if (((*ea_inode_array)->count & EIA_MASK) == EIA_MASK) {
>  		/* expand the array once all 15 + n * 16 slots are full */
> @@ -2889,9 +2912,7 @@ ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
>  
>  		new_array = kmalloc_flex(**ea_inode_array, inodes,
>  					 (*ea_inode_array)->count + EIA_INCR,
> -					 GFP_NOFS);
> -		if (new_array == NULL)
> -			return -ENOMEM;
> +					 GFP_NOFS | __GFP_NOFAIL);
>  		memcpy(new_array, *ea_inode_array,
>  		       struct_size(*ea_inode_array, inodes,
>  				   (*ea_inode_array)->count));
> diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
> index 1fedf44d4fb6..6771d00d3fa4 100644
> --- a/fs/ext4/xattr.h
> +++ b/fs/ext4/xattr.h
> @@ -179,7 +179,9 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
>  
>  extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
>  extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
> -extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
> +extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
> +		const void *, size_t, int,
> +		struct ext4_xattr_inode_array **ea_inode_array);
>  extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
>  				  bool is_create, int *credits);
>  extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
> @@ -192,7 +194,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>  extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
>  
>  extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
> -			    struct ext4_inode *raw_inode, handle_t *handle);
> +			    struct ext4_inode *raw_inode, handle_t *handle,
> +			    struct ext4_xattr_inode_array **ea_inode_array);
>  extern void ext4_evict_ea_inode(struct inode *inode);
>  
>  extern const struct xattr_handler * const ext4_xattr_handlers[];
> @@ -204,7 +207,8 @@ extern int ext4_xattr_ibody_get(struct inode *inode, int name_index,
>  				void *buffer, size_t buffer_size);
>  extern int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
>  				struct ext4_xattr_info *i,
> -				struct ext4_xattr_ibody_find *is);
> +				struct ext4_xattr_ibody_find *is,
> +				struct ext4_xattr_inode_array **ea_inode_array);
>  
>  extern struct mb_cache *ext4_xattr_create_cache(void);
>  extern void ext4_xattr_destroy_cache(struct mb_cache *);
> diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
> index 776cf11d24ca..6b7ab6e703ad 100644
> --- a/fs/ext4/xattr_security.c
> +++ b/fs/ext4/xattr_security.c
> @@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
>  		err = ext4_xattr_set_handle(handle, inode,
>  					    EXT4_XATTR_INDEX_SECURITY,
>  					    xattr->name, xattr->value,
> -					    xattr->value_len, XATTR_CREATE);
> +					    xattr->value_len, XATTR_CREATE,
> +					    NULL);
>  		if (err < 0)
>  			break;
>  	}
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v6 3/3] ext4: defer iput() on ea_inodes to reduce lock holding scope
From: Yun Zhou @ 2026-06-15 16:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615163038.1223293-1-yun.zhou@windriver.com>

ext4_xattr_block_set(), ext4_xattr_ibody_set() and ext4_xattr_set_entry()
call iput() on ea_inodes while xattr_sem and a jbd2 handle are held.
This creates an unnecessarily wide lock scope and, without the preceding
!SB_ACTIVE guard, could trigger write_inode_now() -> s_writepages_rwsem
creating a circular dependency.

Structurally eliminate iput-under-xattr_sem by collecting ea_inodes into
a deferred-iput array and releasing them after locks are dropped:

1. Convert all iput(ea_inode) calls within ext4_xattr_set_entry(),
   ext4_xattr_ibody_set(), and ext4_xattr_block_set() to deferred iput
   via ext4_expand_inode_array().  Thread the ea_inode_array through
   ext4_xattr_set_handle(), ext4_xattr_move_to_block(), and
   ext4_expand_extra_isize_ea() as an output parameter.

2. Callers that own the journal handle (ext4_xattr_set,
   ext4_expand_extra_isize, ext4_inline_data_truncate) free the array
   AFTER ext4_journal_stop().  For ext4_try_to_expand_extra_isize
   (called from __ext4_mark_inode_dirty with the caller's handle),
   free after xattr_sem release -- safe because the preceding patches
   block the !SB_ACTIVE path.

3. Callers that cannot control the handle lifetime (ext4_initxattrs,
   __ext4_set_acl, ext4_set_context, ext4_create_inline_data,
   ext4_update_inline_data, ext4_destroy_inline_data_nolock) pass NULL.
   ext4_expand_inode_array() falls back to synchronous iput() when
   ea_inode_array is NULL -- safe because these callers only run with
   SB_ACTIVE where iput cannot trigger write_inode_now().

4. Use __GFP_NOFAIL in ext4_expand_inode_array() to guarantee the
   deferred-iput array never fails to grow, eliminating any fallback
   to synchronous iput() under locks.

5. Pass ea_inode_array directly to ext4_xattr_release_block() in
   ext4_xattr_block_set() instead of using a local array freed
   synchronously under xattr_sem.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 +-
 fs/ext4/inline.c         | 10 +++--
 fs/ext4/inode.c          | 16 +++++--
 fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
 fs/ext4/xattr.h          | 10 +++--
 fs/ext4/xattr_security.c |  3 +-
 7 files changed, 87 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 3bffe862f954..21de8276b558 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -215,7 +215,7 @@ __ext4_set_acl(handle_t *handle, struct inode *inode, int type,
 	}
 
 	error = ext4_xattr_set_handle(handle, inode, name_index, "",
-				      value, size, xattr_flags);
+				      value, size, xattr_flags, NULL);
 
 	kfree(value);
 	if (!error)
diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index f41f320f4437..bca760751c1d 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -173,7 +173,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		res = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_ENCRYPTION,
 					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-					    ctx, len, XATTR_CREATE);
+					    ctx, len, XATTR_CREATE, NULL);
 		if (!res) {
 			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 			ext4_clear_inode_state(inode,
@@ -202,7 +202,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 
 	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
 				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
-				    ctx, len, 0);
+				    ctx, len, 0, NULL);
 	if (!res) {
 		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
 		/*
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..6e2b242f8062 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -309,7 +309,7 @@ static int ext4_create_inline_data(handle_t *handle,
 		goto out;
 	}
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error) {
 		if (error == -ENOSPC)
 			ext4_clear_inode_state(inode,
@@ -386,7 +386,7 @@ static int ext4_update_inline_data(handle_t *handle, struct inode *inode,
 	i.value = value;
 	i.value_len = len;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -469,7 +469,7 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
 	if (error)
 		goto out;
 
-	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, &is, NULL);
 	if (error)
 		goto out;
 
@@ -1847,6 +1847,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 	int inline_size, value_len, needed_blocks, no_expand, err = 0;
 	size_t i_size;
 	void *value = NULL;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	struct ext4_xattr_ibody_find is = {
 		.s = { .not_found = -ENODATA, },
 	};
@@ -1917,7 +1918,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 			i.value = value;
 			i.value_len = i_size > EXT4_MIN_INLINE_DATA_SIZE ?
 					i_size - EXT4_MIN_INLINE_DATA_SIZE : 0;
-			err = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			err = ext4_xattr_ibody_set(handle, inode, &i, &is, &ea_inode_array);
 			if (err)
 				goto out_error;
 		}
@@ -1950,6 +1951,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 			ext4_handle_sync(handle);
 	}
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return err;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 60c91c098fa0..a1b0f47f8f4f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6409,7 +6409,8 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
 static int __ext4_expand_extra_isize(struct inode *inode,
 				     unsigned int new_extra_isize,
 				     struct ext4_iloc *iloc,
-				     handle_t *handle, int *no_expand)
+				     handle_t *handle, int *no_expand,
+				     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_inode *raw_inode;
 	struct ext4_xattr_ibody_header *header;
@@ -6454,7 +6455,7 @@ static int __ext4_expand_extra_isize(struct inode *inode,
 
 	/* try to expand with EAs present */
 	error = ext4_expand_extra_isize_ea(inode, new_extra_isize,
-					   raw_inode, handle);
+					   raw_inode, handle, ea_inode_array);
 	if (error) {
 		/*
 		 * Inode size expansion failed; don't try again
@@ -6476,6 +6477,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 {
 	int no_expand;
 	int error;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
@@ -6507,8 +6509,11 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 		return -EBUSY;
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, &iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 	ext4_write_unlock_xattr(inode, &no_expand);
+	/* Safe with caller's handle active: !SB_ACTIVE is blocked above */
+	ext4_xattr_inode_array_free(ea_inode_array);
 
 	return error;
 }
@@ -6520,6 +6525,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 	handle_t *handle;
 	int no_expand;
 	int error, rc;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND)) {
 		brelse(iloc->bh);
@@ -6545,7 +6551,8 @@ int ext4_expand_extra_isize(struct inode *inode,
 	}
 
 	error = __ext4_expand_extra_isize(inode, new_extra_isize, iloc,
-					  handle, &no_expand);
+					  handle, &no_expand,
+					  &ea_inode_array);
 
 	rc = ext4_mark_iloc_dirty(handle, inode, iloc);
 	if (!error)
@@ -6554,6 +6561,7 @@ int ext4_expand_extra_isize(struct inode *inode,
 out_unlock:
 	ext4_write_unlock_xattr(inode, &no_expand);
 	ext4_journal_stop(handle);
+	ext4_xattr_inode_array_free(ea_inode_array);
 	return error;
 }
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..4eb83917e6b4 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1630,7 +1630,8 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 				struct ext4_xattr_search *s,
 				handle_t *handle, struct inode *inode,
 				struct inode *new_ea_inode,
-				bool is_block)
+				bool is_block,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_entry *last, *next;
 	struct ext4_xattr_entry *here = s->here;
@@ -1848,7 +1849,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 
 	ret = 0;
 out:
-	iput(old_ea_inode);
+	ext4_expand_inode_array(ea_inode_array, old_ea_inode);
 	return ret;
 }
 
@@ -1898,7 +1899,8 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
 static int
 ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		     struct ext4_xattr_info *i,
-		     struct ext4_xattr_block_find *bs)
+		     struct ext4_xattr_block_find *bs,
+		     struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct super_block *sb = inode->i_sb;
 	struct buffer_head *new_bh = NULL;
@@ -1961,7 +1963,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			}
 			ea_bdebug(bs->bh, "modifying in-place");
 			error = ext4_xattr_set_entry(i, s, handle, inode,
-					     ea_inode, true /* is_block */);
+					     ea_inode, true /* is_block */,
+					     ea_inode_array);
 			ext4_xattr_block_csum_set(inode, bs->bh);
 			unlock_buffer(bs->bh);
 			if (error == -EFSCORRUPTED)
@@ -2030,7 +2033,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	}
 
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     true /* is_block */);
+				     true /* is_block */, ea_inode_array);
 	if (error == -EFSCORRUPTED)
 		goto bad_block;
 	if (error)
@@ -2150,7 +2153,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_expand_inode_array(ea_inode_array,
+							ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2182,12 +2186,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
+					 ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
 	}
 	error = 0;
 
@@ -2203,7 +2204,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_expand_inode_array(ea_inode_array, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2253,7 +2254,8 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
 
 int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is)
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_xattr_search *s = &is->s;
@@ -2273,7 +2275,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 			return PTR_ERR(ea_inode);
 	}
 	error = ext4_xattr_set_entry(i, s, handle, inode, ea_inode,
-				     false /* is_block */);
+				     false /* is_block */, ea_inode_array);
 	if (error) {
 		if (ea_inode) {
 			int error2;
@@ -2285,7 +2287,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
-			iput(ea_inode);
+			ext4_expand_inode_array(ea_inode_array, ea_inode);
 		}
 		return error;
 	}
@@ -2297,7 +2299,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 		header->h_magic = cpu_to_le32(0);
 		ext4_clear_inode_state(inode, EXT4_STATE_XATTR);
 	}
-	iput(ea_inode);
+	ext4_expand_inode_array(ea_inode_array, ea_inode);
 	return 0;
 }
 
@@ -2348,7 +2350,8 @@ static struct buffer_head *ext4_xattr_get_block(struct inode *inode)
 int
 ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		      const char *name, const void *value, size_t value_len,
-		      int flags)
+		      int flags,
+		      struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_info i = {
 		.name_index = name_index,
@@ -2428,9 +2431,11 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 
 	if (!value) {
 		if (!is.s.not_found)
-			error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+			error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+						     ea_inode_array);
 		else if (!bs.s.not_found)
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 	} else {
 		error = 0;
 		/* Xattr value did not change? Save us some work and bail out */
@@ -2444,10 +2449,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 			EXT4_XATTR_MIN_LARGE_EA_SIZE(inode->i_sb->s_blocksize)))
 			i.in_inode = 1;
 retry_inode:
-		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
+		error = ext4_xattr_ibody_set(handle, inode, &i, &is,
+					     ea_inode_array);
 		if (!error && !bs.s.not_found) {
 			i.value = NULL;
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 		} else if (error == -ENOSPC) {
 			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
 				brelse(bs.bh);
@@ -2456,11 +2463,12 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 				if (error)
 					goto cleanup;
 			}
-			error = ext4_xattr_block_set(handle, inode, &i, &bs);
+			error = ext4_xattr_block_set(handle, inode, &i, &bs,
+						     ea_inode_array);
 			if (!error && !is.s.not_found) {
 				i.value = NULL;
 				error = ext4_xattr_ibody_set(handle, inode, &i,
-							     &is);
+							     &is, ea_inode_array);
 			} else if (error == -ENOSPC) {
 				/*
 				 * Xattr does not fit in the block, store at
@@ -2539,6 +2547,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 {
 	handle_t *handle;
 	struct super_block *sb = inode->i_sb;
+	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	int error, retries = 0;
 	int credits;
 
@@ -2559,10 +2568,13 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		int error2;
 
 		error = ext4_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, flags,
+					      &ea_inode_array);
 		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR,
 					handle);
 		error2 = ext4_journal_stop(handle);
+		ext4_xattr_inode_array_free(ea_inode_array);
+		ea_inode_array = NULL;
 		if (error == -ENOSPC &&
 		    ext4_should_retry_alloc(sb, &retries))
 			goto retry;
@@ -2604,7 +2616,8 @@ static void ext4_xattr_shift_entries(struct ext4_xattr_entry *entry,
  */
 static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 				    struct ext4_inode *raw_inode,
-				    struct ext4_xattr_entry *entry)
+				    struct ext4_xattr_entry *entry,
+				    struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_find *is = NULL;
 	struct ext4_xattr_block_find *bs = NULL;
@@ -2668,14 +2681,14 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Move ea entry from the inode into the block */
-	error = ext4_xattr_block_set(handle, inode, &i, bs);
+	error = ext4_xattr_block_set(handle, inode, &i, bs, ea_inode_array);
 	if (error)
 		goto out;
 
 	/* Remove the chosen entry from the inode */
 	i.value = NULL;
 	i.value_len = 0;
-	error = ext4_xattr_ibody_set(handle, inode, &i, is);
+	error = ext4_xattr_ibody_set(handle, inode, &i, is, ea_inode_array);
 
 out:
 	kfree(b_entry_name);
@@ -2694,7 +2707,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 				       struct ext4_inode *raw_inode,
 				       int isize_diff, size_t ifree,
-				       size_t bfree, int *total_ino)
+				       size_t bfree, int *total_ino,
+				       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header = IHDR(inode, raw_inode);
 	struct ext4_xattr_entry *small_entry;
@@ -2744,7 +2758,7 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
 			total_size += EXT4_XATTR_SIZE(
 					      le32_to_cpu(entry->e_value_size));
 		error = ext4_xattr_move_to_block(handle, inode, raw_inode,
-						 entry);
+						 entry, ea_inode_array);
 		if (error)
 			return error;
 
@@ -2761,7 +2775,8 @@ static int ext4_xattr_make_inode_space(handle_t *handle, struct inode *inode,
  * Returns 0 on success or negative error number on failure.
  */
 int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			       struct ext4_inode *raw_inode, handle_t *handle)
+			       struct ext4_inode *raw_inode, handle_t *handle,
+			       struct ext4_xattr_inode_array **ea_inode_array)
 {
 	struct ext4_xattr_ibody_header *header;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2833,7 +2848,7 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 
 	error = ext4_xattr_make_inode_space(handle, inode, raw_inode,
 					    isize_diff, ifree, bfree,
-					    &total_ino);
+					    &total_ino, ea_inode_array);
 	if (error) {
 		if (error == -ENOSPC && !tried_min_extra_isize &&
 		    s_min_extra_isize) {
@@ -2869,19 +2884,27 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 /* Add the large xattr @inode into @ea_inode_array for deferred iput().
  * If @ea_inode_array is new or full it will be grown and the old
  * contents copied over.
+ *
+ * If @inode is NULL this is a no-op.  If @ea_inode_array is NULL the
+ * caller guarantees SB_ACTIVE so synchronous iput is safe.
  */
 static int
 ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 			struct inode *inode)
 {
+	if (!inode)
+		return 0;
+	if (!ea_inode_array) {
+		iput(inode);
+		return 0;
+	}
 	if (*ea_inode_array == NULL) {
 		/*
 		 * Start with 15 inodes, so it fits into a power-of-two size.
 		 */
 		(*ea_inode_array) = kmalloc_flex(**ea_inode_array, inodes,
-						 EIA_MASK, GFP_NOFS);
-		if (*ea_inode_array == NULL)
-			return -ENOMEM;
+						 EIA_MASK,
+						 GFP_NOFS | __GFP_NOFAIL);
 		(*ea_inode_array)->count = 0;
 	} else if (((*ea_inode_array)->count & EIA_MASK) == EIA_MASK) {
 		/* expand the array once all 15 + n * 16 slots are full */
@@ -2889,9 +2912,7 @@ ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 
 		new_array = kmalloc_flex(**ea_inode_array, inodes,
 					 (*ea_inode_array)->count + EIA_INCR,
-					 GFP_NOFS);
-		if (new_array == NULL)
-			return -ENOMEM;
+					 GFP_NOFS | __GFP_NOFAIL);
 		memcpy(new_array, *ea_inode_array,
 		       struct_size(*ea_inode_array, inodes,
 				   (*ea_inode_array)->count));
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..6771d00d3fa4 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -179,7 +179,9 @@ extern ssize_t ext4_listxattr(struct dentry *, char *, size_t);
 
 extern int ext4_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext4_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
-extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
+extern int ext4_xattr_set_handle(handle_t *, struct inode *, int, const char *,
+		const void *, size_t, int,
+		struct ext4_xattr_inode_array **ea_inode_array);
 extern int ext4_xattr_set_credits(struct inode *inode, size_t value_len,
 				  bool is_create, int *credits);
 extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
@@ -192,7 +194,8 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
-			    struct ext4_inode *raw_inode, handle_t *handle);
+			    struct ext4_inode *raw_inode, handle_t *handle,
+			    struct ext4_xattr_inode_array **ea_inode_array);
 extern void ext4_evict_ea_inode(struct inode *inode);
 
 extern const struct xattr_handler * const ext4_xattr_handlers[];
@@ -204,7 +207,8 @@ extern int ext4_xattr_ibody_get(struct inode *inode, int name_index,
 				void *buffer, size_t buffer_size);
 extern int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 				struct ext4_xattr_info *i,
-				struct ext4_xattr_ibody_find *is);
+				struct ext4_xattr_ibody_find *is,
+				struct ext4_xattr_inode_array **ea_inode_array);
 
 extern struct mb_cache *ext4_xattr_create_cache(void);
 extern void ext4_xattr_destroy_cache(struct mb_cache *);
diff --git a/fs/ext4/xattr_security.c b/fs/ext4/xattr_security.c
index 776cf11d24ca..6b7ab6e703ad 100644
--- a/fs/ext4/xattr_security.c
+++ b/fs/ext4/xattr_security.c
@@ -44,7 +44,8 @@ ext4_initxattrs(struct inode *inode, const struct xattr *xattr_array,
 		err = ext4_xattr_set_handle(handle, inode,
 					    EXT4_XATTR_INDEX_SECURITY,
 					    xattr->name, xattr->value,
-					    xattr->value_len, XATTR_CREATE);
+					    xattr->value_len, XATTR_CREATE,
+					    NULL);
 		if (err < 0)
 			break;
 	}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 0/3] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Yun Zhou @ 2026-06-15 16:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel

This series fixes a circular lock dependency reported by syzbot:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

The deadlock occurs when iput() on an ea_inode triggers write_inode_now()
while xattr_sem and a jbd2 handle are held.  The triggering path is
during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
write_inode_now() synchronously.

Patch 1 blocks the deadlock by skipping extra isize expansion when
!SB_ACTIVE -- this prevents the xattr manipulation path from being
entered during mount.

Patch 2 is a belt-and-suspenders semantic improvement: an inode under
eviction never needs extra isize expansion.

Patch 3 is a structural improvement: defer all ea_inode iput() calls
until after xattr_sem is released, reducing lock nesting depth and
preventing future code paths from reintroducing the lock ordering issue.

Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5

v6:
 - ext4_inline_data_truncate(): use local ea_inode_array instead of
   passing NULL, freed after ext4_journal_stop().  Fixes a deadlock
   reachable via crafted filesystem where inline data xattr entry has
   e_value_inum set: orphan cleanup -> ext4_truncate ->
   ext4_inline_data_truncate -> iput under !SB_ACTIVE.

v5:
 - Split into 3 patches for easier review.
 - Add explicit !SB_ACTIVE early-return in ext4_try_to_expand_extra_isize()
   to block ALL mount-time paths (ext4_process_orphan -> ext4_truncate ->
   ext4_mark_inode_dirty), not just the eviction path. v4 only relied on
   EXT4_STATE_NO_EXPAND which doesn't cover orphan truncation.

v4:
 - Comprehensive rewrite of the deferred iput mechanism.
 - Thread ea_inode_array through ext4_expand_extra_isize_ea() and
   ext4_xattr_move_to_block() so ALL ea_inode iputs in the expand
   path are deferred, not just those in ext4_xattr_block_set().
 - Add NULL safety to ext4_expand_inode_array(): when ea_inode_array
   pointer is NULL, fall back to synchronous iput (for callers like
   ext4_initxattrs that only run with SB_ACTIVE).
 - Use __GFP_NOFAIL to guarantee deferred array growth, eliminating
   fallback to synchronous iput under locks.
 - Update ext4_xattr_ibody_set() and ext4_xattr_set_entry() signatures
   to accept ea_inode_array, converting ALL iput(ea_inode) calls.
 - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() before
   ext4_mark_inode_dirty().

v3:
 - Check ext4_expand_inode_array() return value; fallback to
   direct iput() on ENOMEM to prevent inode leak.
 - Make ext4_xattr_set_handle() take an optional ea_inode_array
   output parameter so callers can free after ext4_journal_stop(),
   avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
 - Pass ea_inode_array directly to ext4_xattr_release_block()
   instead of using a local array freed under xattr_sem.
 - Move ext4_xattr_inode_array_free() after ext4_journal_stop()

v2:
 - Defer iput() in ext4_xattr_block_set() via ea_inode_array,
   freed after xattr_sem is released. Fixes the root cause.

v1:
 - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
   on inodes being deleted. Only fixes the syzbot reproducer, not
   the underlying lock ordering violation.

Yun Zhou (3):
  ext4: skip extra isize expansion during mount to prevent deadlock
  ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
  ext4: defer iput() on ea_inodes to reduce lock holding scope

 fs/ext4/acl.c            |  2 +-
 fs/ext4/crypto.c         |  4 +-
 fs/ext4/inline.c         | 10 +++--
 fs/ext4/inode.c          | 27 ++++++++++--
 fs/ext4/xattr.c          | 93 ++++++++++++++++++++++++----------------
 fs/ext4/xattr.h          | 10 +++--
 fs/ext4/xattr_security.c |  3 +-
 7 files changed, 98 insertions(+), 51 deletions(-)

-- 
2.43.0

^ permalink raw reply

* [PATCH v6 2/3] ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
From: Yun Zhou @ 2026-06-15 16:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615163038.1223293-1-yun.zhou@windriver.com>

An inode being evicted will never need its extra isize expanded.  Set
EXT4_STATE_NO_EXPAND before ext4_mark_inode_dirty() in ext4_evict_inode()
to make this explicit and prevent any unnecessary work in
ext4_try_to_expand_extra_isize().

This also provides defense-in-depth for the s_writepages_rwsem deadlock
during mount-time orphan cleanup, ensuring the expand path is blocked
for inodes under eviction regardless of how they are reached.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be1e3eaa4f23..60c91c098fa0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,7 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
-- 
2.43.0

^ permalink raw reply related

* [PATCH v6 1/3] ext4: skip extra isize expansion during mount to prevent deadlock
From: Yun Zhou @ 2026-06-15 16:30 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, yun.zhou
  Cc: linux-ext4, linux-kernel
In-Reply-To: <20260615163038.1223293-1-yun.zhou@windriver.com>

ext4_try_to_expand_extra_isize() is called from __ext4_mark_inode_dirty()
while holding an active jbd2 handle.  During mount (!SB_ACTIVE), the
expand path may move xattrs to external blocks and release ea_inodes via
iput().  When !SB_ACTIVE, iput() calls write_inode_now() which acquires
s_writepages_rwsem, creating a circular lock dependency:

  s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem

This can be triggered via:

  ext4_process_orphan() -> ext4_truncate() -> ext4_mark_inode_dirty()
    -> ext4_try_to_expand_extra_isize()

or:

  ext4_evict_inode() -> ext4_mark_inode_dirty()
    -> ext4_try_to_expand_extra_isize()

Skip expansion when !SB_ACTIVE.  This is a minor loss of functionality
(extra isize won't grow for these inodes during mount), which e2fsck
can resolve later if needed.

Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd7588a3fa45..be1e3eaa4f23 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6479,6 +6479,16 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
 		return -EOVERFLOW;
 
+	/*
+	 * Skip expansion during mount (!SB_ACTIVE).  Expanding extra isize
+	 * may move xattrs to external blocks and release ea_inodes via iput.
+	 * When !SB_ACTIVE, iput triggers write_inode_now() which acquires
+	 * s_writepages_rwsem, causing a deadlock with the caller's active
+	 * jbd2 handle (lock order: s_writepages_rwsem -> jbd2_handle).
+	 */
+	if (unlikely(!(inode->i_sb->s_flags & SB_ACTIVE)))
+		return -EBUSY;
+
 	/*
 	 * In nojournal mode, we can immediately attempt to expand
 	 * the inode.  When journaled, we first need to obtain extra
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v5 2/3] ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
From: Jan Kara @ 2026-06-15 16:28 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-3-yun.zhou@windriver.com>

On Mon 15-06-26 19:53:16, Yun Zhou wrote:
> An inode being evicted will never need its extra isize expanded.  Set
> EXT4_STATE_NO_EXPAND before ext4_mark_inode_dirty() in ext4_evict_inode()
> to make this explicit and prevent any unnecessary work in
> ext4_try_to_expand_extra_isize().
> 
> This also provides defense-in-depth for the s_writepages_rwsem deadlock
> during mount-time orphan cleanup, ensuring the expand path is blocked
> for inodes under eviction regardless of how they are reached.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Looks good. I'd move the setting somewhat earlier (just before
dquot_initialize()) so that it obviously covers all the inode dirtying that
can happen while deleting the inode. But otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index be1e3eaa4f23..60c91c098fa0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -264,6 +264,7 @@ void ext4_evict_inode(struct inode *inode)
>  	if (ext4_inode_is_fast_symlink(inode))
>  		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
>  	inode->i_size = 0;
> +	ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
>  	err = ext4_mark_inode_dirty(handle, inode);
>  	if (err) {
>  		ext4_warning(inode->i_sb,
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v5 1/3] ext4: skip extra isize expansion during mount to prevent deadlock
From: Jan Kara @ 2026-06-15 16:20 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615115317.3549469-2-yun.zhou@windriver.com>

On Mon 15-06-26 19:53:15, Yun Zhou wrote:
> ext4_try_to_expand_extra_isize() is called from __ext4_mark_inode_dirty()
> while holding an active jbd2 handle.  During mount (!SB_ACTIVE), the
> expand path may move xattrs to external blocks and release ea_inodes via
> iput().  When !SB_ACTIVE, iput() calls write_inode_now() which acquires
> s_writepages_rwsem, creating a circular lock dependency:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> This can be triggered via:
> 
>   ext4_process_orphan() -> ext4_truncate() -> ext4_mark_inode_dirty()
>     -> ext4_try_to_expand_extra_isize()
> 
> or:
> 
>   ext4_evict_inode() -> ext4_mark_inode_dirty()
>     -> ext4_try_to_expand_extra_isize()
> 
> Skip expansion when !SB_ACTIVE.  This is a minor loss of functionality
> (extra isize won't grow for these inodes during mount), which e2fsck
> can resolve later if needed.
> 
> Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Yeah, looks as a sensible compromise. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cd7588a3fa45..be1e3eaa4f23 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6479,6 +6479,16 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
>  	if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
>  		return -EOVERFLOW;
>  
> +	/*
> +	 * Skip expansion during mount (!SB_ACTIVE).  Expanding extra isize
> +	 * may move xattrs to external blocks and release ea_inodes via iput.
> +	 * When !SB_ACTIVE, iput triggers write_inode_now() which acquires
> +	 * s_writepages_rwsem, causing a deadlock with the caller's active
> +	 * jbd2 handle (lock order: s_writepages_rwsem -> jbd2_handle).
> +	 */
> +	if (unlikely(!(inode->i_sb->s_flags & SB_ACTIVE)))
> +		return -EBUSY;
> +
>  	/*
>  	 * In nojournal mode, we can immediately attempt to expand
>  	 * the inode.  When journaled, we first need to obtain extra
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox