Linux EXT4 FS development

Linux EXT4 FS development
 help / color / mirror / Atom feed

* [GIT PULL] ext4 changes for 7.2-rc1
From: Theodore Ts'o @ 2026-06-18 13:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Developers List, Ext4 Developers List

The following changes since commit 5200f5f493f79f14bbdc349e402a40dfb32f23c8:

  Linux 7.1-rc4 (2026-05-17 13:59:58 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus-7.2-rc1

for you to fetch changes up to c143957520c6c9b5cd72e0de8b52b814f0c576fe:

  ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT (2026-06-10 10:53:50 -0400)

----------------------------------------------------------------
Various ext4 updates for 7.2-rc1:

* A major rework of the fast commit mechanism to avoid lock
  contention and deadlocks.  We also export snapshot statistics
  in /proc/fs/ext4/*/fc_info.
* Performance optimization for directory hash computation by
  processing input in 4-byte chunks and removing function pointers,
  along with new KUnit tests for directory hash.
* Cleanups in JBD2 to remove special slabs and use kmalloc() instead.
* Various bug fixes, including:
   - Early validation of donor superblock in EXT4_IOC_MOVE_EXT to avoid
     cross-fs deadlock
   - Fix for a kernel BUG in ext4_write_inline_data_end under
     data=journal
   - Fix for a NULL dereference in jbd2_journal_dirty_metadata when
     handle is aborted
   - Fix for an underflow in JBD2 fast commit block initialization check
   - Fix for LOGFLUSH shutdown ordering to ensure ordered data writeback
   - Miscellaneous fixes for error path return values and KUnit assertions.

----------------------------------------------------------------
Abdellah Ouhbi (1):
      ext4: Use %pe to print PTR_ERR()

Aditya Prakash Srivastava (1):
      ext4: fix kernel BUG in ext4_write_inline_data_end

Deepanshu Kartikey (1):
      jbd2: check for aborted handle in jbd2_journal_dirty_metadata()

Guan-Chun Wu (2):
      ext4: add Kunit coverage for directory hash computation
      ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers

Hongling Zeng (1):
      ext4: fix ERR_PTR(0) in ext4_mkdir()

Junrui Luo (1):
      jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()

Li Chen (8):
      ext4: fix fast commit wait/wake bit mapping on 64-bit
      ext4: fast commit: snapshot inode state before writing log
      ext4: lockdep: handle i_data_sem subclassing for special inodes
      ext4: fast commit: avoid waiting for FC_COMMITTING
      ext4: fast commit: avoid self-deadlock in inode snapshotting
      ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
      ext4: fast commit: add lock_updates tracepoint
      ext4: fast commit: export snapshot stats in fc_info

Matthew Wilcox (Oracle) (2):
      ext4: remove mention of PageWriteback
      jbd2: remove special jbd2 slabs

Ryota Sakamoto (1):
      ext4: replace KUnit tests for memcmp() with KUNIT_ASSERT_MEMEQ()

Yun Zhou (1):
      ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT

Zhang Yi (1):
      ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback

 fs/ext4/Makefile            |   2 +-
 fs/ext4/ext4.h              |  93 ++++-
 fs/ext4/extents.c           |   4 +-
 fs/ext4/fast_commit.c       | 784 ++++++++++++++++++++++++++++++++---------
 fs/ext4/hash-test.c         | 567 +++++++++++++++++++++++++++++
 fs/ext4/hash.c              |  68 ++--
 fs/ext4/inode.c             |  54 ++-
 fs/ext4/ioctl.c             |  15 +-
 fs/ext4/mballoc-test.c      |   9 +-
 fs/ext4/namei.c             |   6 +-
 fs/ext4/page-io.c           |   2 +-
 fs/ext4/super.c             |  13 +-
 fs/jbd2/commit.c            |   8 +-
 fs/jbd2/journal.c           | 127 +------
 fs/jbd2/transaction.c       |  17 +-
 include/linux/jbd2.h        |   3 -
 include/trace/events/ext4.h |  61 ++++
 17 files changed, 1495 insertions(+), 338 deletions(-)
 create mode 100644 fs/ext4/hash-test.c

^ permalink raw reply

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Jan Kara @ 2026-06-18 12:59 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, djwong, hch, yi.zhang,
	yizhang089, yangerkun, yukuai
In-Reply-To: <20260511072344.191271-18-yi.zhang@huaweicloud.com>

On Mon 11-05-26 15:23:37, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> In the generic buffered_head I/O path, we rely on the data=order mode to
> ensure that the zeroed EOF block data is written before updating
> i_disksize, thus preventing stale data from being exposed.
> 
> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> we issue the I/O immediately after performing the zero operation
> (without synchronous waiting for performance). This can reduce the risk
> of exposing stale data, but it does not guarantee that the zero data
> will be flushed to disk before the metadata of i_disksize is updated.
> The subsequent patches will wait for this I/O to complete before
> updating i_disksize.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Two nits below:

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 239d387ffaf2..e013aeb03d7b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>  					zero_written);
>  }
>  
> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> +					loff_t from, loff_t end)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	struct folio *folio;
> +	bool do_submit = false;
> +
> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		/* Already writeback and clear? */
		   ^^^ Already written back and reclaimed

> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> +
> +	folio_wait_writeback(folio);
> +	WARN_ON_ONCE(folio_test_writeback(folio));
> +
> +	if (likely(folio_test_dirty(folio)))
> +		do_submit = true;
> +	folio_unlock(folio);
> +	folio_put(folio);

So how is what you do here more efficient than just:

	filemap_fdatawrite_range(mapping, from, end - 1)

? That will also do nothing if the folio isn't dirty, won't it?

								Honza

> +
> +	/* Submit zeroed block. */
> +	if (do_submit)
> +		return filemap_fdatawrite_range(mapping, from, end - 1);
> +	return 0;
> +}
> +
>  /*
>   * Zero out a mapping from file offset 'from' up to the end of the block
>   * which corresponds to 'from' or to the given 'end' inside this block.
> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>  		return 0;
>  
> -	if (length > blocksize - offset)
> +	if (length > blocksize - offset) {
>  		length = blocksize - offset;
> +		end = from + length;
> +	}
>  
>  	err = ext4_block_zero_range(inode, from, length,
>  				    &did_zero, &zero_written);
> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	 * TODO: In the iomap path, handle this by updating i_disksize to
>  	 * i_size after the zeroed data has been written back.
>  	 */
> -	if (ext4_should_order_data(inode) &&
> -	    did_zero && zero_written && !IS_DAX(inode)) {
> -		handle_t *handle;
> +	if (did_zero && zero_written && !IS_DAX(inode)) {
> +		if (ext4_should_order_data(inode)) {
> +			handle_t *handle;
>  
> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> -		if (IS_ERR(handle))
> -			return PTR_ERR(handle);
> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> +			if (IS_ERR(handle))
> +				return PTR_ERR(handle);
>  
> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> -		ext4_journal_stop(handle);
> -		if (err)
> -			return err;
> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> +							length);
> +			ext4_journal_stop(handle);
> +			if (err)
> +				return err;
> +		/*
> +		 * inodes using the iomap buffered I/O path do not use the
> +		 * data=ordered mode. We submit zeroed range directly here.
> +		 * Do not wait for I/O completion for performance.
> +		 *
> +		 * TODO: Any operation that extends i_disksize (including
> +		 * append write end io past the zeroed boundary, truncate up,
> +		 * and append fallocate) must wait for the relevant I/O to
> +		 * complete before updating i_disksize.
> +		 */
> +		} else if (ext4_inode_buffered_iomap(inode)) {
> +			err = ext4_iomap_submit_zero_block(inode, from, end);
> +			if (err)
> +				return err;
> +		}
>  	}
>  
>  	return 0;
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v2 8/8] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only lookup
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Add a nowait parameter to ext4_dio_needs_zeroing() and pass
EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
IOCB_NOWAIT is set. This ensures the needs_zeroing check only uses
cached extent info. If cache misses, ext4_map_blocks() returns
-EAGAIN, causing ext4_dio_needs_zeroing() to conservatively return
true (needs zeroing). The caller then tries to upgrade to exclusive
lock, which returns -EAGAIN for NOWAIT, avoiding potential sleep on
down_read(i_data_sem).

The caller in ext4_dio_write_checks() is updated to pass the
IOCB_NOWAIT flag from the kiocb.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5ffc1afd8050..44d1658d2b5a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,7 +228,8 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
  * unwritten conversion for middle blocks are protected by i_data_sem
  * and inode_dio_begin().
  */
-static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len,
+				   bool nowait)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
@@ -236,10 +237,14 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 	bool head_partial, tail_partial;
 	ext4_lblk_t head_lblk, tail_lblk;
 	int err;
+	int map_flags = 0;
 
 	if (pos + len > i_size_read(inode))
 		return true;
 
+	if (nowait)
+		map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
+
 	head_partial = (pos & blockmask) != 0;
 	tail_partial = ((pos + len) & blockmask) != 0;
 	head_lblk = pos >> blkbits;
@@ -249,7 +254,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 	if (head_partial) {
 		map.m_lblk = head_lblk;
 		map.m_len = tail_lblk - head_lblk + 1;
-		err = ext4_map_blocks(NULL, inode, &map, 0);
+		err = ext4_map_blocks(NULL, inode, &map, map_flags);
 		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
 			return true;
 		/* If this mapping already covers the tail block, we're done. */
@@ -261,7 +266,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 	if (tail_partial) {
 		map.m_lblk = tail_lblk;
 		map.m_len = 1;
-		err = ext4_map_blocks(NULL, inode, &map, 0);
+		err = ext4_map_blocks(NULL, inode, &map, map_flags);
 		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
 			return true;
 	}
@@ -516,7 +521,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * under shared lock is safe.
 	 */
 	if (ext4_unaligned_io(inode, from, offset))
-		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
+		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count,
+						iocb->ki_flags & IOCB_NOWAIT);
 
 	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 7/8] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Pass EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
IOMAP_NOWAIT is set, ensuring that extent lookups only use the cached
extent status tree. If the cache misses, ext4_map_blocks() returns
-EAGAIN instead of sleeping on down_read(i_data_sem) to read extent
tree from disk.

This applies to both write and read paths in ext4_iomap_begin(),
allowing DIO/DAX operations with RWF_NOWAIT to avoid blocking on
extent tree lookups.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/inode.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03adbca3ec78..09f85cd6c118 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3781,6 +3781,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
 	unsigned int orig_mlen;
+	int map_flags = 0;
 
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3795,6 +3796,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
 	orig_mlen = map.m_len;
+	/*
+	 * In NOWAIT context, only use cached extent info. If es cache misses,
+	 * return -EAGAIN to avoid sleeping on down_read(i_data_sem).
+	 */
+	if (flags & IOMAP_NOWAIT)
+		map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
 
 	if (flags & IOMAP_WRITE) {
 		/*
@@ -3804,7 +3811,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 * especially in multi-threaded overwrite requests.
 		 */
 		if (offset + length <= i_size_read(inode)) {
-			ret = ext4_map_blocks(NULL, inode, &map, 0);
+			ret = ext4_map_blocks(NULL, inode, &map, map_flags);
 			/*
 			 * For DAX we convert extents to initialized ones before
 			 * copying the data, otherwise we do it after I/O so
@@ -3825,7 +3832,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		}
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	} else {
-		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		ret = ext4_map_blocks(NULL, inode, &map, map_flags);
 	}
 
 	if (ret < 0)
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 6/8] ext4: return -EAGAIN from ext4_map_blocks() in NOWAIT cache miss
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Make ext4_map_blocks() return -EAGAIN instead of 0 when
EXT4_GET_BLOCKS_CACHED_NOWAIT is set and the extent status cache
misses. This allows callers to easily distinguish between a successful
cache lookup (positive return value) and a cache miss requiring disk
access (-EAGAIN), simplifying error handling in NOWAIT paths.

The change affects two locations:
1. After cache hit: return retval ? retval : -EAGAIN
   (return -EAGAIN if cache hit is hole/delayed)
2. After cache miss: return -EAGAIN
   (instead of 0, indicating need for disk lookup)

The only existing caller using EXT4_GET_BLOCKS_CACHED_NOWAIT is the
ext4_get_link() -> ext4_getblk() path. Although ext4_getblk() now
takes a different return branch (err < 0 instead of err == 0) and
propagates -EAGAIN instead of NULL, ext4_get_link() converts both
cases to -ECHILD via IS_ERR_OR_NULL(), so the final error seen by
the VFS remains unchanged.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 832794294ccf..03adbca3ec78 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -760,7 +760,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		}

 		if (flags & EXT4_GET_BLOCKS_CACHED_NOWAIT)
-			return retval;
+			return retval ? retval : -EAGAIN;
+
 #ifdef ES_AGGRESSIVE_TEST
 		ext4_map_blocks_es_recheck(handle, inode, map,
 					   &orig_map, flags);
@@ -776,7 +777,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * cannot find extent in the cache.
 	 */
 	if (flags & EXT4_GET_BLOCKS_CACHED_NOWAIT)
-		return 0;
+		return -EAGAIN;

 	/*
 	 * Try to see if we can get the block without requesting a new
-- 
2.43.7

^ permalink raw reply related

* [PATCH v2 4/8] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent.  This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.

Therefore exclusive lock is only required when partial block zeroing
will actually happen:
 - The head partial block (if any) lands on a hole or unwritten extent.
 - The tail partial block (if any) lands on a hole or unwritten extent.

Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).

Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision.  It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.

This enables shared lock for previously-rejected scenarios such as:
 - Unaligned write spanning written extent + mid-range hole + written
   extent at the tail.
 - Unaligned write where the partial blocks land on written extents but
   the middle has unwritten extents.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten.  Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      15,547       17,381      1.12x
     2      15,910       34,172      2.15x
     4      15,014       57,567      3.83x
     8      15,022       81,947      5.46x
    16      14,586       99,126      6.80x
    32      14,047       92,519      6.59x

Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 73 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 886b73247aab..2681f148e7b8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
 	return false;
 }
 
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
-			      loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
-	int err, blklen;
+	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+	bool head_partial, tail_partial;
+	ext4_lblk_t head_lblk, tail_lblk;
+	int err;
 
 	if (pos + len > i_size_read(inode))
-		return false;
+		return true;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	head_partial = (pos & blockmask) != 0;
+	tail_partial = ((pos + len) & blockmask) != 0;
+	head_lblk = pos >> blkbits;
+	tail_lblk = (pos + len - 1) >> blkbits;
+
+	/* Check the head partial block. */
+	if (head_partial) {
+		map.m_lblk = head_lblk;
+		map.m_len = tail_lblk - head_lblk + 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+		/* If this mapping already covers the tail block, we're done. */
+		if (!tail_partial || map.m_lblk + err > tail_lblk)
+			return false;
+	}
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
-	return true;
+	/* Check the tail partial block. */
+	if (tail_partial) {
+		map.m_lblk = tail_lblk;
+		map.m_len = 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+	}
+
+	return false;
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -452,9 +481,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  *    i_data_sem serializes concurrent extent tree modifications.
  *
  * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- *    only safe for pure written-extent overwrites. Unwritten extents or
- *    holes require exclusive lock because concurrent partial block zeroing
- *    in the DIO layer could corrupt data.
+ *    safe unless the DIO layer needs to perform partial block zeroing --
+ *    i.e. the head or tail partial block sits on a hole or unwritten
+ *    extent. In that case upgrade to the exclusive lock and drain
+ *    in-flight DIO to avoid races with concurrent partial block zeroing.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -465,7 +495,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite = true, unaligned_io, unwritten = false;
+	bool needs_zeroing = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -475,21 +505,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	offset = iocb->ki_pos;
 	count = ret;
 
-	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
 
 	/*
-	 * For unaligned writes we need to know the extent state to determine
-	 * whether shared lock is safe. For aligned writes we skip this check
-	 * entirely since allocation under shared lock is safe.
+	 * For unaligned writes, check whether partial block zeroing will be
+	 * needed. If so, exclusive lock is required to serialize against
+	 * concurrent DIO that could race with the zeroing.
+	 *
+	 * For aligned writes we skip this check entirely since allocation
+	 * under shared lock is safe.
 	 */
-	if (unaligned_io)
-		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+	if (ext4_unaligned_io(inode, from, offset))
+		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
 
 	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend ||
-	     (unaligned_io && (!overwrite || unwritten))))) {
+	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -503,16 +534,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	/*
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
-	 * behavior already. The inode lock is already held exclusive if the
-	 * write is unaligned non-overwrite or extending, so drain all
-	 * outstanding dio and set the force wait dio flag.
+	 * behavior already. When holding the exclusive lock for a write that
+	 * needs partial block zeroing or is extending the file, we must wait
+	 * for the I/O to complete synchronously:
+	 *
+	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
+	 *    our partial block zeroing, and force synchronous completion so we
+	 *    don't leave in-flight zeroing bios for the next writer to drain.
+	 *
+	 *  - extend: the caller must update i_disksize after I/O completion,
+	 *    which requires the data to be on disk first.
 	 */
-	if (!*ilock_shared && (unaligned_io || *extend)) {
+	if (!*ilock_shared && (needs_zeroing || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
 		}
-		if (unaligned_io && (!overwrite || unwritten))
+		if (needs_zeroing)
 			inode_dio_wait(inode);
 		*dio_flags = IOMAP_DIO_FORCE_WAIT;
 	}
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 0/8] ext4: allow more DIO writes under shared i_rwsem
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

Changes since v1:
  * Collect RVB from Honza and Yi. (Thank you for your review!)
  * Added Patch 1 to fix NOWAIT issues reported by Sashiko.
  * Added Patch 2 to fix ext3 DIO and DIO fallback data race issue.
    (Patch 4 increases the probability of this race)
  * Added Patches 5-8 to fix other NOWAIT issues discovered during
    investigation.

v1: https://patch.msgid.link/20260611163441.2431805-1-libaokun@linux.alibaba.com

======

Hi all,

This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.

It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.

ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):

  1. Aligned writes spanning multiple already-allocated extents (e.g.
     written + unwritten, or two discontiguous written extents).

  2. Unaligned writes whose head/tail partial blocks land on written
     extents but the fully-covered middle blocks include hole or
     unwritten extents.

Patch 1 fixes a NOWAIT issue where ext4_iomap_alloc() may sleep when
IOMAP_NOWAIT is set.

Patch 2 fixes a data race between DIO completion and buffered I/O
fallback on ext3 (no-extent inodes). This race was made more likely
by Patch 4.

Patch 3 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.

Patch 4 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.

Patch 5 fixes a NOWAIT issue by using kiocb_modified instead of
file_modified in DIO/DAX write paths.

Patch 6 makes ext4_map_blocks() to return -EAGAIN instead of 0 when
EXT4_GET_BLOCKS_CACHED_NOWAIT is set and cache lookup misses.

Patch 7 adds cache-only lookup support to ext4_iomap_begin() for
IOMAP_NOWAIT requests.

Patch 8 adds cache-only lookup support to ext4_dio_needs_zeroing()
for IOCB_NOWAIT requests.

Testing
=======

"kvm-xfstests -c ext4/all -g auto" passes with no new failures.

Performance
===========

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.

  JOBS         base    +patch 3    +patch 3+4    speedup
  ----    ---------    --------    ----------    -------
     1       42,322      43,329        43,087      1.02x
     2       68,516      70,677        66,958      1.03x
     4       62,489      97,072       101,468      1.62x
     8       58,701     110,819       113,679      1.94x
    16       58,569     116,392       115,272      1.97x
    32       60,860     117,244       119,621      1.97x

Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.

Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.

  JOBS         base    +patch 3    +patch 3+4    speedup
  ----    ---------    --------    ----------    -------
     1       15,547      15,975        17,381      1.12x
     2       15,910      14,808        34,172      2.15x
     4       15,014      14,828        57,567      3.83x
     8       15,022      14,648        81,947      5.46x
    16       14,586      14,262        99,126      6.80x
    32       14,047      13,809        92,519      6.59x

Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.

In test 2, patch 3 alone has no effect (slight noise) because patch 3
only touches the aligned write path. Patch 4 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.

Comments and questions are, as always, welcome.

Thanks,
Baokun

[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com

Baokun Li (8):
  ext4: prevent sleeping allocation in NOWAIT write path
  ext4: drain in-flight DIO before buffered write fallback
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing
  ext4: use kiocb_modified instead of file_modified in DIO/DAX write
    path
  ext4: return -EAGAIN from ext4_map_blocks() in NOWAIT cache miss
  ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
  ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only
    lookup

 fs/ext4/file.c  | 148 +++++++++++++++++++++++++++++++++---------------
 fs/ext4/inode.c |  19 +++++--
 2 files changed, 118 insertions(+), 49 deletions(-)

-- 
2.43.7

^ permalink raw reply

* [PATCH v2 1/8] ext4: prevent sleeping allocation in NOWAIT write path
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang, Sashiko
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Block allocation requires journal access which may sleep, violating
NOWAIT semantics. Return -EAGAIN early when IOMAP_NOWAIT is set,
allowing the caller to retry without the NOWAIT constraint.

This ensures that write paths using IOMAP_NOWAIT (e.g., DIO with
RWF_NOWAIT) will not block on journal operations when blocks need
to be allocated.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260611163441.2431805-1-libaokun@linux.alibaba.com?part=1
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..832794294ccf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3672,6 +3672,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	int ret, dio_credits, m_flags = 0, retries = 0;
 	bool force_commit = false;
 
+	if (flags & IOMAP_NOWAIT)
+		return -EAGAIN;
+
 	/*
 	 * Trim the mapping request to the maximum value that we can map at
 	 * once for direct I/O.
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 5/8] ext4: use kiocb_modified instead of file_modified in DIO/DAX write path
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

file_modified() passes flags=0 which drops IOCB_NOWAIT, causing
file_update_time() to sleep in ext4_journal_start() via
ext4_dirty_inode() even in non-blocking contexts.

kiocb_modified(iocb) propagates iocb->ki_flags so that
generic_update_time() correctly returns -EAGAIN when IOCB_NOWAIT
is set and ->dirty_inode could block, matching the behavior
already adopted by XFS, FUSE, and ext2.

Affected paths:
- ext4_dio_write_checks(): DIO NOWAIT write
- ext4_write_checks(): shared by buffered (rejects NOWAIT upfront)
  and DAX write (supports NOWAIT)

ext4_fallocate() in extents.c is not affected as it has no kiocb.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2681f148e7b8..5ffc1afd8050 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -307,7 +307,7 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (count <= 0)
 		return count;
 
-	ret = file_modified(iocb->ki_filp);
+	ret = kiocb_modified(iocb);
 	if (ret)
 		return ret;
 
@@ -465,7 +465,7 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  *
  * The decision is layered, evaluated in this order:
  *
- * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ * 1. If kiocb_modified() needs to update security info (!IS_NOSEC), upgrade
  *    to the exclusive lock -- the security update itself requires it,
  *    regardless of whether the write extends the file or is aligned.
  *
@@ -555,7 +555,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 		*dio_flags = IOMAP_DIO_FORCE_WAIT;
 	}
 
-	ret = file_modified(file);
+	ret = kiocb_modified(iocb);
 	if (ret < 0)
 		goto out;
 
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 3/8] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.

After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.

Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:

 - truncate/punch_hole are kept out: they require exclusive i_rwsem
   (blocked by our shared lock during allocation), and inode_dio_begin()
   keeps their inode_dio_wait() blocked until in-flight bios complete.
 - i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
   extent tree modifications (parallel writers to the same hole).
 - The journal handle is per-thread and does not require i_rwsem
   exclusion.
 - i_disksize and orphan list are not involved in non-extending writes.

Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      42,322       43,329      1.02x
     2      68,516       70,677      1.03x
     4      62,489       97,072      1.55x
     8      58,701      110,819      1.89x
    16      58,569      116,392      1.99x
    32      60,860      117,244      1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9f9bc0b13772..886b73247aab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -434,16 +434,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * condition requires an exclusive inode lock. If yes, then we restart the
  * whole operation by releasing the shared lock and acquiring exclusive lock.
  *
- * - For unaligned_io we never take shared lock as it may cause data corruption
- *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
  *
- * - For extending writes case we don't take the shared lock, since it requires
- *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ *    to the exclusive lock -- the security update itself requires it,
+ *    regardless of whether the write extends the file or is aligned.
  *
- * - shared locking will only be true mostly with overwrites, including
- *   initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ *    lock to safely update i_disksize and the orphan list, regardless of
+ *    alignment.
  *
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ *    sufficient regardless of extent state (written, unwritten, or hole).
+ *    truncate/punch_hole cannot run while we hold the shared i_rwsem
+ *    (they need it exclusively); after we release it, inode_dio_begin()
+ *    keeps their inode_dio_wait() blocked until in-flight bios complete.
+ *    i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ *    only safe for pure written-extent overwrites. Unwritten extents or
+ *    holes require exclusive lock because concurrent partial block zeroing
+ *    in the DIO layer could corrupt data.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -454,7 +465,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite, unaligned_io, unwritten;
+	bool overwrite = true, unaligned_io, unwritten = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -466,22 +477,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 
 	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
-	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
 
 	/*
-	 * Determine whether we need to upgrade to an exclusive lock. This is
-	 * required to change security info in file_modified(), for extending
-	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
-	 * extents (as partial block zeroing may be required).
-	 *
-	 * Note that unaligned writes are allowed under shared lock so long as
-	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
-	 * data corruption due to partial block zeroing in the dio layer, and so
-	 * the I/O must occur exclusively.
+	 * For unaligned writes we need to know the extent state to determine
+	 * whether shared lock is safe. For aligned writes we skip this check
+	 * entirely since allocation under shared lock is safe.
 	 */
+	if (unaligned_io)
+		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
-	     (unaligned_io && unwritten)))) {
+	    ((!IS_NOSEC(inode) || *extend ||
+	     (unaligned_io && (!overwrite || unwritten))))) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -496,8 +504,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
 	 * behavior already. The inode lock is already held exclusive if the
-	 * write is non-overwrite or extending, so drain all outstanding dio and
-	 * set the force wait dio flag.
+	 * write is unaligned non-overwrite or extending, so drain all
+	 * outstanding dio and set the force wait dio flag.
 	 */
 	if (!*ilock_shared && (unaligned_io || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.43.7


^ permalink raw reply related

* [PATCH v2 2/8] ext4: drain in-flight DIO before buffered write fallback
From: Baokun Li @ 2026-06-18 12:57 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <20260618125735.4156639-1-libaokun@linux.alibaba.com>

generic/746 started failing intermittently on ext3 (no-extent inodes).
The test triggers 'Page cache invalidation failure on direct I/O'
warnings and subsequent fsync returns -EIO. Adding a 50ms delay
between ext4_buffered_write_iter() and filemap_write_and_wait_range()
in ext4_dio_write_iter() makes the race almost always reproducible.

On no-extent inodes, DIO writes to holes cannot use unwritten extents,
so ext4_iomap_alloc() leaves m_flags=0 and ext4_map_blocks() returns 0.
The iomap layer then returns -ENOTBLK, causing fallback to buffered I/O.

The fallback path in ext4_dio_write_iter() calls
ext4_buffered_write_iter() which dirties pages, then does flush and
invalidate. However, there's an unprotected window between
ext4_buffered_write_iter() returning (with inode lock released) and
the subsequent flush+invalidate.

Concurrent async DIO completions from other threads can run
kiocb_invalidate_post_direct_write() during this window. If pages have
been re-dirtied, post-invalidation finds dirty pages and triggers the
warning, setting -EIO in the error sequence.

Consider a file with two 4k extents: [hole][written]. Thread A does
DIO to the written extent, while thread B does DIO spanning both:

  kworker A (4k DIO, allocated block)    kworker B (8k DIO, fallback)
  -----------------------------------    ----------------------------
  inode_lock_shared()                    inode_lock_shared()
  iomap_dio_rw():                        iomap_dio_rw():
    kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
    submit_bio (async)                     dio->size = 0
  inode_unlock_shared()                  inode_unlock_shared()

  [bio pending in block layer]           /* fallback: lock released */
                                         ext4_buffered_write_iter()
                                           inode_lock(exclusive)
                                           generic_perform_write()
                                             -> dirty pages [0, 8k]
                                           inode_unlock(exclusive)

                                         /* pages dirty, no lock */
  [bio completes]                        filemap_write_and_wait_range()
  iomap_dio_complete()                     -> flush dirty pages
    kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
      invalidate_inode_pages2_range()
      -> finds dirty page!
      -> dio_warn_stale_pagecache()
      -> errseq_set(-EIO)

This issue can be triggered through normal I/O paths, not just
intentionally overlapping DIO writes from userspace. For example,
generic/746 uses a loop device where multiple kworkers issue concurrent
I/O to the backing file. Additionally, when block_size < folio_size,
non-overlapping DIO writes that share a large folio can also trigger
the race.

Add inode_dio_wait() in ext4_buffered_write_iter() before
generic_perform_write() to drain all in-flight DIO. This ensures
that all DIO clears existing pages before submitting IO (via
kiocb_invalidate_pages()), and all BIO waits for all DIO to
complete (via inode_dio_wait()), thus eliminating the race.

Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..9f9bc0b13772 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -313,6 +313,12 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (ret <= 0)
 		goto out;

+	/*
+	 * Prevent concurrent DIO and BIO to the same file range.
+	 * Wait for all in-flight DIO to complete before dirtying pages.
+	 */
+	inode_dio_wait(inode);
+
 	ret = generic_perform_write(iocb, from);

 out:
-- 
2.43.7

^ permalink raw reply related

* Re: [syzbot] [overlayfs?] possible deadlock in ovl_copy_up_start (5)
From: Amir Goldstein @ 2026-06-18 10:47 UTC (permalink / raw)
  To: syzbot; +Cc: linux-kernel, linux-unionfs, miklos, syzkaller-bugs, Ext4
In-Reply-To: <6a330d0e.6d5abbec.a50f.0020.GAE@google.com>

On Wed, Jun 17, 2026 at 11:09 PM syzbot
<syzbot+d4e3aadc5bd19f7e71ff@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    596d152bc5e3 Merge branch 'for-next/core' into for-kernelci
> git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
> console output: https://syzkaller.appspot.com/x/log.txt?x=10aba8ae580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=eb40ba923a822433
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4e3aadc5bd19f7e71ff
> compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> userspace arch: arm64
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/32d15acc8c01/disk-596d152b.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/83b3d8d84761/vmlinux-596d152b.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/8edfcc3bf911/Image-596d152b.gz.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+d4e3aadc5bd19f7e71ff@syzkaller.appspotmail.com
>
> ======================================================
> WARNING: possible circular locking dependency detected
> syzkaller #0 Tainted: G             L
> ------------------------------------------------------
> syz.3.611/8517 is trying to acquire lock:
> ffff0000eee140f8 (&ovl_i_lock_key[depth]){+.+.}-{4:4}, at: ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
> ffff0000eee140f8 (&ovl_i_lock_key[depth]){+.+.}-{4:4}, at: ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>
> but task is already holding lock:
> ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
> ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: lock_two_nondirectories+0xe8/0x148 fs/inode.c:1256
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}:
>        __lock_release kernel/locking/lockdep.c:5574 [inline]
>        lock_release+0x178/0x3b0 kernel/locking/lockdep.c:5889
>        up_write+0x3c/0x5d8 kernel/locking/rwsem.c:1681
>        inode_unlock include/linux/fs.h:1039 [inline]
>        unlock_two_nondirectories+0x60/0x118 fs/inode.c:1269
>        ext4_move_extents+0x468/0x3580 fs/ext4/move_extent.c:656
>        __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
>        ext4_ioctl+0x2a14/0x4234 fs/ext4/ioctl.c:1922
>        vfs_ioctl fs/ioctl.c:51 [inline]
>        __do_sys_ioctl fs/ioctl.c:597 [inline]
>        __se_sys_ioctl fs/ioctl.c:583 [inline]
>        __arm64_sys_ioctl+0x14c/0x1c4 fs/ioctl.c:583
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> -> #1 (sb_writers#3){.+.+}-{0:0}:
>        percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
>        percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
>        __sb_start_write include/linux/fs/super.h:19 [inline]
>        sb_start_write include/linux/fs/super.h:125 [inline]
>        ovl_start_write+0xf0/0x324 fs/overlayfs/util.c:32
>        ovl_do_copy_up fs/overlayfs/copy_up.c:977 [inline]
>        ovl_copy_up_one fs/overlayfs/copy_up.c:1189 [inline]
>        ovl_copy_up_flags+0x980/0x28a4 fs/overlayfs/copy_up.c:1243
>        ovl_maybe_copy_up+0x108/0x148 fs/overlayfs/copy_up.c:1272
>        ovl_open+0x12c/0x2c0 fs/overlayfs/file.c:211
>        do_dentry_open+0x5c4/0xfc8 fs/open.c:947
>        vfs_open+0x44/0x2d4 fs/open.c:1079
>        do_open fs/namei.c:4699 [inline]
>        path_openat+0x2234/0x2a6c fs/namei.c:4858
>        do_file_open+0x1c4/0x2e4 fs/namei.c:4887
>        do_sys_openat2+0x114/0x1e8 fs/open.c:1364
>        do_sys_open+0xac/0xdc fs/open.c:1370
>        __do_sys_openat fs/open.c:1386 [inline]
>        __se_sys_openat fs/open.c:1381 [inline]
>        __arm64_sys_openat+0x9c/0xb8 fs/open.c:1381
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> -> #0 (&ovl_i_lock_key[depth]){+.+.}-{4:4}:
>        check_prev_add kernel/locking/lockdep.c:3165 [inline]
>        check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>        validate_chain kernel/locking/lockdep.c:3908 [inline]
>        __lock_acquire+0x1780/0x2f44 kernel/locking/lockdep.c:5237
>        lock_acquire+0x140/0x368 kernel/locking/lockdep.c:5868
>        __mutex_lock_common kernel/locking/mutex.c:646 [inline]
>        __mutex_lock+0x160/0xef8 kernel/locking/mutex.c:820
>        mutex_lock_interruptible_nested+0x24/0x30 kernel/locking/mutex.c:898
>        ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
>        ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>        ovl_copy_up_one fs/overlayfs/copy_up.c:1182 [inline]
>        ovl_copy_up_flags+0x768/0x28a4 fs/overlayfs/copy_up.c:1243
>        ovl_copy_up+0x24/0x34 fs/overlayfs/copy_up.c:1282
>        ovl_rename_start fs/overlayfs/dir.c:1176 [inline]
>        ovl_rename+0x2d8/0xfec fs/overlayfs/dir.c:1363
>        vfs_rename+0xa78/0xd48 fs/namei.c:6064
>        filename_renameat2+0x66c/0x730 fs/namei.c:6182
>        __do_sys_renameat2 fs/namei.c:6211 [inline]
>        __se_sys_renameat2 fs/namei.c:6206 [inline]
>        __arm64_sys_renameat2+0xe4/0x114 fs/namei.c:6206
>        __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>        invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>        el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>        el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>        el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>        el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
> other info that might help us debug this:
>
> Chain exists of:
>   &ovl_i_lock_key[depth] --> sb_writers#3 --> &ovl_i_mutex_key[depth]/4
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&ovl_i_mutex_key[depth]/4);
>                                lock(sb_writers#3);
>                                lock(&ovl_i_mutex_key[depth]/4);
>   lock(&ovl_i_lock_key[depth]);
>
>  *** DEADLOCK ***
>
> 5 locks held by syz.3.611/8517:
>  #0: ffff00010bddc410 (sb_writers#17){.+.+}-{0:0}, at: mnt_want_write+0x44/0x9c fs/namespace.c:493
>  #1: ffff00010bddc718 (&type->s_vfs_rename_key#3){+.+.}-{4:4}, at: lock_rename fs/namei.c:3791 [inline]
>  #1: ffff00010bddc718 (&type->s_vfs_rename_key#3){+.+.}-{4:4}, at: __start_renaming+0xec/0x33c fs/namei.c:3880
>  #2: ffff0000eee14300 (&ovl_i_mutex_dir_key[depth]/1){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #2: ffff0000eee14300 (&ovl_i_mutex_dir_key[depth]/1){+.+.}-{4:4}, at: lock_two_directories+0x19c/0x214 fs/namei.c:3767
>  #3: ffff0000eee13810 (&ovl_i_mutex_dir_key[depth]/5){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #3: ffff0000eee13810 (&ovl_i_mutex_dir_key[depth]/5){+.+.}-{4:4}, at: lock_two_directories+0x1c4/0x214 fs/namei.c:3768
>  #4: ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: inode_lock_nested include/linux/fs.h:1074 [inline]
>  #4: ffff0000eee13d88 (&ovl_i_mutex_key[depth]/4){+.+.}-{4:4}, at: lock_two_nondirectories+0xe8/0x148 fs/inode.c:1256
>
> stack backtrace:
> CPU: 1 UID: 0 PID: 8517 Comm: syz.3.611 Tainted: G             L      syzkaller #0 PREEMPT
> Tainted: [L]=SOFTLOCKUP
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
> Call trace:
>  show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:499 (C)
>  __dump_stack+0x30/0x40 lib/dump_stack.c:94
>  dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
>  dump_stack+0x1c/0x28 lib/dump_stack.c:129
>  print_circular_bug+0x328/0x330 kernel/locking/lockdep.c:2043
>  check_noncircular+0x158/0x174 kernel/locking/lockdep.c:2175
>  check_prev_add kernel/locking/lockdep.c:3165 [inline]
>  check_prevs_add kernel/locking/lockdep.c:3284 [inline]
>  validate_chain kernel/locking/lockdep.c:3908 [inline]
>  __lock_acquire+0x1780/0x2f44 kernel/locking/lockdep.c:5237
>  lock_acquire+0x140/0x368 kernel/locking/lockdep.c:5868
>  __mutex_lock_common kernel/locking/mutex.c:646 [inline]
>  __mutex_lock+0x160/0xef8 kernel/locking/mutex.c:820
>  mutex_lock_interruptible_nested+0x24/0x30 kernel/locking/mutex.c:898
>  ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:705 [inline]
>  ovl_copy_up_start+0x58/0x264 fs/overlayfs/util.c:735
>  ovl_copy_up_one fs/overlayfs/copy_up.c:1182 [inline]
>  ovl_copy_up_flags+0x768/0x28a4 fs/overlayfs/copy_up.c:1243
>  ovl_copy_up+0x24/0x34 fs/overlayfs/copy_up.c:1282
>  ovl_rename_start fs/overlayfs/dir.c:1176 [inline]
>  ovl_rename+0x2d8/0xfec fs/overlayfs/dir.c:1363
>  vfs_rename+0xa78/0xd48 fs/namei.c:6064
>  filename_renameat2+0x66c/0x730 fs/namei.c:6182
>  __do_sys_renameat2 fs/namei.c:6211 [inline]
>  __se_sys_renameat2 fs/namei.c:6206 [inline]
>  __arm64_sys_renameat2+0xe4/0x114 fs/namei.c:6206
>  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
>  invoke_syscall+0x98/0x244 arch/arm64/kernel/syscall.c:49
>  el0_svc_common+0xe8/0x23c arch/arm64/kernel/syscall.c:121
>  do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:140
>  el0_svc+0x64/0x260 arch/arm64/kernel/entry-common.c:736
>  el0t_64_sync_handler+0x48/0x148 arch/arm64/kernel/entry-common.c:755
>  el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:594
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
>
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
>
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report

#syz dup: possible deadlock in lock_two_nondirectories (2)

ext4 bug should be fixed in linux-next I think.

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-18  9:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zhang Yi, linux-ext4, tytso, adilger.kernel, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <wbgohsiks4355iejpa2xhvjdgyd4hfpg2gjcpgrv2wcbnevwao@mhik7uhkzzhx>

On 2026/6/17 19:08, Jan Kara wrote:
> On Wed 17-06-26 15:52:24, Baokun Li wrote:
>> On 2026/6/17 10:45, Zhang Yi wrote:
>>> On 6/16/2026 9:10 PM, Baokun Li wrote:
>>>> Thank you for your review!
>>>>
>>>> After extensive testing, I found that after merging this patch,
>>>> generic/746
>>>> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The
>>>> test
>>>> triggers a "Page cache invalidation failure on direct I/O" warning, and
>>>> subsequent fsync returns -EIO.
>>>>
>>>> The underlying race existed before this patch, but this patch appears to
>>>> have widened the reproduction window considerably, so I thought it worth
>>>> trying to address.  Here is my analysis:
>>>>
>>>> On no-extent inodes, DIO writes that hit holes cannot use unwritten
>>>> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
>>>> returns 0 for a hole, and:
>>>>
>>>>          if (!m_flags && !ret)
>>>>                  ret = -ENOTBLK;
>>>>
>>>> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
>>>> I/O.  The fallback path dirties pages in the page cache, then flushes
>>>> and invalidates them.  However, concurrent async DIO completions to
>>>> other blocks on the same inode can run
>>>> kiocb_invalidate_post_direct_write()
>>>> without holding the inode lock.
>>>>
>>>> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
>>>> to the written extent, while thread B does DIO spanning both extents:
>>>>
>>>>    kworker A (4k DIO, allocated block)    kworker B (8k DIO,
>>>> hole->fallback)
>>>>    -----------------------------------   
>>>> -----------------------------------
>>>>    inode_lock_shared()                    inode_lock_shared()
>>>>    iomap_dio_rw():                        iomap_dio_rw():
>>>>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>>>>      submit_bio (async)                     dio->size = 0
>>>>    inode_unlock_shared()                  inode_unlock_shared()
>>>>
>>>>    [bio pending in block layer]           /* fallback: inode lock
>>>> released */
>>>>                                           ext4_buffered_write_iter()
>>>>                                             inode_lock(exclusive)
>>>>                                             generic_perform_write()
>>>>                                               -> dirty pages [0, 8k]
>>>>                                             inode_unlock(exclusive)
>>>>
>>>>                                           /* pages still dirty here */
>>>>    [bio completes]                        filemap_write_and_wait_range()
>>>>    iomap_dio_complete()                     -> flush dirty pages
>>>>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>>>>        invalidate_inode_pages2_range()
>>>>        -> finds dirty page!               /* window closed */
>>>>        -> dio_warn_stale_pagecache()
>>>>        -> errseq_set(-EIO)
>>>>
>>> It looks like this issue occurs when invalidate_inode_pages2_range()
>>> checks beyond the DIO write range, which may only happen when folio size
>>> is larger than block size. Is that correct?
>> Thanks for looking at this!
>>
>> Not quite — the scenario involves an 8k file with layout
>>
>>  [hole at 0-4k] [written extent at 4k-8k]
>>
>> and two DIO threads. Thread A does a 4k DIO write at offset 4k; since
>> the target block is a written extent, no fallback occurs. Thread B
>> does an 8k DIO write at offset 0; since blocks 0-4k are a hole on an
>> indirect-block inode and ext3 does not support unwritten extents,
>> iomap returns -ENOTBLK and the entire 8k write falls back to buffered
>> I/O.
> Right, but for this to happen userspace had to submit two overlapping
> direct IO writes. This always had undefined behavior so some inconsistent
> content in the file is more or less acceptable. But as Zhang pointed out,
> the same failure can also appear when block_size < folio_size and there we
> should really strive to provide consistent data.
Agreed, overlapping DIO is UB from userspace's perspective. However,
neither the loop device usage in generic/746 nor the BS < PS scenario
that Yi mentioned involves userspace intentionally submitting
overlapping DIO writes. Both are triggered through normal I/O paths,
so the issue still needs to be addressed.
>
>>>> The critical window is the gap between ext4_buffered_write_iter()
>>>> dirtying
>>>> pages and filemap_write_and_wait_range() flushing them.  In this
>>>> window the
>>>> inode lock is not held, so another thread's async DIO completion is
>>>> free to
>>>> invalidate the still-dirty pages in the page cache.
>>>>
>>>> This race has always existed on ext3 because indirect-block inodes lack
>>>> unwritten-extent support.  However, the window was extremely narrow in
>>>> practice, because the old ext4_overwrite_io() checked every block and
>>>> would conservatively take an exclusive lock.  This patch replaced it
>>>> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
>>>> making unaligned DIO more likely to take a shared lock and
>>>> proportionally increasing the chance of hitting the race.
>>>>
>>>> I tried a couple of alternatives before settling on the patch below:
>>>>
>>>> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
>>>>     This closes the window for new DIO submissions, but does not protect
>>>>     against bio completions from previously submitted async DIO, which
>>>>     run independently of the inode lock.
>>>>
>>>> 2. Wrap the fallback dirty+flush+invalidate sequence in
>>>>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
>>>>     do not use this lock, so it would not serialise against DIO
>>>>     completions.
>>>>
>>> Could we add a call to inode_dio_wait() before falling back to buffered
>>> I/O? That is, in thread B, when falling back to buffered I/O, could we
>>> acquire the exclusive inode lock and then call inode_dio_wait() to wait
>>> for in-flight DIO to complete? This should close the race window. Since
>>> scenarios where DIO writes to holes on ext3 are relatively rare, the
>>> performance impact should be minimal (I suppose).
>>>
>> That's a great idea, thank you!
>>
>> I had been trying to fix this on the DIO side and didn't consider
>> waiting from the buffered fallback path.
>>
>> I've tested the approach locally and it closes the race; I'll add a
>> patch using it in the next version.
> Yes, this looks like the best solution so far. The fallback doesn't have to
> be fast. It was always - you are doing something stupid and we try to fixup
> for you - kind of thing and bad performance is acceptable in that case.
Agreed.
>>>> One straightforward approach that seems correct is to skip direct I/O
>>>> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
>>>>   {
>>>>          if (fsverity_active(inode))
>>>>                  return 0;
>>>> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>>>> +               return 0;
>>>>          if (ext4_should_journal_data(inode))
>>>>                  return 0;
>>>>          if (ext4_has_inline_data(inode))
>>>>
>>>> With this, ext4_should_use_dio() returns false for no-extent inodes, and
>>>> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
>>>> DIO path entirely.  On ext3, DIO to a hole already falls back to
>>>> buffered
>>>> I/O, so there is essentially no performance benefit to using DIO in the
>>>> first place.
>>>>
>>>> Note that with this change, the fallback branch in
>>>> ext4_dio_write_iter():
>>>>
>>>>          if (ret >= 0 && iov_iter_count(from)) {
>>>>                  /* buffered fallback */
>>>>          }
>>>>
>>>> would also become dead code for extent-based inodes (since unwritten
>>>> extents guarantee iomap_dio_rw() never returns zero with unconsumed
>>>> data), and could be removed in a follow-up cleanup.
>>>>
>>>> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
>>>> I'm missing?
>>>>
>>> Hmm, this would also cause DIO to fall back to buffered I/O in common
>>> extending write cases, which I think would be unacceptable.
>> Fair point, the regression on extending writes is hard to justify.  That
>> said, until we had a better fix, I'd argue a behavioural change was
>> still preferable to potential data corruption. With the inode_dio_wait()
>> approach above, this trade-off goes away. 
> But heavily regressing performance for overwrites or extending DIO writes
> even on indirect block based files is not really acceptable. There are
> still users who for whatever reasons stay with old filesystems having
> indirect block based files and they'd likely notice the regression.
>
> 								Honza

Agreed. Significantly degrading performance for common workloads
(overwrites, extending writes) to handle a rare race condition is
unacceptable for production systems. The current fix with
inode_dio_wait() only impacts the fallback path (DIO → buffered I/O),
which is already an exceptional and slow path, so the performance
impact there is acceptable.


Thanks,
Baokun


^ permalink raw reply

* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Jan Kara @ 2026-06-18  7:21 UTC (permalink / raw)
  To: Zhou, Yun
  Cc: Jan Kara, tytso, adilger.kernel, libaokun, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <ee05e46d-6ec8-4291-a61e-213812934da3@windriver.com>

On Thu 18-06-26 09:02:10, Zhou, Yun wrote:
> 
> 
> On 6/18/26 04:25, Jan Kara wrote:
> > On Mon 15-06-26 13:35:12, Yun Zhou wrote:
> > 
> > --- a/fs/ext4/xattr.c
> > +++ b/fs/ext4/xattr.c
> > @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
> >                inode_unlock(inode);
> >        }
> > 
> > +     /*
> > +      * Since this function resolves references from active xattr entries,
> > +      * the EA inode must be in active state (i_nlink=1, ref_count>0).
> > +      * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> > +      * (inconsistent with an active entry) all indicate corruption or
> > +      * a concurrent last-reference drop.
> > +      */
> > +     if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> > +             ext4_error(parent->i_sb,
> > +                        "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> > +                        ea_ino, inode->i_nlink,
> > +                        ext4_xattr_inode_get_ref(inode));
> > Hum, given motivation of this is syzbot corrupted fs image, I'd just put
> > check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
> > done with it. Much simpler and catches at least the obvious cases. The
> > consistency of xattr refcount and i_nlink is otherwise guarded by
> > ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
> > we don't have the full view of the filesystem and so cannot ascertain that
> > xattr ref count matches reality...
> > 
> Thanks very much for your review and suggestions, The issue should be easily
> fixed after "introduce ext4_put_ea_inode() for safe deferred iput" merged.
> Just need to defer iput() and no longer to make inode bad.

OK, just to make it clear, calling make_bad_inode() is never a right thing
to do once the inode has been exposed to other users. It is only usable
within iget functions when loading inode into memory and finding out the
content isn't valid. This function is one of the traps we have in VFS...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Zhang Yi @ 2026-06-18  1:43 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai
In-Reply-To: <xt3zkmgktl7wpbwt4de76wh4q576fkmytw6udeojmj4goi6ul6@np4zgi2zinkg>

On 6/18/2026 4:29 AM, Jan Kara wrote:
> On Tue 16-06-26 20:37:00, Zhang Yi wrote:
>> On 6/16/2026 6:04 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:26, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> The iomap buffered write path does not hold any locks between querying
>>>> inode extent mapping information and performing buffered writes. It
>>>> relies on the sequence counter saved in the inode to detect stale
>>>> mappings.
>>>
>>> Now that I'm looking at it again, I've got a bit confused here. Buffered
>>> write path is holding i_rwsem between mapping blocks and using them so
>>> there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
>>> then ext4_da_map_blocks() should not ever get called in the writeback path
>>> because it is allocating delayed blocks... So this change looks unnecessary
>>> to me now. Am I missing something?
>>>
>>> 								Honza
>>>
>>
>> Hi Jan,
>>
>> Thanks for coming back to this series. Sorry for the inaccurate
>> description in the commit message. However, this change is still needed.
>>
>> As mentioned in the comment before the ->iomap_valid() callback in
>> iomap_write_begin(), consider the following scenario — a buffered write
>> to a dirty unwritten extent, with this concurrent race:
>>
>>   Buffer write (holds i_rwsem)    Writeback (no i_rwsem)
>>   ext4_da_map_blocks()
>>     // ext4_es_lookup_extent()
>>     // finds UNWRITTEN extent
>>   ext4_set_iomap()
>>     // type = IOMAP_UNWRITTEN
>>     // validity_cookie = m_seq
>>                                   ext4_iomap_writepages()
>>                                     // writeback for unwritten extent
>>                                     // ext4_convert_unwritten_extents()
>>                                     // extent tree: UNWRITTEN → WRITTEN
>>                                     // advancing i_es_seq
>>   __iomap_write_begin()
>>     // ext4_iomap_valid(): cookie != i_es_seq
>>     // iomap invalidated, re-maps
>>     // gets type = IOMAP_MAPPED (WRITTEN)
>>     // iomap_block_needs_zeroing(): FALSE
>>
>> Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
>> would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
>> out blocks that have already been written, corrupting data on partial
>> writes.
> 
> Ah, right, thanks for explanation. So please update the description to
> explicitely mention that iomap doesn't hold folio lock between mapping the
> extent and copying data and thus can race with writeback modifying the
> extent type. Thanks!
> 
> 								Honza

Sure, thanks for pointing this out.

Cheers,
Yi.

> 
>>
>> Thanks,
>> Yi
>>
>>>>
>>>> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
>>>> blocks") added the m_seq field to ext4_map_blocks to pass out extent
>>>> sequence numbers, but it missed two callsites within
>>>> ext4_da_map_blocks(). These callsites are on the delayed allocation
>>>> path, which is also used in the iomap buffered write path. Pass out the
>>>> sequence counter to ensure stale mappings can be detected.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>> ---
>>>>   fs/ext4/inode.c | 4 ++--
>>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 6c4d9137b279..39577a6b65b9 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>>   	ext4_check_map_extents_env(inode);
>>>>   	/* Lookup extent status tree firstly */
>>>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>>   		map->m_len = min_t(unsigned int, map->m_len,
>>>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>>> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>>   	 * is held in write mode, before inserting a new da entry in
>>>>   	 * the extent status tree.
>>>>   	 */
>>>> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>>   		map->m_len = min_t(unsigned int, map->m_len,
>>>>   				   es.es_len - (map->m_lblk - es.es_lblk));
>>>> -- 
>>>> 2.52.0
>>>>
>>


^ permalink raw reply

* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Zhou, Yun @ 2026-06-18  1:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	ebiggers, linux-ext4, linux-kernel
In-Reply-To: <xzd6pbxhxxwf3ngj5v5pz2ilhykh7vsegohe6iwfkfytfuuv4o@5obqeoc5wocr>



On 6/18/26 04:25, Jan Kara wrote:
> On Mon 15-06-26 13:35:12, Yun Zhou wrote:
>
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>                inode_unlock(inode);
>        }
>
> +     /*
> +      * Since this function resolves references from active xattr entries,
> +      * the EA inode must be in active state (i_nlink=1, ref_count>0).
> +      * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> +      * (inconsistent with an active entry) all indicate corruption or
> +      * a concurrent last-reference drop.
> +      */
> +     if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> +             ext4_error(parent->i_sb,
> +                        "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> +                        ea_ino, inode->i_nlink,
> +                        ext4_xattr_inode_get_ref(inode));
> Hum, given motivation of this is syzbot corrupted fs image, I'd just put
> check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
> done with it. Much simpler and catches at least the obvious cases. The
> consistency of xattr refcount and i_nlink is otherwise guarded by
> ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
> we don't have the full view of the filesystem and so cannot ascertain that
> xattr ref count matches reality...
>
Thanks very much for your review and suggestions, The issue should be easily
fixed after "introduce ext4_put_ea_inode() for safe deferred iput" 
merged. Just
need to defer iput() and no longer to make inode bad.

^ permalink raw reply

* Re: [PATCH v7 0/4] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Zhou, Yun @ 2026-06-18  0:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel
In-Reply-To: <o63nwdmeoqscllaitti32enjhet4fcvstv5eh4wooviwxmsosl@mooyvfa5dfqm>



On 6/18/26 02:13, Jan Kara wrote:
> CAUTION: This email comes from a non Wind River email account!
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> On Tue 16-06-26 23:15:54, Yun Zhou wrote:
>> This series fixes a circular lock dependency reported by syzbot:
>>
>>    s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
>>
>> The deadlock occurs when iput() on an EA inode triggers write_inode_now()
>> while xattr_sem and a jbd2 handle are held.  The triggering path is
>> during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
>> write_inode_now() synchronously.
>>
>> Patch 1 blocks the deadlock by skipping extra isize expansion when
>> !SB_ACTIVE -- this prevents the xattr manipulation path from being
>> entered during mount.
>>
>> Patch 2 is a belt-and-suspenders semantic improvement: an inode under
>> eviction never needs extra isize expansion.
>>
>> Patches 3-4 are a structural improvement using a per-sb workqueue:
>>
>>    Patch 3 introduces ext4_put_ea_inode(), which does direct iput() when
>>    SB_ACTIVE (zero overhead) and defers to a workqueue when !SB_ACTIVE.
>>    It also converts the first call site (ext4_xattr_block_set release
>>    path) which previously called iput under xattr_sem + jbd2 handle.
>>
>>    Patch 4 converts the remaining EA inode iput() calls that execute
>>    under locks.  Sites where direct iput() is provably safe (i_nlink=0
>>    after dec_ref, or lookup-only paths) are left unchanged with comments.
>>
>> Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
> Please don't send the series so quickly. I'd say twice per week is about
> maximum sensible cadence. It takes time (easily several days) for people to
> get to look at your patches and sending your patches sometimes even several
> times per day just creates a mess in the mailbox.

I admit I've been pushing patches too frequently, but I'll definitely tone
it down going forward. The reason is that I get feedback from the 'AI 
Review'
almost immediately after submitting. It flags a lot of edge cases—some I
introduced, others that were pre-existing, or even deadlock paths I missed
despite claiming they were fixed. Most of these are genuine issues. Even if
some are extremely strict, I just want to resolve them as quickly as 
possible.

I mistakenly thought that once the AI Reviewer raised concerns, human 
reviewers
wouldn't check it again. So I kept pushing new patches to fix those 
potential
issues, only for the bot to immediately flag new ones... It would be 
great if there
were a way to send patches directly to the AI bot for a separate review.
> Also in some previous version, I gave my Reviewed-by tag for patch 1 and
> some comment and Reviewed-by tag for patch 2. So please reflect that in the
> next posting.
Got it, I will add it in the next version.

Thanks,
Yun


^ permalink raw reply

* Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Jan Kara @ 2026-06-17 20:29 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, djwong,
	hch, yi.zhang, yangerkun, yukuai
In-Reply-To: <58ee3009-8a0e-4657-a473-475ea998316e@gmail.com>

On Tue 16-06-26 20:37:00, Zhang Yi wrote:
> On 6/16/2026 6:04 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:26, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > The iomap buffered write path does not hold any locks between querying
> > > inode extent mapping information and performing buffered writes. It
> > > relies on the sequence counter saved in the inode to detect stale
> > > mappings.
> > 
> > Now that I'm looking at it again, I've got a bit confused here. Buffered
> > write path is holding i_rwsem between mapping blocks and using them so
> > there shouldn't be races.  Perhaps you mean buffered *writeback* path? But
> > then ext4_da_map_blocks() should not ever get called in the writeback path
> > because it is allocating delayed blocks... So this change looks unnecessary
> > to me now. Am I missing something?
> > 
> > 								Honza
> > 
> 
> Hi Jan,
> 
> Thanks for coming back to this series. Sorry for the inaccurate
> description in the commit message. However, this change is still needed.
> 
> As mentioned in the comment before the ->iomap_valid() callback in
> iomap_write_begin(), consider the following scenario — a buffered write
> to a dirty unwritten extent, with this concurrent race:
> 
>   Buffer write (holds i_rwsem)    Writeback (no i_rwsem)
>   ext4_da_map_blocks()
>     // ext4_es_lookup_extent()
>     // finds UNWRITTEN extent
>   ext4_set_iomap()
>     // type = IOMAP_UNWRITTEN
>     // validity_cookie = m_seq
>                                   ext4_iomap_writepages()
>                                     // writeback for unwritten extent
>                                     // ext4_convert_unwritten_extents()
>                                     // extent tree: UNWRITTEN → WRITTEN
>                                     // advancing i_es_seq
>   __iomap_write_begin()
>     // ext4_iomap_valid(): cookie != i_es_seq
>     // iomap invalidated, re-maps
>     // gets type = IOMAP_MAPPED (WRITTEN)
>     // iomap_block_needs_zeroing(): FALSE
> 
> Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
> would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
> out blocks that have already been written, corrupting data on partial
> writes.

Ah, right, thanks for explanation. So please update the description to
explicitely mention that iomap doesn't hold folio lock between mapping the
extent and copying data and thus can race with writeback modifying the
extent type. Thanks!

								Honza

> 
> Thanks,
> Yi
> 
> > > 
> > > Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
> > > blocks") added the m_seq field to ext4_map_blocks to pass out extent
> > > sequence numbers, but it missed two callsites within
> > > ext4_da_map_blocks(). These callsites are on the delayed allocation
> > > path, which is also used in the iomap buffered write path. Pass out the
> > > sequence counter to ensure stale mappings can be detected.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> > >   fs/ext4/inode.c | 4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6c4d9137b279..39577a6b65b9 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> > >   	ext4_check_map_extents_env(inode);
> > >   	/* Lookup extent status tree firstly */
> > > -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > >   		map->m_len = min_t(unsigned int, map->m_len,
> > >   				   es.es_len - (map->m_lblk - es.es_lblk));
> > > @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
> > >   	 * is held in write mode, before inserting a new da entry in
> > >   	 * the extent status tree.
> > >   	 */
> > > -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > >   		map->m_len = min_t(unsigned int, map->m_len,
> > >   				   es.es_len - (map->m_lblk - es.es_lblk));
> > > -- 
> > > 2.52.0
> > > 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4] ext4: validate EA inode i_nlink in ext4_xattr_inode_iget
From: Jan Kara @ 2026-06-17 20:25 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, ebiggers, linux-ext4, linux-kernel
In-Reply-To: <20260615053512.1315992-1-yun.zhou@windriver.com>

On Mon 15-06-26 13:35:12, Yun Zhou wrote:
> Validate EA inode state in ext4_xattr_inode_iget() to prevent
> WARN_ONCE triggers in ext4_xattr_inode_update_ref() and reject
> corrupted EA inodes before they can cause further damage.
> 
> When a corrupted ext4 image has an EA inode with inconsistent i_nlink
> and ref_count values, the code currently allows it through and later
> hits WARN_ONCE when ref_count transitions cross the 0/1 boundary.
> While this is not a security or stability issue -- it only fires on
> crafted filesystem images and merely prints a call trace -- it is
> better handled as an early sanity check that returns -EFSCORRUPTED,
> consistent with how ext4 treats other on-disk corruption.
> 
> Since ext4_xattr_inode_iget() resolves references from active xattr
> entries, the target EA inode must be in active state (i_nlink=1,
> ref_count>0).  Reject any inode that does not satisfy this.
> 
> Reported-by: syzbot+76916a45d2294b551fd9@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
> Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
> ---
> v4:
>  - Take I_MUTEX_XATTR before checking orphan state to safely decide
>    whether to call make_bad_inode(), avoiding orphan list corruption
>    if another thread is concurrently freeing the EA inode
> 
> v3:
>  - Move check after Lustre branch to avoid false positives on Lustre EA inodes
>  - Merge into single condition: i_nlink != 1 || !ref_count
>  - Add make_bad_inode() before iput() to avoid truncation in active txn
> 
> v2:
>  - Add ref_count validation to also catch i_nlink=1, ref_count=0 case
> 
>  fs/ext4/xattr.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..8efd6368f956 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -464,6 +464,33 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
>  		inode_unlock(inode);
>  	}
>  
> +	/*
> +	 * Since this function resolves references from active xattr entries,
> +	 * the EA inode must be in active state (i_nlink=1, ref_count>0).
> +	 * i_nlink > 1, i_nlink == 0 (dangling reference), or ref_count == 0
> +	 * (inconsistent with an active entry) all indicate corruption or
> +	 * a concurrent last-reference drop.
> +	 */
> +	if (inode->i_nlink != 1 || !ext4_xattr_inode_get_ref(inode)) {
> +		ext4_error(parent->i_sb,
> +			   "EA inode %lu has unexpected i_nlink=%u ref_count=%llu",
> +			   ea_ino, inode->i_nlink,
> +			   ext4_xattr_inode_get_ref(inode));

Hum, given motivation of this is syzbot corrupted fs image, I'd just put
check in ext4_iget() verifying ext4_xattr_inode_get_ref() is > 0 and be
done with it. Much simpler and catches at least the obvious cases. The
consistency of xattr refcount and i_nlink is otherwise guarded by
ext4_xattr_inode_update_ref() and it can never be perfect as in the kernel
we don't have the full view of the filesystem and so cannot ascertain that
xattr ref count matches reality...

								Honza

> +		/*
> +		 * Mark rejected inode to prevent ext4_evict_inode() from
> +		 * attempting truncation on a corrupted inode within an active
> +		 * transaction, which could exhaust journal credits. The lock
> +		 * serializes against ext4_xattr_inode_update_ref() which
> +		 * does clear_nlink() + ext4_orphan_add() under the same lock.
> +		 */
> +		inode_lock_nested(inode, I_MUTEX_XATTR);
> +		if (!ext4_inode_orphan_tracked(inode))
> +			make_bad_inode(inode);
> +		inode_unlock(inode);
> +		iput(inode);
> +		return -EFSCORRUPTED;
> +	}
> +
>  	*ea_inode = inode;
>  	return 0;
>  }
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Zorro Lang @ 2026-06-17 19:33 UTC (permalink / raw)
  To: Disha Goel
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin, djwong
In-Reply-To: <20260608102328.40916-1-disgoel@linux.ibm.com>

On Mon, Jun 08, 2026 at 03:53:28PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
> on DAX files.
> 
> Add an ext4-specific check in _require_defrag() to skip tests when DAX
> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
> generic/018.
> 
> XFS defrag works with DAX, so this check is ext4-specific.
> 
> Suggested-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> Changes in v3:
> - Move the DAX check inside the ext4 case statement as
>   suggested by Darrick

Make sense to me,

Reviewed-by: Zorro Lang <zlang@kernel.org>

> 
>  common/defrag | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..baf05d94 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -13,6 +13,8 @@ _require_defrag()
>          DEFRAG_PROG="$XFS_FSR_PROG"
>  	;;
>      ext4)
> +        __scratch_uses_fsdax && _notrun "ext4 online defrag not supported with DAX"
> +
>  	testfile="$TEST_DIR/$$-test.defrag"
>  	donorfile="$TEST_DIR/$$-donor.defrag"
>  	bsize=`_get_block_size $TEST_DIR`
> -- 
> 2.45.1
> 
> 

^ permalink raw reply

* Re: [PATCH v7 3/4] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Jan Kara @ 2026-06-17 18:42 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, linux-ext4, linux-kernel
In-Reply-To: <20260616151558.1728881-4-yun.zhou@windriver.com>

On Tue 16-06-26 23:15:57, Yun Zhou wrote:
> Calling iput() on EA inodes while holding xattr_sem or a jbd2 handle
> can trigger write_inode_now() -> ext4_writepages() -> s_writepages_rwsem,
> creating a lock ordering issue during mount (!SB_ACTIVE).
> 
> Add ext4_put_ea_inode() which safely releases EA inode references:
> when SB_ACTIVE, it calls iput() directly (write_inode_now cannot be
> triggered); during mount (!SB_ACTIVE), it queues the inode on a per-sb
> lock-free llist and schedules a worker to call iput() in a clean
> context without holding any ext4 locks.
> 
> Convert the iput in ext4_xattr_block_set()'s "Drop the previous xattr
> block" path to use ext4_xattr_inode_array_free_deferred(), which
> releases EA inodes via ext4_put_ea_inode().  This path previously called
> ext4_xattr_inode_array_free() (synchronous iput) while holding xattr_sem
> and a jbd2 handle.
> 
> The worker is flushed in ext4_put_super() before journal destruction to
> ensure all pending EA inode cleanup completes while the journal is still
> available.
> 
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Yes, this goes in the direction I intended. But I have couple of
suggestions:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..690202303269 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1706,6 +1706,11 @@ struct ext4_sb_info {
>  	struct ext4_es_stats s_es_stats;
>  	struct mb_cache *s_ea_block_cache;
>  	struct mb_cache *s_ea_inode_cache;
> +
> +	/* Deferred iput for EA inodes to avoid lock ordering issues */
> +	struct llist_head s_ea_inode_to_free;
> +	struct work_struct s_ea_inode_work;
> +

I'd probably use delayed work and schedule it with a delay of one jiffie so
that some inodes can accumulate before we process them which should reduce
the amount of task switching to workqueues.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..b777bb0a81ea 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1308,6 +1308,9 @@ static void ext4_put_super(struct super_block *sb)
>  	destroy_workqueue(sbi->rsv_conversion_wq);
>  	ext4_release_orphan_info(sb);
>  
> +	/* Flush deferred EA inode iputs before destroying journal */
> +	flush_work(&sbi->s_ea_inode_work);
> +

This should happen earlier in ext4_put_super(). At this place quotas were
already turned off and so quota accounting would go wrong.

> @@ -3025,6 +3028,74 @@ void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
>  	kfree(ea_inode_array);
>  }
>  
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> +				struct ext4_xattr_inode_array *array)
> +{
> +	int idx;
> +
> +	if (array == NULL)
> +		return;
> +
> +	for (idx = 0; idx < array->count; ++idx)
> +		ext4_put_ea_inode(sb, array->inodes[idx]);
> +	kfree(array);
> +}

The array of EA inodes used in xattr handling is just another mechanism
used for delaying iput() of EA inodes. It doesn't make sense to stack these
to one on top of another. Just completely replace the array mechanism with
always deferring iput of EA inode into the workqueue.

> +struct ext4_ea_iput_entry {
> +	struct llist_node node;
> +	struct inode *inode;
> +};
> +
> +/*
> + * Worker function for deferred EA inode iput.  Processes all inodes queued
> + * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
> + */
> +void ext4_ea_inode_work(struct work_struct *work)
> +{
> +	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
> +						s_ea_inode_work);
> +	struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
> +	struct llist_node *next;
> +
> +	while (node) {
> +		struct ext4_ea_iput_entry *entry = container_of(node,
> +				struct ext4_ea_iput_entry, node);
> +		next = node->next;
> +		iput(entry->inode);
> +		kfree(entry);
> +		node = next;
> +	}
> +}

Allocating ext4_ea_iput_entry for dropping each inode is somewhat wasteful.
I want to suggest another scheme (somewhat more involved but more efficient
scheme):

1) Create a VFS helper bool iput_if_not_last(struct inode *inode) which
drops inode reference if it is not the last one (and returns true in that
case). Basically:

bool iput_if_not_last(struct inode *inode)
{
	return atomic_add_unless(&inode->i_count, -1, 1);
}

This needs to be a separate patch as it should get vetting from VFS
maintainers.

2) Use iput_if_not_last() in ext4_put_ea_inode(). If it returns true, we
are done. Otherwise we know we were at least for a moment holders of the
last inode reference, so we link the inode to the list of inodes to drop
through llist_node embedded in ext4_inode_info. We cannot race with anybody
else trying to link the same inode into the list because we hold one inode
ref and so nobody else can hit this "I was holding the last ref" path.
I'd union this llist_node say with xattr_sem which is unused for EA inodes
to avoid growing ext4_inode_info.

This way we avoid offloading unless really necessary and we don't have to
do allocations just to drop EA inode ref.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v7 0/4] ext4: fix xattr iput deadlock with s_writepages_rwsem
From: Jan Kara @ 2026-06-17 18:13 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, linux-ext4, linux-kernel
In-Reply-To: <20260616151558.1728881-1-yun.zhou@windriver.com>

On Tue 16-06-26 23:15:54, Yun Zhou wrote:
> This series fixes a circular lock dependency reported by syzbot:
> 
>   s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
> 
> The deadlock occurs when iput() on an EA inode triggers write_inode_now()
> while xattr_sem and a jbd2 handle are held.  The triggering path is
> during mount-time orphan cleanup (!SB_ACTIVE) where iput_final() calls
> write_inode_now() synchronously.
> 
> Patch 1 blocks the deadlock by skipping extra isize expansion when
> !SB_ACTIVE -- this prevents the xattr manipulation path from being
> entered during mount.
> 
> Patch 2 is a belt-and-suspenders semantic improvement: an inode under
> eviction never needs extra isize expansion.
> 
> Patches 3-4 are a structural improvement using a per-sb workqueue:
> 
>   Patch 3 introduces ext4_put_ea_inode(), which does direct iput() when
>   SB_ACTIVE (zero overhead) and defers to a workqueue when !SB_ACTIVE.
>   It also converts the first call site (ext4_xattr_block_set release
>   path) which previously called iput under xattr_sem + jbd2 handle.
> 
>   Patch 4 converts the remaining EA inode iput() calls that execute
>   under locks.  Sites where direct iput() is provably safe (i_nlink=0
>   after dec_ref, or lookup-only paths) are left unchanged with comments.
> 
> Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5

Please don't send the series so quickly. I'd say twice per week is about
maximum sensible cadence. It takes time (easily several days) for people to
get to look at your patches and sending your patches sometimes even several
times per day just creates a mess in the mailbox.

Also in some previous version, I gave my Reviewed-by tag for patch 1 and
some comment and Reviewed-by tag for patch 2. So please reflect that in the
next posting.

								Honza

> v7:
>  - Replaced the deferred-iput array threading approach (v4-v6) with a
>    simpler per-sb workqueue + lock-free llist design.  No function
>    signature changes needed.  ext4_put_ea_inode() does direct iput when
>    SB_ACTIVE (zero overhead in normal operation) and defers to the
>    workqueue only during mount (!SB_ACTIVE).
>  - Converted the iput in ext4_xattr_delete_inode()'s quota accounting
>    loop to ext4_put_ea_inode() to eliminate a lockdep-reportable lock
>    ordering violation (jbd2_handle -> iput -> s_writepages_rwsem).
>  - Moved flush_work() before the if (sbi->s_journal) check in
>    ext4_put_super() to cover nojournal mode.
> 
> v6:
>  - ext4_inline_data_truncate(): use local ea_inode_array instead of
>    passing NULL, freed after ext4_journal_stop().  Fixes a deadlock
>    reachable via crafted filesystem where inline data xattr entry has
>    e_value_inum set: orphan cleanup -> ext4_truncate ->
>    ext4_inline_data_truncate -> iput under !SB_ACTIVE.
> 
> v5:
>  - Split into 3 patches for easier review.
>  - Add explicit !SB_ACTIVE early-return in ext4_try_to_expand_extra_isize()
>    to block ALL mount-time paths (ext4_process_orphan -> ext4_truncate ->
>    ext4_mark_inode_dirty), not just the eviction path. v4 only relied on
>    EXT4_STATE_NO_EXPAND which doesn't cover orphan truncation.
> 
> v4:
>  - Comprehensive rewrite of the deferred iput mechanism.
>  - Thread ea_inode_array through ext4_expand_extra_isize_ea() and
>    ext4_xattr_move_to_block() so ALL ea_inode iputs in the expand
>    path are deferred, not just those in ext4_xattr_block_set().
>  - Add NULL safety to ext4_expand_inode_array(): when ea_inode_array
>    pointer is NULL, fall back to synchronous iput (for callers like
>    ext4_initxattrs that only run with SB_ACTIVE).
>  - Use __GFP_NOFAIL to guarantee deferred array growth, eliminating
>    fallback to synchronous iput under locks.
>  - Update ext4_xattr_ibody_set() and ext4_xattr_set_entry() signatures
>    to accept ea_inode_array, converting ALL iput(ea_inode) calls.
>  - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() before
>    ext4_mark_inode_dirty().
> 
> v3:
>  - Check ext4_expand_inode_array() return value; fallback to
>    direct iput() on ENOMEM to prevent inode leak.
>  - Make ext4_xattr_set_handle() take an optional ea_inode_array
>    output parameter so callers can free after ext4_journal_stop(),
>    avoiding the jbd2_handle vs s_writepages_rwsem AB-BA.
>  - Pass ea_inode_array directly to ext4_xattr_release_block()
>    instead of using a local array freed under xattr_sem.
>  - Move ext4_xattr_inode_array_free() after ext4_journal_stop()
> 
> v2:
>  - Defer iput() in ext4_xattr_block_set() via ea_inode_array,
>    freed after xattr_sem is released. Fixes the root cause.
> 
> v1:
>  - Set EXT4_STATE_NO_EXPAND in ext4_evict_inode() to skip expand
>    on inodes being deleted. Only fixes the syzbot reproducer, not
>    the underlying lock ordering violation.
> 
> Yun Zhou (4):
>   ext4: skip extra isize expansion during mount to prevent deadlock
>   ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
>   ext4: introduce ext4_put_ea_inode() for safe deferred iput
>   ext4: convert remaining EA inode iput() calls to ext4_put_ea_inode()
> 
>  fs/ext4/ext4.h  |   5 +++
>  fs/ext4/inode.c |  11 +++++
>  fs/ext4/super.c |   6 +++
>  fs/ext4/xattr.c | 105 +++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/xattr.h |   2 +
>  5 files changed, 120 insertions(+), 9 deletions(-)
> 
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Zhang Yi @ 2026-06-17 13:22 UTC (permalink / raw)
  To: Brian Foster, Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, ojaswin, ritesh.list, djwong, hch,
	yi.zhang, yangerkun, yukuai
In-Reply-To: <ajJ9b91fTbM6qApg@bfoster>

On 6/17/2026 6:56 PM, Brian Foster wrote:
> On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
>> On 6/16/2026 8:28 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:34, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
>>>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
>>>> infrastructure for ext4.
>>>>
>>>> ext4_iomap_block_zero_range() calls iomap_zero_range() with
>>>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
>>>> out either a mapped partial block or a dirty, unwritten partial block.
>>>>
>>>> Important constraints:
>>>>
>>>> Zeroing out under an active journal handle can cause deadlock, because
>>>> the order of acquiring the folio lock and starting a handle is
>>>> inconsistent with the iomap writeback path.
>>>>
>>>> Therefore, ext4_iomap_block_zero_range():
>>>> - Must NOT be called under an active handle.
>>>> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>>>>    before updating i_disksize (for the cases of post-EOF append write,
>>>>    post-EOF fallocate, and truncate up). In subsequent patches, we will
>>>>    address this by synchronizing commit I/O but doesn't waiting for
>>>>    completion, and updating i_disksize to i_size only after the zeroed
>>>>    data has been written back.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 92 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index c6fe42d012fc..e0dae2501292 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static int ext4_iomap_zero_begin(struct inode *inode,
>>>> +		loff_t offset, loff_t length, unsigned int flags,
>>>> +		struct iomap *iomap, struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>
>>> This looks like a layering violation to me. I don't think you can safely
>>> assume the iomap you're passed is a part of iomap_iter...
>>>
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	unsigned int iomap_flags = 0;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
>>>> +		return -EINVAL;
>>>> +
>>>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>>>> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
>>>> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
>>>> +
>>>> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
>>>> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (start >> blkbits) - map.m_lblk;
>>>> +	}
>>>
>>> ... and you need access to iter only for this which seems to be really a
>>> hack that's trying to outsmart the iomap code. I have to admit I don't
>>> fully understand what you are trying to achieve here. Are you trying to
>>> avoid flushing of the range that will be zeroed out?
>>
>> This logic is copied from the XFS and iomap infrastructure. Its primary
>> purpose is to optimize the zeroing operations on dirty written extents.
>> It was introduced by Brian in [1].
>>
>> The history as I understand it: originally, the iomap infrastructure
>> could not zero dirty unwritten extents during zero range processing,
>> which led to stale data exposure. XFS had to flush dirty ranges itself
>> before zeroing — a workaround that was not generic.
>>
>> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
>> extents"), Brian added an unconditional flush in the iomap
>> infrastructure, ensuring that by the time zeroing runs the extent has
>> already been converted to written so the zero can proceed correctly.
>> However, this flush was too heavy and introduced noticeable performance
>> overhead.
>>
>> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
>> conditional on unwritten mappings"), which restricts flushing to only
>> dirty pagecache over unwritten or hole mappings.
>>
>> Brian later proposed a different approach: rather than relying on flush
>> to convert the extent type, find dirty folios ahead of the zero range
>> and zero the dirty unwritten extents directly. In [1] he added this
>> lookup logic. The filesystem now supplies a folio batch (a collection of
>> dirty folios) via the iomap begin callback, and zero range iterates over
>> these dirty folios to perform zeroing. Clean regions not covered by the
>> batch are simply skipped. This entirely eliminates the need to flush.
>>
>> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
>>
>> If I understand correctly, the current approach is a compromise, and
>> Brian is still working on this. Perhaps ext4 and XFS could work together
>> on improvements in the future?
>>
> 
> I think that about covers it!
> 
> I do agree wrt to the iomap_iter thing in that it doesn't seem like the
> most elegant thing. I considered that a bit of a roadblock when first
> hacking on the batch stuff, but IIRC somebody pointed out that there was
> precedent already so I didn't think too hard about it after that. Indeed
> if you poke around, other filesystems use a similar pattern to access
> iter->private for whatever private context is carried around.
> 
> FWIW, one of the longer term thoughts for the dirty folio stuff was to
> eventually lift it out of the callback and just have iomap do it for the
> fs. That would eliminate this particular pattern and probably clean
> things up a bit, but there were also some other caveats with that that
> aren't top of mind atm (IIRC, things like dealing with map trimming,
> etc., but I haven't had a chance to think about it in a while).

Thanks for the additional information. This looks like a promising
direction. The usage on the ext4 side is relatively simple at the
moment, so the main concern is whether this would cause any issues for
XFS and other filesystems. This can be investigated later when time
permits — it's not urgent at this point.

> 
> Also note that this isn't necessarily a hard requirement. It's an
> optional optimization. iomap will flush and retry in the dirty
> pagecache+unwritten extent case if the fs hasn't otherwised provided
> folios to make sure it zeroes properly, it's just that performance of
> that may or may not be acceptable for your use case.
> 
> Brian

Yes, we will try to adopt the folio batch approach as much as possible,
hoping to reduce unnecessary performance overhead through it.

Thanks,
Yi.

> 
>>>
>>>> +	ret = iomap_zero_range(inode, from, length, did_zero,
>>>> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
>>>> +			       NULL);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * TODO: The iomap does not distinguish between different types of
>>>> +	 * zeroing and always sets zero_written if a zeroing operation is
>>>> +	 * performed, which may result in unnecessary order operations.
>>>> +	 */
>>>
>>> Is this still true after your fix to did_zero handling?
>>
>> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
>> operation has occurred through did_zero parameter, but it cannot
>> distinguish whether the zeroed range is a written extent that already
>> exists on disk. That is, even if the zeroing is performed on a delalloc
>> extent, did_zero will still return true.
>>
>> Thanks,
>> Yi.
>>
>>>
>>>> +	if (did_zero && zero_written)
>>>> +		*zero_written = *did_zero;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * Zeros out a mapping of length 'length' starting from file offset
>>>>    * 'from'.  The range to be zero'd must be contained with in one block.
>>>
>>> 								Honza
>>
> 


^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Zhang Yi @ 2026-06-17 13:00 UTC (permalink / raw)
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai, Brian Foster
In-Reply-To: <n3bxam2j3iwdo3u6od5esn4pjqrtgxbdzdyemzpufpor4u2lbh@itrun3kmz52e>

On 6/17/2026 6:50 PM, Jan Kara wrote:
> On Wed 17-06-26 16:14:40, Zhang Yi wrote:
>> On 6/16/2026 8:28 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:34, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
>>>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
>>>> infrastructure for ext4.
>>>>
>>>> ext4_iomap_block_zero_range() calls iomap_zero_range() with
>>>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
>>>> out either a mapped partial block or a dirty, unwritten partial block.
>>>>
>>>> Important constraints:
>>>>
>>>> Zeroing out under an active journal handle can cause deadlock, because
>>>> the order of acquiring the folio lock and starting a handle is
>>>> inconsistent with the iomap writeback path.
>>>>
>>>> Therefore, ext4_iomap_block_zero_range():
>>>> - Must NOT be called under an active handle.
>>>> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>>>>    before updating i_disksize (for the cases of post-EOF append write,
>>>>    post-EOF fallocate, and truncate up). In subsequent patches, we will
>>>>    address this by synchronizing commit I/O but doesn't waiting for
>>>>    completion, and updating i_disksize to i_size only after the zeroed
>>>>    data has been written back.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 92 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index c6fe42d012fc..e0dae2501292 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>   	return 0;
>>>>   }
>>>>   
>>>> +static int ext4_iomap_zero_begin(struct inode *inode,
>>>> +		loff_t offset, loff_t length, unsigned int flags,
>>>> +		struct iomap *iomap, struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>
>>> This looks like a layering violation to me. I don't think you can safely
>>> assume the iomap you're passed is a part of iomap_iter...
>>>
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	unsigned int iomap_flags = 0;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
>>>> +		return -EINVAL;
>>>> +
>>>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>>>> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
>>>> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
>>>> +
>>>> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
>>>> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (start >> blkbits) - map.m_lblk;
>>>> +	}
>>>
>>> ... and you need access to iter only for this which seems to be really a
>>> hack that's trying to outsmart the iomap code. I have to admit I don't
>>> fully understand what you are trying to achieve here. Are you trying to
>>> avoid flushing of the range that will be zeroed out?
>>
>> This logic is copied from the XFS and iomap infrastructure. Its primary
>> purpose is to optimize the zeroing operations on dirty written extents.
>> It was introduced by Brian in [1].
> 
> Ah, I see. I still find it hacky but apparently it is an established hack
> in iomap :). Fair.
> 
>> The history as I understand it: originally, the iomap infrastructure
>> could not zero dirty unwritten extents during zero range processing,
>> which led to stale data exposure. XFS had to flush dirty ranges itself
>> before zeroing — a workaround that was not generic.
>>
>> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
>> extents"), Brian added an unconditional flush in the iomap
>> infrastructure, ensuring that by the time zeroing runs the extent has
>> already been converted to written so the zero can proceed correctly.
>> However, this flush was too heavy and introduced noticeable performance
>> overhead.
>>
>> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
>> conditional on unwritten mappings"), which restricts flushing to only
>> dirty pagecache over unwritten or hole mappings.
>>
>> Brian later proposed a different approach: rather than relying on flush
>> to convert the extent type, find dirty folios ahead of the zero range
>> and zero the dirty unwritten extents directly. In [1] he added this
>> lookup logic. The filesystem now supplies a folio batch (a collection of
>> dirty folios) via the iomap begin callback, and zero range iterates over
>> these dirty folios to perform zeroing. Clean regions not covered by the
>> batch are simply skipped. This entirely eliminates the need to flush.
>>
>> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
> 
> Thanks for the summary! So I was confused because somehow I thought this is
> about fallocate(FALLOC_FL_ZERO_RANGE) and so I was wondering why we just
> cannot evict the page cache and be done with that. Only after reading
> everything again I've realized this is about zeroing partial blocks on hole
> punch etc. And we may need to really handle multiple folios because XFS
> also uses this mechanism to implement FALLOC_FL_ZERO_RANGE for zoned
> storage. Ugh. OK, anyway for now this looks like your patch is following
> how things are expected to be done so feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
>>>> +	/*
>>>> +	 * TODO: The iomap does not distinguish between different types of
>>>> +	 * zeroing and always sets zero_written if a zeroing operation is
>>>> +	 * performed, which may result in unnecessary order operations.
>>>> +	 */
>>>
>>> Is this still true after your fix to did_zero handling?
>>
>> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
>> operation has occurred through did_zero parameter, but it cannot
>> distinguish whether the zeroed range is a written extent that already
>> exists on disk. That is, even if the zeroing is performed on a delalloc
>> extent, did_zero will still return true.
> 
> So maybe write in the comment explicitely, that this may result in
> unnecessary flushing of folios if zeroing happened in
> delayed-not-yet-allocated blocks?
> 
> 								Honza

Sure, I'll include it in the next iteration. :)

Thanks,
Yi.




^ permalink raw reply

* [PATCH v7 11/11] fstests: test UUID consistency for clones with metadata_uuid
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Btrfs and xfs uses the metadata_uuid superblock feature to change the
on-disk UUID without rewriting every block header. This patch adds a
sanity check to ensure UUID consistency when a filesystem with
metadata_uuid enabled is cloned.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/806     | 74 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/806.out |  6 ++++
 2 files changed, 80 insertions(+)
 create mode 100644 tests/generic/806
 create mode 100644 tests/generic/806.out

diff --git a/tests/generic/806 b/tests/generic/806
new file mode 100644
index 000000000000..6d3166491006
--- /dev/null
+++ b/tests/generic/806
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 806
+#
+# Verify that the cloned filesystem UUID remains consistent, even when the
+# `metadata_uuid` feature is enabled.
+#
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Create base loop device and its clone, applying the metadata_uuid tuning
+# callback to the base filesystem before the copy occurs.
+devs=()
+_loop_image_create_clone devs _change_metadata_uuid
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Get the uuid from the source device
+fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+# Mount both clone and baseline
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+# Cycle mounts and reverse the initialization order to ensure UUID tracking
+# doesn't mismatch or flip when metadata_uuid optimization is active.
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+status=0
+exit
diff --git a/tests/generic/806.out b/tests/generic/806.out
new file mode 100644
index 000000000000..918f422ecddf
--- /dev/null
+++ b/tests/generic/806.out
@@ -0,0 +1,6 @@
+QA output created by 806
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
+**** mount cycle ****
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
-- 
2.43.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox